Machine Learning Is Replacing Manual Sleep Scoring


If you’ve ever had a sleep study, you’ve probably never thought about what happens to the data afterward. You sleep (or try to sleep) while wired up to a dozen sensors, and eventually your doctor tells you whether you have sleep apnea, periodic limb movements, or some other condition.

What you don’t see is a sleep technologist sitting in front of a screen, manually reviewing your study in 30-second increments, classifying each segment as Wake, N1, N2, N3, or REM sleep. A typical eight-hour polysomnography recording contains roughly 960 of these 30-second epochs. Scoring one study takes a trained technologist between 90 minutes and three hours.
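The arithmetic behind those figures is easy to check. A minimal sketch, using only the numbers quoted above:

```python
# Epoch arithmetic for a standard overnight polysomnography recording.
RECORDING_HOURS = 8
EPOCH_SECONDS = 30

epochs = RECORDING_HOURS * 3600 // EPOCH_SECONDS
print(epochs)  # 960 thirty-second epochs to classify by hand

# At the fast end of the 90-minute-to-3-hour range, that works out
# to under six seconds of scorer attention per epoch.
seconds_per_epoch = 90 * 60 / epochs
print(round(seconds_per_epoch, 2))  # 5.62
```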

That process hasn’t fundamentally changed since Rechtschaffen and Kales published their scoring manual in 1968 — the AASM revised the rules in 2007, but the epoch-by-epoch manual workflow remains. Nearly six decades of largely manual labor. But machine learning is finally changing the math.

The Inter-Scorer Problem

Here’s the dirty secret of sleep scoring: human scorers don’t agree with each other as much as you’d hope.

Inter-rater reliability for sleep staging hovers around 80-85% agreement. That means two fully trained technologists looking at the same 30-second epoch will disagree roughly 15-20% of the time. The disagreements cluster predictably — the transitions between N1 and Wake are notoriously ambiguous, and distinguishing N1 from N2 can be genuinely difficult even for experienced scorers.

This variability has real clinical consequences. A patient’s sleep architecture can look meaningfully different depending on which technologist scores their study. Their percentage of deep sleep (N3) might shift by 5-10%, which can change diagnostic impressions and treatment decisions.

We’ve essentially been running clinical medicine on a measurement system with a 15-20% error margin, and we’ve just accepted it because there was no better alternative.
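The agreement statistics behind these claims are straightforward to compute. Here is a minimal pure-Python sketch of epoch-level percent agreement and Cohen’s kappa (the chance-corrected version usually reported in inter-scorer studies) between two scorers; the toy label sequences are invented for illustration:

```python
from collections import Counter

STAGES = ["Wake", "N1", "N2", "N3", "REM"]

def percent_agreement(a, b):
    """Fraction of epochs on which two scorers assign the same stage."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both scorers labeled epochs independently
    # with their observed stage frequencies.
    p_e = sum(ca[s] * cb[s] for s in STAGES) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two scorers disagree on 2 of 10 epochs,
# both disagreements involving N1 -- the classic trouble spot.
scorer_1 = ["Wake", "N1", "N2", "N2", "N3", "N3", "REM", "N2", "N1", "Wake"]
scorer_2 = ["Wake", "N2", "N2", "N2", "N3", "N3", "REM", "N2", "Wake", "Wake"]

print(percent_agreement(scorer_1, scorer_2))  # 0.8
```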

What ML Models Can Do Now

Modern deep learning models — primarily convolutional neural networks and transformer architectures trained on large polysomnography datasets — now achieve agreement with expert consensus scores at rates comparable to or exceeding individual human scorers.

A few milestones worth noting:

The Stanford group’s work demonstrated that neural networks trained on thousands of sleep studies could score individual epochs with accuracy rivaling senior technologists. Their model’s agreement with expert consensus was 87%, which is within the range of human inter-scorer agreement.

More recent models have pushed into the 89-91% range when trained on larger, more diverse datasets. That’s not just matching humans — it’s exceeding what most individual scorers achieve.

And they do it in seconds, not hours. A study that takes a technologist two hours to score takes an ML model under a minute.

Where It Gets Complicated

Speed and accuracy sound great, but adoption in clinical sleep labs has been slower than the technology would suggest. Several factors explain this:

Regulatory pathways. The FDA has cleared some automated scoring systems, but the regulatory framework for AI in clinical diagnostics is still maturing. Labs want clarity on liability. If an AI system mis-scores a study and patient care suffers as a result, who’s responsible?

Trust and verification. Most sleep physicians want to review AI-scored studies rather than accept them blindly. That’s appropriate. But it means the workflow becomes “AI scores, human reviews” rather than “AI replaces human.” The time savings are real but not as dramatic as full automation would suggest.

Edge cases. ML models struggle with the same epochs humans struggle with, plus some that humans handle intuitively. Patients with significant artifact, unusual EEG patterns from medications or neurological conditions, or pediatric studies can trip up algorithms trained primarily on straightforward adult cases.

Firms specializing in this space are working on building models that handle these edge cases more gracefully, but it’s genuinely difficult territory.

The Workforce Angle

There’s a sleep technologist shortage in most countries. Training takes 1-2 years, the pay isn’t exceptional, and the work involves overnight shifts. Many sleep labs operate with skeleton crews.

Automated scoring doesn’t eliminate the need for sleep technologists — they still perform the studies, manage equipment, handle patient interactions, and troubleshoot technical issues during recording. But it does reduce the bottleneck of post-study scoring, which is often what limits a lab’s throughput.

A lab that can score studies in minutes rather than hours can turn around results faster, see more patients, and reduce the waiting lists that plague sleep medicine globally.

Looking Forward

The endgame here probably isn’t human versus machine. It’s human plus machine. AI handles the straightforward scoring — which represents 80-90% of epochs in a typical study — and flags the ambiguous segments for human review. The technologist becomes an editor rather than a writer, focusing their expertise where it matters most.
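That triage step is simple to picture in code. A minimal sketch, assuming the model exposes a per-epoch probability distribution over stages; the 0.80 confidence threshold is an invented example, not a clinical or regulatory standard:

```python
def triage(epoch_probs, threshold=0.80):
    """Split model output into auto-accepted stages and epochs flagged for review.

    epoch_probs: list of dicts mapping stage name -> predicted probability.
    Returns (accepted, flagged): accepted maps epoch index -> stage;
    flagged lists epoch indices a technologist should re-score.
    """
    accepted, flagged = {}, []
    for i, probs in enumerate(epoch_probs):
        stage, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            accepted[i] = stage
        else:
            flagged.append(i)
    return accepted, flagged

# Toy model output for three epochs: two confident calls,
# one ambiguous N1/N2 epoch that goes to the technologist.
probs = [
    {"Wake": 0.97, "N1": 0.02, "N2": 0.01, "N3": 0.00, "REM": 0.00},
    {"Wake": 0.05, "N1": 0.44, "N2": 0.46, "N3": 0.03, "REM": 0.02},
    {"Wake": 0.01, "N1": 0.03, "N2": 0.90, "N3": 0.05, "REM": 0.01},
]
accepted, flagged = triage(probs)
print(accepted)  # {0: 'Wake', 2: 'N2'}
print(flagged)   # [1]
```

If the 80-90% figure above holds, a threshold like this routes only a small minority of epochs to the human reviewer, which is where the workflow savings come from.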

That’s a better use of skilled professionals’ time, and it produces more consistent results across studies. The patient whose sleep architecture looked different depending on who happened to score their study gets a more reliable assessment.

The transition will be gradual. But the direction is clear. Manual sleep scoring, as the standard clinical workflow, is approaching its final years.