AI Is Getting Better at Diagnosing Sleep Disorders
Scoring a polysomnogram is tedious work. A single overnight study generates roughly 1,000 pages of data across multiple channels — EEG, EOG, EMG, ECG, airflow, respiratory effort, SpO2, body position. A trained sleep technologist spends 60 to 90 minutes manually reviewing and annotating each study. They’re looking at 30-second epochs, classifying each one into wake, N1, N2, N3, or REM, and flagging respiratory events, limb movements, and arousals.
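The 30-second epoch is the basic unit of scoring, so before any classification (human or model) the continuous signal is cut into fixed windows. A minimal sketch of that segmentation step, assuming a hypothetical 256 Hz sampling rate (real studies vary by channel and lab):

```python
import numpy as np

FS = 256          # samples per second (assumed for illustration)
EPOCH_SEC = 30    # AASM scoring epoch length

def to_epochs(signal: np.ndarray) -> np.ndarray:
    """Reshape a 1-D channel into (n_epochs, samples_per_epoch),
    dropping any partial epoch at the end of the recording."""
    samples = FS * EPOCH_SEC
    n = len(signal) // samples
    return signal[: n * samples].reshape(n, samples)

# An 8-hour recording yields 960 scorable 30-second epochs.
eeg = np.zeros(8 * 3600 * FS)
print(to_epochs(eeg).shape)  # (960, 7680)
```

Each row of that array is one epoch to be labelled wake, N1, N2, N3, or REM.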
It’s demanding, repetitive, and subject to inter-scorer variability that has long been an acknowledged weakness in the field. Two experienced technologists scoring the same study will disagree on roughly 20% of epochs, according to AASM reliability data. That’s not a failure of training — it’s a reflection of genuine ambiguity in the data.
This is exactly the kind of task where machine learning excels. And the results are getting hard to ignore.
Where AI Stands Right Now
Several validated deep learning models can now score sleep stages with accuracy comparable to expert human scorers. The inter-rater agreement between AI and human scorers typically falls in the 80-85% range — which, notably, is about the same as the agreement between two human scorers.
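Agreement here is measured epoch by epoch, usually reported as raw per-cent agreement or as Cohen's kappa, which corrects for agreement expected by chance across the five stages. A small illustrative sketch with invented labels:

```python
from collections import Counter

STAGES = ["W", "N1", "N2", "N3", "REM"]

def agreement(a, b):
    """Fraction of epochs where the two scorings match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(a)
    po = agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[s] * cb[s] for s in STAGES) / (n * n)
    return (po - pe) / (1 - pe)

# Invented example: two scorings of the same eight epochs.
scorer_1 = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
scorer_2 = ["W", "N2", "N2", "N2", "N3", "REM", "N1", "W"]
print(round(agreement(scorer_1, scorer_2), 2))     # 0.75
print(round(cohens_kappa(scorer_1, scorer_2), 2))  # 0.67
```

The same arithmetic applies whether the second "scorer" is a human or a model, which is why AI-versus-human and human-versus-human agreement are directly comparable.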
The technology has moved beyond academic papers and into clinical tools. Companies have developed FDA-cleared AI-assisted scoring systems that are now being used in sleep labs. The models don’t just match human performance on straightforward epochs — they’re increasingly capable with the tricky ones, too: transitions between stages, mixed-frequency EEG patterns, and the subtle differences between N1 drowsiness and light N2 sleep.
For respiratory event detection, AI performance is arguably even stronger. Identifying apnoeas and hypopnoeas from airflow signals is a pattern recognition task with less subjective ambiguity than EEG-based sleep staging, and current models achieve sensitivity and specificity above 90% in most validation studies.
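Those validation figures come from simple confusion-matrix arithmetic: sensitivity is the fraction of reference-scored events the model caught, specificity the fraction of event-free segments it correctly left alone. A sketch with invented counts:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: detected events / all reference events."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: correctly clear segments / all clear segments."""
    return tn / (tn + fp)

# Hypothetical per-segment counts against a human-scored reference.
tp, fn, tn, fp = 93, 7, 460, 40
print(round(sensitivity(tp, fn), 2))  # 0.93
print(round(specificity(tn, fp), 2))  # 0.92
```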
What This Actually Means for Clinicians
Let’s be realistic about what AI is and isn’t doing in sleep diagnostics today.
It’s accelerating the workflow. An AI-scored study still gets reviewed by a human technologist and interpreted by a board-certified sleep physician. But instead of scoring from scratch, the tech is reviewing and adjusting AI-generated scores — which typically cuts the time per study by 40-50%. In labs processing dozens of studies per week, that’s significant.
It’s improving consistency. One of the frustrations in sleep medicine has always been variability — between scorers, between labs, between the same scorer on a Monday morning versus a Friday evening. AI doesn’t get tired or distracted. Its outputs are consistent, which provides a stable baseline that human reviewers can then refine.
It’s enabling scalable screening. This is where things get really interesting. As home sleep testing expands and wearable devices generate more clinically relevant data, the bottleneck isn’t data collection — it’s interpretation. AI models that can triage and pre-process large volumes of sleep data will be essential for managing the growing demand for sleep diagnostics.
There are teams working on custom AI development specifically for healthcare applications, building models that can integrate data from multiple device types and flag the studies most likely to show pathology. This kind of intelligent triage could help sleep clinics prioritise urgent cases and reduce wait times, which in countries like Australia can stretch to months.
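At its simplest, that kind of triage is a prioritised review queue: rank incoming studies by a model-estimated probability of pathology and surface the highest-risk ones first. A hypothetical sketch — the study IDs, scores, and urgency threshold are all invented:

```python
# Invented (study_id, model-estimated probability of pathology) pairs.
studies = [
    ("HST-0141", 0.12),
    ("HST-0142", 0.87),
    ("HST-0143", 0.55),
    ("HST-0144", 0.91),
]

URGENT = 0.8  # assumed clinic-chosen threshold for fast-tracking

# Highest-risk studies go to the front of the review queue.
queue = sorted(studies, key=lambda s: s[1], reverse=True)
urgent = [sid for sid, p in queue if p >= URGENT]
print(urgent)  # ['HST-0144', 'HST-0142']
```

The model doesn’t make the diagnosis here; it just decides which studies a clinician looks at first.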
The Limitations Worth Acknowledging
AI in sleep medicine isn’t a solved problem. Several important caveats:
Paediatric scoring remains difficult. Most AI sleep scoring models are trained predominantly on adult data. Children’s sleep architecture is fundamentally different — their EEG patterns, respiratory norms, and scoring rules are distinct. Models trained on adults perform poorly on paediatric studies, and paediatric-specific training datasets are small.
Complex comorbidities challenge the algorithms. A patient with Parkinson’s disease, restless legs syndrome, and moderate sleep apnoea produces polysomnographic data that’s genuinely hard to interpret, even for experienced clinicians. AI models tend to perform best on relatively straightforward presentations and struggle with these layered, complex cases.
The black box problem persists. Deep learning models can tell you that an epoch is N2 sleep, but they can’t explain why in terms a clinician can evaluate. This matters for quality assurance. When a human scorer makes a questionable call, you can discuss the reasoning. When an AI makes one, you’re trusting the model’s training without full interpretive transparency.
Looking Ahead
The trajectory is clear: AI will become a standard part of sleep medicine workflows within the next few years. Not as a replacement for clinical judgment, but as a tool that handles the mechanical aspects of data processing so that clinicians can focus on what they do best — contextualising findings within a patient’s history, making nuanced diagnostic decisions, and developing treatment plans.
The biggest potential impact may be in access. Sleep medicine already faces a shortage of qualified technologists and specialists in many regions. If AI can responsibly handle more of the scoring burden, it frees up limited human expertise for the cases that need it most.
We’re not at the point where an algorithm replaces a sleep specialist. But we’ve reached the point where ignoring AI’s role in sleep diagnostics means ignoring a tool that genuinely improves efficiency and consistency. For a field that’s been manually scoring 30-second epochs for decades, that’s welcome.