This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Big Picture: The "Sleep Detective" Problem
Imagine you are trying to teach a computer to be a Sleep Detective. Its job is to look at brain waves (like a seismograph for the brain) and tell you exactly when a mouse is awake, when it's in a deep sleep, and when it's dreaming (REM sleep).
For decades, scientists have been trying to build the perfect "Sleep Detective" robot. They've built four very smart robots (called AI models) that are supposed to do this job automatically. The hope was that these robots could replace human experts, saving time and making sure every lab around the world gets the same results.
But here is the problem: When these robots were sent to different laboratories to do their job, they failed miserably. A robot that was a genius in Lab A would become confused and make mistakes in Lab B.
This paper asks: Why are these smart robots failing, and how do we fix them?
Analogy 1: The "Accent" Problem (Signal Variability)
Imagine you are teaching a robot to understand English. You train it only on people with a British accent. The robot becomes a master at understanding British English.
Now, you send that robot to a room full of people with American, Australian, and Scottish accents. Even though they are all speaking English, the robot gets confused because the "sounds" (the brain waves) are slightly different.
In this study, the "accents" are the different ways labs record mouse brain waves.
- Lab 1 uses different electrodes than Lab 2.
- Lab 3 uses a different type of mouse than Lab 4.
- The hardware is different everywhere.
The researchers found that the robots were so specialized in the "accent" of the lab they were trained in that they couldn't understand the "accents" of other labs.
Analogy 2: The "Subjective Judge" Problem (Label Noise)
Here is the twist: It's not just the robot's fault. The human judges (the experts scoring the sleep) aren't agreeing with each other either!
The researchers gathered 10 expert sleep judges from 5 different labs. They gave them the exact same 9 mouse recordings and asked them to score them.
- The Result: The experts disagreed significantly, especially on REM sleep (dreaming).
- The Analogy: Imagine a movie critic panel. One expert says, "This scene is a Comedy," while another says, "No, it's a Drama." If you train a robot to learn from these critics, the robot gets confused. "Should I call this a Comedy or a Drama?"
The study found that even experts from the same lab didn't always agree on the same recording. This is called "Label Noise." The robots were trying to learn a rulebook that didn't actually exist because the humans couldn't agree on the rules.
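This kind of expert disagreement is usually measured with a chance-corrected agreement statistic such as Cohen's kappa. Here is a minimal sketch; the stage labels and example scores are made up for illustration and are not taken from the paper:

```python
from collections import Counter

def cohens_kappa(scorer_a, scorer_b):
    """Chance-corrected agreement between two scorers' epoch labels."""
    assert len(scorer_a) == len(scorer_b)
    n = len(scorer_a)
    # Raw fraction of epochs where the two judges agree.
    observed = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
    # Agreement expected by chance, from each scorer's label frequencies.
    freq_a = Counter(scorer_a)
    freq_b = Counter(scorer_b)
    expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two hypothetical experts scoring the same ten epochs:
expert1 = ["Wake", "Wake", "NREM", "NREM", "NREM", "REM", "REM", "NREM", "Wake", "NREM"]
expert2 = ["Wake", "Wake", "NREM", "NREM", "REM", "REM", "NREM", "NREM", "Wake", "NREM"]
print(round(cohens_kappa(expert1, expert2), 3))
```

A kappa of 1.0 means perfect agreement and 0 means no better than chance; the point of the paper's inter-rater comparison is that real expert pairs land well below 1.0, especially on REM.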
The Solution: The "Potluck" Strategy
The researchers realized that training a robot on data from just one lab was like teaching a chef to cook only one family's recipe. The chef can't cook for anyone else.
What did they do?
They created a "Potluck" dataset. They gathered sleep data from five different laboratories and mixed it all together. They then re-trained the four robots on this diverse, mixed-up data.
The Result:
- Before: The robots were like specialists who only spoke one dialect.
- After: The robots became polyglots. They learned to understand many different "accents" and different scoring styles.
- The Finding: The robots became much better at guessing the sleep stages in new, unseen labs.
Key Takeaway: What mattered wasn't how much data the robots got, but how diverse it was. A small amount of data from many different places beat a huge amount of data from just one place.
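The "potluck" test can be pictured as a leave-one-lab-out experiment: pool every other lab for training, and hold one lab back to play the role of the new, unseen lab. A minimal sketch, with lab names and record counts invented for illustration:

```python
# Hypothetical pool of recordings, each tagged with its originating lab.
recordings = [
    {"lab": "A", "id": 1}, {"lab": "A", "id": 2},
    {"lab": "B", "id": 3}, {"lab": "B", "id": 4},
    {"lab": "C", "id": 5},
]

def leave_one_lab_out(data, held_out_lab):
    """Train on the mixed 'potluck' of all other labs; test on the unseen one."""
    train = [r for r in data if r["lab"] != held_out_lab]
    test = [r for r in data if r["lab"] == held_out_lab]
    return train, test

# Rotate through every lab so each one gets a turn as the "new" lab.
for lab in sorted({r["lab"] for r in recordings}):
    train, test = leave_one_lab_out(recordings, lab)
    print(lab, len(train), len(test))
```

The model never sees the held-out lab's "accent" during training, so its score on that lab estimates how well it would travel to a genuinely new laboratory.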
The "Hypnodensity" (The Cloud of Uncertainty)
Usually, sleep scoring is black and white: "This minute is Wake," "This minute is Sleep."
But the researchers introduced a new way to look at the data called Hypnodensity.
- Analogy: Instead of saying "It is definitely raining," a Hypnodensity says, "There is a 70% chance of rain, a 20% chance of drizzle, and a 10% chance of sun."
This is helpful because sleep isn't always black and white. Sometimes a mouse is in a "twilight" state, transitioning between sleep and wakefulness. The AI models can show this "cloud of uncertainty," which is actually more accurate than forcing a single label.
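In code, a hypnodensity is simply a probability distribution over the stages for each epoch, instead of one hard label. A minimal sketch, assuming three stages (Wake, NREM, REM) with illustrative numbers not taken from the paper:

```python
# Each epoch gets a probability for every stage instead of one hard label.
hypnodensity = [
    {"Wake": 0.90, "NREM": 0.08, "REM": 0.02},  # clearly awake
    {"Wake": 0.45, "NREM": 0.50, "REM": 0.05},  # "twilight" transition epoch
    {"Wake": 0.05, "NREM": 0.20, "REM": 0.75},  # likely dreaming
]

def hard_labels(density):
    """Collapse the cloud of uncertainty into a classic one-label-per-epoch score."""
    return [max(epoch, key=epoch.get) for epoch in density]

def uncertain_epochs(density, threshold=0.6):
    """Flag epochs where no stage is confident enough to call outright."""
    return [i for i, epoch in enumerate(density) if max(epoch.values()) < threshold]

print(hard_labels(hypnodensity))
print(uncertain_epochs(hypnodensity))
```

Collapsing to hard labels reproduces the traditional scoring, while the flagged epochs are exactly the ambiguous "twilight" moments where human experts also tend to disagree.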
The Final Conclusion
The paper concludes with two main messages:
- Stop building new robots; fix the rules. The biggest problem isn't that our AI models aren't smart enough. The problem is that we humans can't agree on what "sleep" looks like. We need a standardized rulebook (like the one humans use for human sleep) that every mouse lab follows.
- Diversity is key. Until we have perfect rules, the best way to build a reliable sleep robot is to train it on data from many different labs, so it learns to handle the messiness of real life.
In short: We can't automate sleep scoring perfectly yet because humans can't agree on the basics. But if we mix our data from all over the world, we can build robots that are "good enough" to help us all move forward together.