The Big Picture: The "EEG School" Problem
Imagine you are trying to teach a robot to understand human brain waves (EEG). To do this, you have to "train" the robot on a massive library of brain recordings.
For a long time, scientists have been training these robots using libraries that only contain recordings from Europe and North America. They assumed that if the robot learned enough from these specific libraries, it would be smart enough to understand anyone's brain, anywhere in the world.
The Problem: This is like teaching a chef to cook only Italian food using ingredients from a specific Italian grocery store. If you then ask that chef to cook a traditional Indian dish using local Indian spices, they might fail. They learned the "Italian way" of cooking, not the universal rules of flavor.
This paper introduces PRISM, a new way of training these brain-reading robots, to test whether changing who the training data comes from, and where, actually makes the robot smarter.
The Experiment: Two Different Schools
The researchers built two "schools" to train their AI model (PRISM). Both schools used the exact same curriculum (the same math and architecture), but they had different student bodies (data sources).
- The "Narrow" School (D1): This school only used data from the standard European and American archives (TUH and PhysioNet). It's like a school where every student wears the same uniform, speaks the same dialect, and uses the same type of microphone.
- The "Diverse" School (D2): This school included the same data as the Narrow School, PLUS thousands of new recordings from South Asian hospitals. These recordings came from different types of machines, different hospitals, and people with different genetic backgrounds and lifestyles.
The Three Big Discoveries
1. The "Test Score" Trap
When the researchers tested the robots on simple, standard tests (like recognizing sleep stages or motor movements), the Narrow School robot often got higher scores when the test data came from the same environment it was trained in.
- The Analogy: Imagine a student who memorized the exact answers to a practice test. If the real test looks exactly like the practice test, they ace it. But if the real test asks the same question in a different way, they might fail.
- The Finding: The Narrow robot was good at "memorizing" the specific patterns of Western data. However, when the researchers let the robot "fine-tune" (learn more deeply) for a new task, the Diverse School robot caught up and often became better. It had learned the underlying rules of brain waves, not just the specific "accent" of the Western recordings.
2. The "Hard Test": Epilepsy vs. Mimics
The researchers created a brand new, very difficult test: can the robot tell the difference between a person with epilepsy and a person with a seizure mimic (an episode that looks like a seizure but is caused by psychological stress or fainting, not epilepsy) just by looking at their brain waves between episodes?
This is a nightmare for human doctors. About 25% of people diagnosed with epilepsy are actually misdiagnosed.
- The Result: The Diverse School robot crushed this test. It was 12.3% more accurate than the Narrow School robot.
- The Analogy: The Narrow robot was like a detective who only knows how to spot a thief in a specific neighborhood. The Diverse robot was like a detective who has seen thieves in many different neighborhoods, wearing different clothes, using different tools. When the "thief" (the disease) showed up in a new context, the Diverse robot recognized the real pattern, not just the familiar surroundings.
3. The "Ruler" Problem (Why Comparisons are Broken)
The paper also found that the way scientists currently compare these AI models is broken. There are two major "scoreboards" (benchmarks) in the field: EEG-Bench and EEG-FM-Bench.
- The Problem: These two scoreboards use different rules. One might cut the brain waves into 3-second chunks, the other into 4-second chunks. One might pick the "best" version of the model, the other picks the "last" version.
- The Analogy: Imagine two sports leagues rating the same basketball player. League A calls anyone over 6'2" "Tall"; League B sets the cutoff at 6'6". The very same player is "Tall" in one league and "Short" in the other. Switch the rules, and the rankings flip completely!
- The Finding: The researchers showed that by changing just six small rules (like how you cut the data or how you normalize it), you could make a "bad" model look "good" and a "good" model look "bad." This means we can't trust current rankings until everyone agrees on the rules.
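To make the "different rules" point concrete, here is a minimal sketch of how just two evaluation choices change what a model is actually graded on. The sampling rate, window lengths, and normalization modes below are invented for illustration; they are not the actual EEG-Bench or EEG-FM-Bench settings.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 256                               # sampling rate in Hz (assumed)
signal = rng.standard_normal(60 * fs)  # 60 seconds of synthetic "EEG"

def make_windows(x, fs, win_sec):
    """Cut a 1-D signal into non-overlapping windows of win_sec seconds."""
    step = int(win_sec * fs)
    n = len(x) // step
    return x[: n * step].reshape(n, step)

def normalize(windows, mode):
    """Two normalization choices a benchmark might silently differ on."""
    if mode == "per_window":   # z-score each window independently
        mu = windows.mean(axis=1, keepdims=True)
        sd = windows.std(axis=1, keepdims=True)
    else:                      # "global": one mean/std for the whole recording
        mu, sd = windows.mean(), windows.std()
    return (windows - mu) / sd

# Recipe A: 3-second windows, per-window z-scoring
a = normalize(make_windows(signal, fs, 3), "per_window")
# Recipe B: 4-second windows, global normalization
b = normalize(make_windows(signal, fs, 4), "global")

print(a.shape)  # (20, 768)  -> 20 "test questions" per minute of data
print(b.shape)  # (15, 1024) -> 15 questions, each with different statistics
```

The same one-minute recording becomes a different number of differently scaled inputs under each recipe, so two benchmarks can hand the same model different scores without either one being "wrong."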
The "Scale vs. Diversity" Surprise
A common belief in AI is: "More data is always better."
The famous model REVE was trained on 92 different datasets (over 60,000 hours of data).
PRISM was trained on only 3 datasets (but one of them was very diverse).
- The Shock: PRISM (trained on the small, diverse library) performed just as well as, or better than, REVE (trained on the giant one) on most tasks.
- The Lesson: It's not about how many books you read; it's about what kind of books you read. Reading 92 books about the same topic doesn't make you smarter than reading 3 books that cover the whole world. Targeted diversity is more powerful than random scale.
Why This Matters for You
- Better Medical Care: If we only train AI on data from rich, Western countries, it will fail to diagnose patients in India, Africa, or South America. This paper proves that including diverse data makes the AI safer and more accurate for everyone.
- Stop the "Fake News" in Science: The paper calls for scientists to agree on a single, fair way to test these models. Otherwise, companies might claim their AI is the "best" just because they used a specific testing trick.
- Quality over Quantity: We don't need to collect millions of hours of identical data. We need to collect different kinds of data to build truly robust AI.
In a Nutshell
The authors built a brain-reading AI and proved that teaching it with a mix of people from all over the world makes it a better doctor than teaching it with only data from Europe and America. They also showed that the current way we grade these AI models is like using different rulers for different students. It's time to standardize the rules so we can find the truly smartest models.