Brain predictive models of cognition fail to generalize across ethnicities: Modality-dependent bias in MRI-based prediction

⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you are trying to teach a robot to recognize faces. You show it thousands of photos of people from one specific neighborhood. The robot gets really good at recognizing people from that neighborhood. But when you show it a photo of someone from a completely different neighborhood, it gets confused and makes mistakes.

This is exactly what happens with AI models that try to predict human intelligence or cognitive skills based on brain scans.

This paper is a big investigation into whether these "brain-reading" robots are fair to everyone, or if they are secretly biased against certain groups. The researchers used a massive database of brain scans from teenagers (the ABCD study) to test this.

Here is the breakdown of their findings using simple analogies:

1. The Problem: The "One-Size-Fits-All" Trap

The researchers found that most brain models are trained on data that is mostly White American.

The Analogy: Imagine a chef who only ever cooks with ingredients from a specific local market. They become a master chef for that market's food. But if you ask them to cook a dish using ingredients from a different continent, they struggle because they've never learned how those ingredients behave.
The Result: When these models were tested on African American participants, they performed worse. They were "blind" to the specific brain patterns of this group because they hadn't seen enough of them during training.

2. The Experiment: Four Different "Cooking Classes"

To fix this, the researchers tried four different ways to train their models (the "chefs"):

The "All" Class: Using all available data (mostly White). Result: Good for White people, bad for others.
The "White Only" Class: Using only White people, but matching the number of people to the African American group. Result: Still biased toward White people.
The "Black Only" Class: Using only African American people. Result: Good for African Americans, bad for White people.
The "Balanced" Class: Using an equal number of White and African American participants. Result: This was the winner. It didn't lose accuracy for White people, but it significantly improved fairness for African American people.

3. The Big Surprise: Not All Brain Scans Are Created Equal

The researchers looked at different types of brain scans (like looking at the brain's structure vs. how it works while doing a task). They found that some types of scans are naturally more fair than others.

Structural Scans (sMRI): These look at the shape and size of the brain (like measuring the height of a building).
- The Analogy: This is like trying to judge a book by its cover, but the cover design was invented by a specific culture. If the book is written in a different style, the cover doesn't match the template. These scans were the most biased.
Task-Based Scans (fMRI): These look at the brain while it is doing a job (like solving a puzzle).
- The Analogy: This is like watching someone actually cook the meal. It doesn't matter what the kitchen looks like; you can see how they handle the ingredients. These scans were the most fair.

Key Takeaway: If you want a fair prediction, don't just look at the brain's "blueprint" (structure); watch how the brain "works" (function).

4. The "More Data" Myth

A common belief is: "If we just add more data from underrepresented groups, the AI will magically become fair."

The Reality: The researchers found a "diminishing returns" effect.
The Analogy: Imagine you are trying to balance a scale. Adding African American participants helps balance the scale up to the point where you have 50% White and 50% Black. But if you keep adding more Black participants (oversampling) without adding White ones, the scale tips the other way, and the model gets worse at predicting for White people.
The Solution: The sweet spot is Balance. Once you have an equal mix, adding more of one group doesn't help; it might even hurt.

5. The "Super-Model" Didn't Fix It

The researchers tried combining all the different brain scans into one "Super Model" (Multimodal Stacking) to make it smarter.

The Result: The Super Model became more accurate overall, but it did not become fairer. It was still biased.
The Lesson: Just because a model is smarter or more complex doesn't mean it treats everyone equally. You have to design it specifically for fairness.

Summary: What Should We Do?

This paper is a wake-up call for the future of "Precision Medicine" (using AI to treat individuals).

Don't just dump all data together: If your data is 90% one group, your model will be 90% biased toward that group.
Balance your classes: The best way to fix bias isn't to invent complex algorithms; it's to simply ensure your training data has an equal number of people from different backgrounds.
Choose the right tool: Use brain scans that measure activity (what the brain is doing) rather than just structure (what the brain looks like) if you want fair results.

In short: To build a brain model that works for everyone, we can't just teach it with a few examples of everyone. We have to teach it with a balanced, equal classroom of students from all backgrounds, using the right kind of tests.

1. Problem Statement

Predictive neuroimaging models hold promise for precision medicine but risk exacerbating health inequities if they perform unevenly across ethnic and racial groups. While genomic studies have documented that models trained on European-ancestry cohorts fail to generalize to other populations, the scope and mechanisms of similar biases in neuroimaging remain poorly characterized.

Core Issue: Large-scale neuroimaging datasets (e.g., ABCD Study) are predominantly composed of White participants. Models trained on these imbalanced datasets may learn associations that are unstable or irrelevant for underrepresented groups (specifically African Americans).
Open Questions:
1. Does ethnic bias vary across different MRI modalities (structural, functional, diffusion)?
2. Do phenotypes with stronger brain-behavior associations generalize more equitably?
3. Can strategies like multimodal stacking or synthetic oversampling mitigate these disparities?

2. Methodology

Data Source & Participants

Dataset: Adolescent Brain Cognitive Development (ABCD) Study, Release 5.1.
Cohort: Focused on White American (WA) and African American (AA) participants.
Matching: Participants were drawn from the ABCD-BIDS Community Collection (ABCC-2.0), where groups were matched across 21 sites based on 21 variables including age, sex, socioeconomic status, and clinical history to ensure comparability.
Outcome Variable: NIH Toolbox total cognitive composite score (chosen for high reliability and stability).

Neuroimaging Phenotypes
The study benchmarked 91 distinct neuroimaging phenotypes:

80 Unimodal Phenotypes:
- Structural MRI (sMRI): Cortical thickness, surface area, volume, subcortical volumes, and tissue intensity (Destrieux atlas).
- Diffusion Tensor Imaging (DTI): Fractional anisotropy (FA) in 23 white matter tracts.
- Task-based fMRI: 56 contrast sets from N-back, Monetary Incentive Delay (MID), and Stop Signal Task (SST), parcellated using both Destrieux and Glasser atlases.
- Functional Connectivity (FC): Resting-state and task-based connectivity (Gordon and Glasser atlases).
11 Multimodal Phenotypes: Stacked models integrating predictions from the unimodal sets (e.g., "Stacked All," "Stacked Task Contrasts").

Training Strategies
Four distinct training strategies were compared to isolate the effect of sample composition:

All: Trained on the full matched dataset (majority White).
RandWA-only: Trained on a random subset of White participants, matched in size to the AA sample.
AA-only: Trained exclusively on African American participants.
Balanced (AA+RandWA): Trained on all AA participants plus an equal number of randomly selected White participants.

Modeling & Evaluation

Algorithm: Partial Least Squares (PLS) regression for unimodal models; Random Forest (late fusion) for multimodal stacked models.
Metric: Mean Absolute Error (MAE) was the primary metric for prediction error.
Bias Quantification: An Ethnicity Bias Index was calculated as the difference in MAE gaps between the RandWA-only and AA-only models. Lower absolute values indicate less bias.
Sampling Analysis: Incremental sampling was performed to test the effect of increasing AA representation from 0% to 75%, including synthetic oversampling beyond the 50% balance point.

3. Key Results

A. Modality-Dependent Bias

Structural MRI (sMRI) & DTI: Exhibited the highest bias. Models trained on White data performed significantly better on White test participants, and vice versa. Even balanced training did not fully eliminate the performance gap for AA participants in these modalities.
Task-based fMRI & Functional Connectivity: Exhibited lower bias. Specifically, task contrasts (e.g., N-back 2-0back) and functional connectivity phenotypes showed comparable prediction errors across AA and WA groups when trained on balanced data.
Atlas Influence: Phenotypes parcellated with the Glasser atlas (multimodal, functionally defined) showed significantly lower bias than those using the Destrieux atlas (anatomically defined by folding patterns), suggesting that anatomical templates may embed structural biases.

B. Performance vs. Fairness

Correlation: There was a moderate positive correlation ( $r \approx 0.62$ ) between predictive accuracy (lower MAE) and lower bias. Phenotypes with stronger brain-cognition associations generalized more equitably.
Multimodal Stacking: While stacking improved overall predictive accuracy, it did not improve fairness. Multimodal models retained intermediate bias levels, often favoring the majority group (White) in "All" and "Balanced" training scenarios.

C. Impact of Training Composition

Balanced Sampling: Training on equal-sized AA and White subsamples emerged as the optimal strategy. It reduced disparities without sacrificing accuracy for the majority group.
Diminishing Returns: Increasing AA representation beyond the 50% balance point (via synthetic oversampling) yielded no further improvements in AA prediction accuracy and, in some cases (sMRI), degraded performance for White participants.

4. Key Contributions

First Modality-Wide Benchmark: This study provides the first systematic evaluation of ethnic bias across the full spectrum of MRI modalities (sMRI, DTI, fMRI task, fMRI FC) for cognitive prediction.
Identification of Modality-Specific Vulnerabilities: It establishes that structural phenotypes are highly susceptible to training sample composition, whereas functional and task-based phenotypes are more robust and equitable.
Validation of Balanced Sampling: The study demonstrates that balanced subsampling is a "ceiling" for fairness in the absence of new data collection, outperforming complex algorithmic interventions like oversampling or multimodal stacking.
Atlas Dependency: It highlights that the choice of brain atlas (functional vs. anatomical) significantly impacts cross-ethnic generalizability, likely due to how pre-processing pipelines handle population-specific anatomical variations.

5. Significance & Implications

Clinical Translation: Deploying predictive neuroimaging models without addressing ethnic bias risks reinforcing health disparities. The findings suggest that "one-size-fits-all" models trained on majority-White data are ethically and scientifically insufficient for diverse populations.
Methodological Shift: The results argue against relying solely on increasing dataset size or synthetic oversampling. Instead, researchers should prioritize:
- Balanced training designs during model development.
- Feature selection favoring functional/task-based phenotypes over structural ones for equitable prediction.
- Development of population-specific templates and atlases to reduce pre-processing biases inherent in standard tools (e.g., MNI152, Destrieux).
Future Directions: The paper calls for the development of inclusive pre-processing pipelines and the recognition that accuracy gains do not automatically equate to fairness. The "Balanced AA+RandWA" strategy is proposed as the practical upper bound for equitable prediction in current neuroimaging frameworks.

Brain predictive models of cognition fail to generalize across ethnicities: Modality-dependent bias in MRI-based prediction

1. The Problem: The "One-Size-Fits-All" Trap

2. The Experiment: Four Different "Cooking Classes"

3. The Big Surprise: Not All Brain Scans Are Created Equal

4. The "More Data" Myth

5. The "Super-Model" Didn't Fix It

Summary: What Should We Do?

1. Problem Statement

2. Methodology

3. Key Results

4. Key Contributions

5. Significance & Implications

More like this

From nodes to pathways: an edge-centric model of brain function-structure coupling via constrained Laplacians

Excitation-inhibition balance controls coupling stability and network reorganization in a plastic Kuramoto model

Disinhibition of a recurrent attractor gates a persistent goal signal for navigation

Uncovering dynamic human brain phase coherence networks

Mitochondrially Transcribed dsRNA Mediates Manganese-induced Neuroinflammation