Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

This study evaluates the demographic bias of deep-learning and traditional brain MRI segmentation methods across race and sex subgroups, revealing that while most models exhibit race-dependent performance disparities when trained on matching data, the robust nnU-Net maintains consistent accuracy, and segmentation-derived volumes largely fail to capture race-specific effects observed in manual ground truth.

Ghazal Danaee, Marc Niethammer, Jarrett Rushmore, Sylvain Bouix

Published 2026-02-23

Imagine you are trying to teach a group of students how to identify and outline a specific, tiny room inside a very complex house (the brain). This room is called the Nucleus Accumbens, and it's important for understanding things like mood and addiction.

The researchers in this paper wanted to see whether these "teachers" (the AI models) worked equally well for everyone, or whether they carried a hidden bias that made them better at drawing the room for some demographic groups than for others.

Here is the breakdown of their study using simple analogies:

1. The Setup: Four Different "Teachers"

The researchers tested four different methods to teach the AI how to draw this tiny room:

  • The Three "Deep Learning" Teachers (UNesT, nnU-Net, CoTr): These are modern, self-taught teachers who learn by looking at thousands of pictures and figuring out the patterns on their own. They are very capable, but they can sometimes fixate on superficial details of the pictures they studied.
  • The "Atlas" Teacher (ANTs): This is the old-school method. Imagine a teacher who has a single, perfect blueprint (an atlas) of a house. To draw a new house, they try to stretch and fit that one blueprint onto the new building. It's a classic, reliable method, but it relies heavily on having the right blueprint.
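To make the "blueprint" idea concrete, here is a minimal sketch of registration-based (atlas) segmentation using the ANTsPy library. The file names are placeholders, and the paper's actual ANTs configuration (template, transform type, parameters) isn't given in this summary, so treat this as an illustration of the general technique rather than the authors' exact pipeline:

```python
# Minimal sketch of atlas-based segmentation with ANTsPy.
# File names are placeholders; the paper's exact ANTs settings may differ.
import ants

# The "blueprint": one labeled template brain.
atlas_image  = ants.image_read("atlas_t1.nii.gz")
atlas_labels = ants.image_read("atlas_accumbens_labels.nii.gz")

# The new "house": an unlabeled subject scan.
subject = ants.image_read("subject_t1.nii.gz")

# Stretch the blueprint onto the new building: deformable (SyN) registration.
reg = ants.registration(fixed=subject, moving=atlas_image,
                        type_of_transform="SyN")

# Carry the atlas labels through the same warp. Nearest-neighbor
# interpolation keeps the labels discrete (no half-labels).
subject_labels = ants.apply_transforms(fixed=subject, moving=atlas_labels,
                                       transformlist=reg["fwdtransforms"],
                                       interpolator="nearestNeighbor")

ants.image_write(subject_labels, "subject_accumbens_seg.nii.gz")
```

The fragility described above comes straight out of this design: everything hinges on how well that single template can be warped onto the new brain.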

2. The Problem: The "All-White" or "All-Black" Classroom

To test for bias, the researchers created a tricky scenario. Instead of giving the teachers a mixed class of students (Black men, Black women, White men, White women), they trained each teacher on only one specific group.

  • Teacher A only saw pictures of Black men.
  • Teacher B only saw pictures of White women.
  • And so on.

Then, they tested these teachers on everyone. The question was: Does a teacher who only studied White women do a bad job when asked to draw the room for a Black man?
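In code, that experimental design is just a grid: one model per training group, evaluated on every test group. The sketch below shows the shape of that grid; train_segmenter and mean_dice are hypothetical placeholders standing in for a real training and evaluation pipeline:

```python
# Sketch of the cross-demographic evaluation grid. train_segmenter and
# mean_dice are hypothetical placeholders, not the paper's actual code.
GROUPS = ["Black men", "Black women", "White men", "White women"]

def train_segmenter(group):
    # Hypothetical: fit a segmentation model on scans from one group only.
    return f"model_trained_on_{group}"

def mean_dice(model, group):
    # Hypothetical: average overlap score of the model on a held-out group.
    return 0.0

# Train one "teacher" per group, then test every teacher on every group.
results = {}
for train_group in GROUPS:
    model = train_segmenter(train_group)
    for test_group in GROUPS:
        results[(train_group, test_group)] = mean_dice(model, test_group)

# A fair method scores about the same across each row; a biased one does
# well on the diagonal (train group == test group) and drops elsewhere.
for (tr, te), score in sorted(results.items()):
    print(f"trained on {tr:11s} | tested on {te:11s} | score = {score:.3f}")
```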

3. The Findings: Who Failed the Test?

The "Old-School" Teacher (ANTs) Struggled Hard
This teacher was the most biased. It was like a tailor who only ever made suits for tall, thin men. When asked to make a suit for a shorter, broader person, the suit didn't fit at all.

  • When the Atlas teacher was trained only on Black subjects, it performed significantly worse than when trained on White subjects.
  • It essentially "forgot" how to draw the room correctly for people who didn't look like its training data.

The "High-Tech" Teachers Had Mixed Results

  • UNesT: This teacher was also quite biased. If it was trained on White people, it struggled with Black people. It seems to have memorized the specific "look" of the training group rather than learning the general concept of the room.
  • nnU-Net: This was the standout. It was the only teacher whose accuracy didn't depend on who the subject was. Whether the test subject was Black, White, male, or female, nnU-Net drew the room just as accurately. It was like a master chef who can cook a perfect meal regardless of the specific ingredients they are given, because they truly understand the principles of cooking, not just one recipe.
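"Just as accurately" has a standard meaning in segmentation work: overlap between the model's mask and an expert's manual tracing, most commonly scored with the Dice coefficient. This summary doesn't name the paper's exact metrics, so the NumPy sketch below illustrates Dice in general rather than the paper's evaluation code:

```python
import numpy as np

def dice(pred, truth):
    """Dice overlap of two binary masks: 2|A ∩ B| / (|A| + |B|).
    1.0 = perfect match, 0.0 = no overlap at all."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:            # both masks empty: count as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 3D masks: the expert's "room" vs. a slightly shifted model guess.
truth = np.zeros((10, 10, 10), dtype=bool)
truth[3:7, 3:7, 3:7] = True
pred = np.zeros_like(truth)
pred[4:8, 3:7, 3:7] = True

print(f"Dice = {dice(pred, truth):.3f}")   # 0.750 for this toy example
```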

4. The Hidden Danger: The "Volume" Trap

The researchers didn't just check if the drawing was accurate; they also measured the size of the room.

  • The Truth: In real life (measured by human experts), there are actual size differences between men and women, and between different racial groups.
  • The AI Lie: When the biased models (like the Atlas and UNesT) measured the room, they erased the racial differences. Their volume estimates showed essentially no difference between Black and White subjects for a structure where the expert measurements showed a real one.
  • The Consequence: If a doctor uses a biased AI to measure a patient, they might miss a real medical issue because the AI "smoothed over" the differences. It's like a broken scale that always says you weigh the same as your neighbor, even if you are actually much heavier or lighter.
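The volume check itself is simple arithmetic: count the labeled voxels and multiply by the physical size of one voxel, then compare groups. The sketch below uses synthetic toy numbers (not the paper's data) and a plain t-test (the paper's statistical model isn't specified in this summary) to show how a biased segmenter can flatten a real group difference:

```python
import numpy as np
from scipy import stats

def mask_volume_mm3(mask, voxel_size_mm=(1.0, 1.0, 1.0)):
    # Volume = number of labeled voxels x physical volume of one voxel.
    return mask.sum() * float(np.prod(voxel_size_mm))

toy_mask = np.zeros((10, 10, 10), dtype=bool)
toy_mask[3:7, 3:7, 3:7] = True
print(mask_volume_mm3(toy_mask))   # 64 voxels -> 64.0 mm^3 at 1 mm isotropic

# Synthetic toy volumes (mm^3), NOT the paper's data: the manual tracings
# are drawn with a built-in group difference; the hypothetical biased
# model's outputs are drawn with none.
rng = np.random.default_rng(0)
manual_a = rng.normal(620, 40, 30)   # manual tracing, group A
manual_b = rng.normal(570, 40, 30)   # manual tracing, group B
auto_a   = rng.normal(595, 40, 30)   # biased model, group A
auto_b   = rng.normal(595, 40, 30)   # biased model, group B

for name, a, b in [("manual", manual_a, manual_b), ("auto", auto_a, auto_b)]:
    t, p = stats.ttest_ind(a, b)
    print(f"{name:6s}: mean difference = {a.mean() - b.mean():6.1f} mm^3, "
          f"p = {p:.4f}")
```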

5. The Big Takeaway

The paper teaches us three main lessons:

  1. Data is Diet: If you feed an AI a diet of only one type of food (one demographic), it will get sick (biased) when asked to handle a different type of food.
  2. Not All AI is Created Equal: Just because a model is "Deep Learning" doesn't mean it's fair. Some architectures (like nnU-Net) are naturally better at ignoring irrelevant differences (like race) and focusing on the underlying anatomy; others (like UNesT) latch onto the look of their training group, and the traditional Atlas method is the most fragile of all.
  3. Fairness Matters for Health: If we don't fix these biases, we risk creating medical tools that work great for some people but fail for others, potentially leading to misdiagnoses for minority groups.

In short: To build a fair medical AI, we need to train it on a diverse "classroom" of people, and we need to choose the "teachers" (algorithms) that are smart enough to learn the general rules of anatomy rather than just memorizing the faces of the students they were taught with.
