Imagine you are trying to teach a robot to be a super-radiologist. Its job is two-fold:
- Find the tumor: Draw a precise outline around a breast cancer lump on an MRI scan.
- Predict the future: Look at that scan before treatment starts and guess if the chemotherapy will completely wipe out the cancer.
For a long time, scientists trained these robots using data from just one hospital. It's like teaching a student to drive only on a sunny day in a quiet parking lot. When you send that student out onto a rainy, busy highway in a different city, they crash.
The MAMA-MIA Challenge was a massive, real-world "driving test" designed to fix this. Here is the story of what happened, explained simply.
1. The Big Test: A Global Road Trip
The organizers gathered a huge dataset of MRI scans from 1,506 patients across the United States to train the AI models. This was the "driving school."
Then, they sent these trained models to a completely different "road": an external test set of 574 patients from three different hospitals in Europe (Spain, Poland, and Lithuania).
- The Goal: See if the AI could handle different cameras, different lighting, different doctors, and different patient bodies without getting confused.
- The Twist: They didn't just grade them on how well they drove; they also graded them on fairness. Did the AI work equally well for young women, older women, women with dense breasts, and women with less dense breasts? Or did it only work well for one specific group?
2. The Two Tasks: The "Easy" One and the "Hard" One
Task 1: The "Find the Blob" Game (Tumor Segmentation)
- The Job: Draw a line around the tumor.
- The Result: Success! The AI models were surprisingly good at this. Even when they moved from the US to Europe, they kept their cool.
- The Analogy: Think of this like a game of "Where's Waldo?" The AI got really good at spotting Waldo (the tumor) even when the background changed from a beach to a forest.
- The Catch: The AI still struggled with the "tricky" cases: tiny tumors, tumors that looked like fog (low contrast), or tumors hiding near breast implants. It's like trying to find a small, gray mouse in a pile of gray sand.
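"Drawing the line around the tumor" is typically graded with an overlap score such as the Dice coefficient, where 1.0 means the AI's outline matches the expert's perfectly and 0.0 means no overlap at all. A minimal sketch of the idea (the challenge's exact metric set isn't spelled out here, so treat this as illustrative):

```python
# Illustrative sketch: the Dice coefficient, a common way to grade
# tumor segmentations (assumed here; a real challenge may combine several metrics).
# Masks are binary: 1 = "this voxel is tumor", 0 = background.

def dice(pred, truth):
    """Overlap between predicted and expert tumor masks (0.0 to 1.0)."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # both masks empty: trivially perfect agreement
    return 2 * intersection / total

expert = [0, 1, 1, 1, 0, 0]   # toy 1-D "scan": expert outlined 3 voxels
ai     = [0, 1, 1, 0, 0, 0]   # AI found 2 of those 3 voxels
print(dice(ai, expert))       # 2*2 / (2+3) = 0.8
```

This also explains why tiny tumors hurt so much: miss just a few voxels of a small tumor and the overlap fraction collapses, which is exactly the "gray mouse in gray sand" problem.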
Task 2: The "Crystal Ball" Game (Predicting Treatment Response)
- The Job: Look at the scan and say, "Will this patient's cancer disappear completely after chemo?"
- The Result: Ouch. This was incredibly hard. Most AI models performed barely better than a coin flip.
- The Analogy: This is like trying to predict the winner of a horse race just by looking at a photo of the horses standing in the stable. You can see the horses, but you can't see how they will run, how the jockey will ride, or how the track will feel.
- The Reality Check: The paper concludes that looking at a single scan before treatment isn't enough to predict the future. The AI was essentially guessing.
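The "coin flip" baseline can be made concrete. On a ranking metric like AUC, a model that ignores the scan and guesses at random lands near 0.5, which is roughly where many response-prediction models ended up. A sketch (not the challenge's actual evaluation code):

```python
import random

def auc(scores, labels):
    """Probability a random responder is ranked above a random non-responder."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    ties = sum(1 for p in pos for n in neg if p == n)
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]  # who actually responded
guesses = [random.random() for _ in range(1000)]      # a model that's just guessing
print(round(auc(guesses, labels), 2))                 # hovers around 0.50
```

An AUC of 0.5 means the model's scores carry no information about who will respond: the crystal ball really is foggy.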
3. The Fairness Scorecard: The "Equal Opportunity" Rule
This is the most important part of the paper. Usually, AI is judged on its average score.
- The Old Way: If an AI is 99% accurate for young women but only 50% accurate for older women — and older women are a small minority of the test set — the overall average can still look "okay." But that's unfair and dangerous.
- The MAMA-MIA Way: They introduced a Fairness Score. The AI had to be good for everyone, not just the average person.
- The Trade-off: Some teams tried to boost their overall score by ignoring the "hard-to-diagnose" groups. The challenge penalized them for this. It forced the AI to be a "fair" doctor, ensuring it didn't leave vulnerable patients behind just to get a higher grade.
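The principle behind such a fairness score can be sketched simply: instead of rewarding the average alone, penalize the gap between the best- and worst-served subgroups, so ignoring a hard group drags the grade down. The formula below is illustrative only, not the challenge's actual scoring rule:

```python
# Illustrative fairness-aware scoring (NOT the official MAMA-MIA formula):
# grade a model on its average performance minus a penalty for the gap
# between its best- and worst-served patient subgroups.

def fairness_aware_score(group_scores, penalty=0.5):
    scores = list(group_scores.values())
    average = sum(scores) / len(scores)
    gap = max(scores) - min(scores)       # best group minus worst group
    return average - penalty * gap

# Hypothetical per-subgroup accuracies for two models:
model_a = {"young": 0.90, "older": 0.88, "dense": 0.89, "non_dense": 0.91}
model_b = {"young": 0.99, "older": 0.70, "dense": 0.95, "non_dense": 0.96}

print(round(fairness_aware_score(model_a), 3))  # consistent everywhere
print(round(fairness_aware_score(model_b), 3))  # higher average, big gap -> penalized
```

Model B has the better average, but Model A wins under the fairness-aware score — which is exactly the incentive the challenge was trying to create.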
4. What Did We Learn? (The Takeaway)
- Finding the tumor is getting solved: We are close to having AI that can reliably outline tumors across different hospitals and across different groups of patients.
- Predicting the cure is still a mystery: We cannot yet reliably predict if chemo will work just by looking at a pre-treatment scan. The "Crystal Ball" is still foggy.
- Fairness is non-negotiable: You can't just have a smart AI; you need a fair AI. If the AI works great for some but fails for others, it's not ready for the real world.
- The "One-Size-Fits-All" approach fails: The paper showed that models trained on one type of data often stumble when they hit a new hospital. We need AI that is robust enough to handle the messy reality of the real world.
In a Nutshell
The MAMA-MIA Challenge was a reality check for medical AI. It proved that while we are getting very good at finding the problem (the tumor), we are still terrible at predicting the solution (the cure) using only a single picture.
More importantly, it taught us that in medicine, being "good on average" isn't good enough. An AI system must be fair and reliable for every patient, regardless of their age, background, or body type, or it simply cannot be trusted in a hospital.