Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study

This benchmark study demonstrates that combining data-centric rebalancing with model-centric feature disentanglement methods effectively mitigates shortcut learning in medical imaging, yielding more robust and generalizable models than rebalancing alone while maintaining computational efficiency.

Sarah Müller, Philipp Berens

Published 2026-02-24

Imagine you are training a student to become a doctor who can diagnose a specific disease from an X-ray. You show them thousands of pictures: some show sick patients, and some show healthy ones.

Ideally, the student learns to look at the lungs to find the disease. But, unfortunately, the student is a bit lazy and clever. They notice a "shortcut": every time the patient was male, the X-ray machine used was a bit older and grainier, and all the male patients in the training set happened to have the disease.

So, instead of learning to look at the lungs, the student learns to say: "If the image looks grainy (like the old machine), the patient is sick."

This is called Shortcut Learning. In the real world, this is dangerous. If you show this student a picture of a sick female patient taken with a modern, crisp machine, they will fail because the "grainy" shortcut isn't there. They didn't learn the real medicine; they just memorized a coincidence.

The Problem: The "Clever Hans" Doctor

In medical AI, models often act like "Clever Hans" (a famous horse that seemed to do math but was actually just reading the trainer's body language). They find easy patterns in the data that aren't actually the cause of the disease.

  • The Real Cause: A tumor in the lung.
  • The Shortcut: The hospital logo in the corner, the patient's gender, or the type of scanner used.

When these models move to a new hospital with different equipment or different patients, they break because the "shortcut" patterns disappear.

The Solution: Untangling the Knot

The researchers in this paper wanted to teach the AI to stop cheating and actually learn the right things. They used a technique called Feature Disentanglement.

Think of the AI's brain as a messy room where all its knowledge is thrown into one big pile. It's hard to tell what is "disease knowledge" and what is "shortcut knowledge."

Disentanglement is like hiring a professional organizer to sort that room.
They split the room into two distinct, separate boxes:

  1. Box A (The Task): Contains only information about the disease (the lungs).
  2. Box B (The Confounder): Contains only information about the shortcut (the scanner type, gender, etc.).

The goal is to make sure Box A has zero information about the shortcut. If the AI tries to put a "grainy scanner" clue into Box A, the system yells, "No! That belongs in Box B!"
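One common way to make the system "yell" is a statistical independence penalty between the two boxes. The paper's exact loss may differ; the minimal numpy sketch below uses a cross-covariance penalty (a hypothetical but standard choice) that grows whenever Box A still co-varies with Box B:

```python
import numpy as np

def split_features(z, task_dims):
    """Partition a batch of latent vectors into Box A (task) and Box B (confounder)."""
    return z[:, :task_dims], z[:, task_dims:]

def leakage_penalty(z_task, z_conf):
    """Cross-covariance penalty: large when Box A still carries information
    that co-varies with Box B. Driving it to zero decorrelates the boxes."""
    zt = z_task - z_task.mean(axis=0)
    zc = z_conf - z_conf.mean(axis=0)
    cross_cov = zt.T @ zc / len(zt)          # shape: (task_dims, conf_dims)
    return float(np.sum(cross_cov ** 2))

rng = np.random.default_rng(0)
shortcut = rng.normal(size=(256, 1))          # e.g. "scanner graininess"
disease = rng.normal(size=(256, 1))

# Entangled latent: Box A accidentally mixes in the shortcut signal.
entangled = np.hstack([disease + 0.9 * shortcut, shortcut])
# Disentangled latent: Box A holds only disease, Box B only the shortcut.
clean = np.hstack([disease, shortcut])

pen_entangled = leakage_penalty(*split_features(entangled, task_dims=1))
pen_clean = leakage_penalty(*split_features(clean, task_dims=1))
print(pen_entangled > pen_clean)  # → True: the penalty flags the leaking latent
```

Added to the training loss, this penalty is the "organizer" that pushes shortcut clues out of Box A.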

How They Tested It

The researchers didn't just guess; they ran a massive "Olympics" of different training methods using three types of data:

  1. Digits: Numbers written in thin or thick lines (a simple test).
  2. Chest X-rays: Checking for fluid in the lungs, where the shortcut was the patient's gender.
  3. Eye Scans: Checking for eye disease, where they artificially added a "noise" shortcut.

They tested the models in three scenarios:

  • The Normal Test: Just like the training data.
  • The Balanced Test: The shortcut and the disease are mixed up randomly.
  • The "Inverted" Test (The Trap): The shortcut is reversed! (e.g., now the "grainy" images are actually healthy, and the "crisp" ones are sick.) This is the ultimate test. If the AI is cheating, it will fail miserably here.
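The three scenarios amount to varying how often the shortcut agrees with the true label. A minimal sketch (illustrative, not the paper's data pipeline), where `p_align = 1.0` mimics the training data, `0.5` the balanced test, and `0.0` the inverted trap:

```python
import numpy as np

def make_split(n, p_align, rng):
    """Build a test split where the shortcut attribute matches the disease
    label with probability p_align (1.0 = normal, 0.5 = balanced, 0.0 = inverted)."""
    y = rng.integers(0, 2, size=n)            # disease label
    match = rng.random(n) < p_align
    shortcut = np.where(match, y, 1 - y)      # e.g. a grainy-scanner flag
    return y, shortcut

rng = np.random.default_rng(0)
for name, p in [("normal", 1.0), ("balanced", 0.5), ("inverted", 0.0)]:
    y, s = make_split(10_000, p, rng)
    print(name, float(np.mean(y == s)))       # label/shortcut agreement rate
```

A model that only learned the shortcut scores near 100% on the normal split, near 50% on the balanced one, and near 0% on the inverted one.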

The Results: What Worked Best?

1. The "Data Fix" (Rebalancing):
Imagine you have too many pictures of "grainy sick men" and not enough "crisp sick women." You simply copy-paste more pictures of the rare groups to make the list fair.

  • Result: This helped a lot. It forced the AI to look harder. But it wasn't perfect.
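The "copy-paste" fix is group oversampling: duplicate samples from the rare (label, shortcut) groups until every group is the same size. A self-contained sketch (the paper may instead use weighted sampling, but the effect is the same):

```python
import numpy as np

def oversample_groups(labels, shortcuts, rng):
    """Duplicate samples from rare (label, shortcut) groups until every group
    is as large as the biggest one, breaking the spurious correlation."""
    groups = {}
    for i, key in enumerate(zip(labels, shortcuts)):
        groups.setdefault(key, []).append(i)
    target = max(len(idx) for idx in groups.values())
    balanced = []
    for idx in groups.values():
        balanced.extend(idx)
        balanced.extend(rng.choice(idx, size=target - len(idx), replace=True).tolist())
    return balanced

# Biased toy set: 90 "grainy sick", only 10 "crisp sick", and vice versa for healthy.
labels    = [1] * 90 + [1] * 10 + [0] * 10 + [0] * 90
shortcuts = [1] * 90 + [0] * 10 + [1] * 10 + [0] * 90
idx = oversample_groups(labels, shortcuts, np.random.default_rng(0))
y = np.array(labels)[idx]
s = np.array(shortcuts)[idx]
# After rebalancing, the shortcut no longer predicts the label:
print(float(np.mean(y == s)))  # → 0.5
```

With every group padded to equal size, a lazy model can no longer score well by reading the shortcut alone.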

2. The "Model Fix" (Adversarial Learning):
This is like a game of "Hide and Seek." You have a detective (the AI) trying to find the disease, and a trickster (an adversary) trying to hide the disease clues and force the AI to use the shortcut. The AI has to get so good at finding the disease that the trickster can't hide it anymore.

  • Result: Good, but sometimes the AI got confused and stopped learning anything useful.
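The "Hide and Seek" game is usually written as a min-max objective: the encoder is rewarded for low disease-prediction loss and penalized whenever the adversary can read the shortcut from its features. A hedged numpy sketch of the combined loss (the minus sign is the gradient-reversal trick; `lam` and the linear heads are illustrative, not the paper's exact setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def adversarial_loss(z, y_disease, y_shortcut, w_task, w_adv, lam=1.0):
    """Detective/trickster objective for the encoder producing features z:
    minimize the task loss, but SUBTRACT the adversary's shortcut loss, so the
    encoder is pushed toward features the adversary cannot exploit."""
    task_loss = bce(sigmoid(z @ w_task), y_disease)
    adv_loss = bce(sigmoid(z @ w_adv), y_shortcut)
    return task_loss - lam * adv_loss

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 8))                  # toy encoder features
y_d = rng.integers(0, 2, size=64)             # disease labels
y_s = rng.integers(0, 2, size=64)             # shortcut labels
loss = adversarial_loss(z, y_d, y_s, rng.normal(size=8), rng.normal(size=8))
```

In a full training loop the adversary's own weights are updated to minimize `adv_loss`, while the encoder minimizes the combined loss; the instability the authors report comes from this tug-of-war failing to settle.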

3. The "Math Fix" (Disentanglement):
This is the professional organizer approach. They used math to physically separate the "disease box" from the "shortcut box."

  • Result: This was very effective at keeping the boxes separate.

4. The "Super Combo" (The Winner):
The researchers found that the best strategy was to combine the Data Fix (making the training list fair) with the Math Fix (forcing the AI to separate the boxes).

  • Why it wins: It's like giving the student a fair textbook and a strict teacher who checks their work. This combination made the AI robust. Even when the shortcut was reversed (the "Inverted Test"), this combo kept performing well, while the others crashed.

The Takeaway

The paper teaches us that to build safe medical AI, we can't just throw data at a computer and hope for the best. We have to be intentional.

  • Don't let the AI cheat: If it finds an easy shortcut, it will use it.
  • Sort the knowledge: Force the AI to separate "what matters" from "what doesn't."
  • Do both: Fix your data and fix your model architecture.

By doing this, we create AI doctors that don't just memorize the quirks of one hospital but actually understand the disease, making them safe to use in hospitals all over the world.
