Imagine you are a detective trying to solve a mystery inside a giant, 3D library. This library is a CT scan of a human chest, but instead of books, it's made of hundreds of thin slices of images (like pages in a book). Your job is to look at the whole library and decide: Is the person healthy? Do they have COVID? Or do they have one of two types of lung cancer?
The catch? The "clues" (the sick parts of the lung) are tiny and hidden in just a few pages, while most of the library is filled with healthy pages. Also, the library has a secret bias: it has way more stories about men than women for certain diseases, and the AI detective might accidentally learn to guess based on whether the story sounds "male" or "female" rather than looking at the actual clues.
This paper describes a smart new system built to solve this mystery fairly and accurately. Here is how they did it, broken down into simple concepts:
1. The Problem: Finding a Needle in a Haystack
The Challenge: A CT scan has 100 to 800 slices. If a patient has a small tumor, it might only show up in 5 of those slices.
- The Old Way: Imagine averaging the opinion of every single page in the library. If 95 pages say "Healthy" and 5 say "Sick," the average says "Healthy." The AI misses the disease.
- The New Way (Attention-MIL): Instead of listening to everyone equally, the AI learns to be a smart librarian. It uses an "Attention Mechanism" to figure out which specific pages are important. It learns to ignore the boring, healthy pages and focus its energy on the few pages with the tumor. It's like having a highlighter that automatically marks the most critical sentences in a book.
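To make the "smart librarian" idea concrete, here is a minimal numpy sketch of attention-based pooling. It assumes a single learned scoring vector `w` standing in for the small attention network a real Attention-MIL model would train; the names `attention_mil_pool` and the toy features are illustrative, not the paper's actual code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(slice_features, w):
    """Pool per-slice feature vectors into one scan-level vector.

    slice_features: (num_slices, feat_dim) array, one row per CT slice.
    w: (feat_dim,) scoring vector (a stand-in for the learned attention net).
    """
    scores = slice_features @ w        # one relevance score per slice
    weights = softmax(scores)          # normalize so the weights sum to 1
    pooled = weights @ slice_features  # attention-weighted average of slices
    return pooled, weights

# Toy example: six bland "healthy" slices plus one slice with a strong signal.
rng = np.random.default_rng(0)
feats = rng.normal(0, 0.1, size=(7, 4))
feats[3] += 5.0                        # the "tumor" slice stands out
pooled, weights = attention_mil_pool(feats, np.ones(4))
print(weights.argmax())                # slice 3 receives almost all the attention
```

A plain average would let the six healthy slices drown out slice 3; the softmax weighting lets the one suspicious slice dominate the pooled representation.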
2. The Bias Problem: The "Gender Shortcut"
The Challenge: The training data had very few cases of a specific cancer in women (Squamous Cell Carcinoma). Because there were so few examples, the AI got lazy. It started guessing "Male" or "Female" based on the shape of the lungs or how the scan was taken, rather than looking at the disease. This is called a "shortcut." If the AI relies on these shortcuts, it might be great at diagnosing men but terrible at diagnosing women.
The Solution (The Adversarial Game):
The researchers added a second, mischievous AI inside the main system.
- The Main AI tries to diagnose the disease.
- The Mischievous AI tries to guess the patient's gender based on the Main AI's notes.
- The Trick: They use a "Gradient Reversal Layer" (GRL). During training, it flips the learning signal coming from the gender-guesser, so every time the Mischievous AI gets better at guessing gender, the Main AI is pushed in the opposite direction, toward features that hide gender.
- The Result: The Main AI is forced to scrub all gender clues out of its notes. It has to learn to diagnose the disease purely based on the lung tissue, not the patient's gender. It's like forcing a judge to make a decision without knowing the defendant's name or background.
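The gradient reversal trick itself is tiny. The sketch below, in plain numpy rather than an autograd framework, is a hypothetical illustration of the two behaviors that define a GRL: pass features through unchanged on the way forward, and negate the gender head's gradient on the way back.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips the gradient's sign in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the reversal

    def forward(self, x):
        return x  # features pass through untouched

    def backward(self, grad_from_gender_head):
        # The gender head learns normally, but its gradient is negated before
        # reaching the feature extractor, so the features are trained to HIDE
        # gender rather than reveal it.
        return -self.lam * grad_from_gender_head

grl = GradientReversal(lam=1.0)
feat = np.array([0.5, -1.2])
print(grl.forward(feat))                       # unchanged on the forward pass
print(grl.backward(np.array([0.3, -0.7])))     # sign-flipped on the backward pass
```

In a real framework this would be a custom autograd op sitting between the shared feature extractor and the gender classifier, with `lam` often ramped up over training.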
3. The Data Imbalance: The "Rare Book" Problem
The Challenge: In the training library, there were hundreds of "Male" cancer books but only a handful of "Female" cancer books. If you just read the library randomly, you'd almost never see the rare female cases, so the AI would never learn how to spot them.
The Solution (The "Highlighter" Strategy):
- Focal Loss: This is a special scoring system. It tells the AI, "Don't worry about the easy cases you already know; focus your brainpower on the hard, rare cases you keep getting wrong."
- Oversampling: The researchers manually made sure that the rare "Female Cancer" cases appeared in the training mix much more often than they naturally occurred. It's like a teacher making sure a student practices the hardest math problems every single day, not just the easy ones.
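Both ideas above fit in a few lines. This sketch shows the standard binary focal loss formula and a simple inverse-frequency weighting scheme for oversampling; the group names and counts are made up for illustration and are not the paper's numbers.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Focal loss given the probability assigned to the correct class.

    The (1 - p)^gamma factor shrinks the loss on easy, confident cases,
    so training effort concentrates on the hard, rare ones.
    """
    return -((1 - p_true) ** gamma) * np.log(p_true)

# An easy case (95% confident) contributes far less than a hard case (30%):
easy = focal_loss(0.95)
hard = focal_loss(0.30)
print(hard / easy)   # the hard case outweighs the easy one by a huge factor

# Oversampling: give each sample a draw-probability inversely proportional
# to its group's size, so rare cases appear much more often per epoch.
counts = {"male_scc": 200, "female_scc": 10}          # hypothetical counts
weights = {g: 1.0 / n for g, n in counts.items()}
print(weights["female_scc"] / weights["male_scc"])    # rare group drawn 20x as often
```

In practice these sampling weights would feed a weighted random sampler in the data loader, while focal loss replaces plain cross-entropy in the training objective.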
4. The Final Exam: The "Super-Panel"
The Challenge: Even with a great system, one single model might get lucky or unlucky on a specific day.
The Solution:
- Ensemble: They trained five slightly different versions of the AI (like five different expert doctors).
- Voting: When a new patient comes in, all five doctors look at the scan. They don't just pick the winner; they take a "soft vote," combining their confidence levels to make a final decision.
- Mirror Trick (TTA): They also showed each model the scan flipped horizontally, like looking at it in a mirror, and averaged the two predictions so the AI wasn't thrown off by the orientation of the image.
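The voting-plus-mirror step can be sketched directly. The "models" below are toy stand-in functions, assumed only to return class-probability vectors; the real system would use five trained networks.

```python
import numpy as np

def soft_vote(prob_rows):
    """Average class-probability vectors from several models ('soft' voting)."""
    return np.mean(prob_rows, axis=0)

def predict_with_tta(models, scan):
    """For each model, average its prediction on the scan and on the scan's
    horizontal mirror (test-time augmentation), then soft-vote the ensemble."""
    per_model = []
    for m in models:
        p = (m(scan) + m(np.flip(scan, axis=-1))) / 2
        per_model.append(p)
    return soft_vote(per_model)

# Toy two-model "ensemble" mapping a scan to two-class probabilities.
def model_a(x):
    s = x.mean()
    return np.array([1 - s, s])

def model_b(x):
    s = x.max() * 0.5
    return np.array([1 - s, s])

scan = np.full((4, 4), 0.8)
probs = predict_with_tta([model_a, model_b], scan)
print(probs)   # confidences are averaged rather than a winner-takes-all vote
```

Soft voting keeps each model's confidence in play: a model that is 55% sure carries less weight than one that is 99% sure, which a hard majority vote would throw away.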
The Result
By combining these strategies, the team built a system that:
- Finds the needle: It ignores the healthy lung tissue to find the tiny tumors.
- Is fair: It diagnoses men and women with equal accuracy because it was forced to ignore gender clues.
- Handles the rare cases: It specifically trained harder on the rare female cancer cases so it wouldn't fail them.
In a nutshell: They built a super-smart, fair-minded AI doctor that knows how to ignore distractions, focus on the tiny details that matter, and treat every patient equally, regardless of their gender. This is a huge step toward making AI safe and reliable for real-world hospitals.