Imagine you are training a new doctor to read Chest X-rays. You want them to be a master at spotting every possible disease, from the very common ones (like a simple cold) to the extremely rare ones (like a specific bone defect that only happens once in a million people).
The problem? You don't have enough training data for the rare diseases. In fact, for some diseases, you have zero examples at all.
This paper describes how a team of researchers built an AI system to solve this exact problem for the CXR-LT 2026 Challenge. They had to tackle two different "boss battles" using two different strategies.
Here is the breakdown of their solution, explained simply:
The Two Big Problems
The "Popular vs. Obscure" Problem (Task 1):
Imagine a library where 99% of the books are about "The Common Cold," but there is only one book about "Rare Jungle Fever." If you train a student by just reading the library, they will become an expert on colds but will fail miserably at Jungle Fever. In medical terms, this is called a Long-Tailed Distribution: the AI naturally ignores the rare diseases because it sees them so rarely.
The "Ghost Disease" Problem (Task 2):
Now, imagine the student is asked to identify a disease they have never seen before and for which they have zero textbooks. They can't study it; they just have to guess based on what they know about anatomy and language. This is called Zero-Shot Learning.
The Solution: Two Different Toolkits
The team built two separate "brains" to handle these two problems.
🛠️ Toolkit 1: The "Fair Teacher" (For Task 1)
Goal: Make the AI pay attention to the rare diseases without forgetting the common ones.
- The Analogy: Imagine a teacher who notices the students are only studying the popular chapters. To fix this, the teacher gives the rare chapters extra credit and forces the students to spend more time on them.
- How they did it:
- Reweighting: They told the AI, "If you get a rare disease right, you get a huge reward. If you get a common one right, it's just a small reward." This forces the AI to care about the rare stuff.
- Sampling: They made the AI look at images with rare diseases more often, almost like making the student read the rare book five times while only reading the common book once.
- The "Normal" Check: They added a safety net. If the AI is 99% sure the X-ray is "Normal," it automatically lowers the scores for all diseases. This stops the AI from hallucinating diseases where there are none.
- The Ensemble: They didn't just use one AI; they trained two slightly different versions and asked them to vote on the answer. It's like asking two experts for a second opinion to be sure.
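The four tricks above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the inverse-frequency weighting, the repeat counts for sampling, the 0.99 gating threshold, and simple probability averaging are all assumptions chosen to make the ideas concrete.

```python
import numpy as np

def class_weights(label_counts):
    """Reweighting: inverse-frequency weights so rare classes earn
    a bigger reward (normalized so the average weight is 1)."""
    counts = np.asarray(label_counts, dtype=float)
    weights = counts.sum() / counts
    return weights / weights.mean()

def oversample(indices_by_class, repeats):
    """Sampling: rare-disease images appear more often in the training
    stream, like reading the rare book five times."""
    stream = []
    for cls, idxs in indices_by_class.items():
        stream.extend(idxs * repeats.get(cls, 1))
    return stream

def gate_by_normal(disease_scores, normal_score, threshold=0.99, damping=0.1):
    """The 'Normal' check: if the model is very confident the X-ray is
    Normal, scale down every disease score as a safety net."""
    scores = np.asarray(disease_scores, dtype=float)
    if normal_score >= threshold:
        return scores * damping
    return scores

def ensemble(scores_a, scores_b):
    """The ensemble: two models 'vote' by averaging their probabilities."""
    return (np.asarray(scores_a, float) + np.asarray(scores_b, float)) / 2.0

# Example: 9,900 'Common Cold' images vs. 100 'Rare Jungle Fever' images.
w = class_weights([9900, 100])
# The rare class now carries 99x the weight of the common one.
```

The exact weighting scheme (inverse frequency here) and the repeat counts are design knobs; the paper's point is simply that both the loss and the data stream are tilted toward the rare classes.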
Result: Their "Fair Teacher" AI became the #1 ranked team for spotting diseases in the known list, especially the rare ones.
🛠️ Toolkit 2: The "Translator" (For Task 2)
Goal: Identify diseases the AI has never seen, using only text descriptions.
- The Analogy: Imagine you have never seen a "Goiter" (a swollen neck gland), but you know what a "swollen neck" looks like and you can read a description of a Goiter. Instead of showing the AI pictures of Goiters, you give it a text description and ask, "Does this X-ray look like the description of a Goiter?"
- How they did it:
- They used a special AI called WhyXrayCLIP. Think of this as a translator that speaks both "Image" and "Medical Text."
- They taught this translator by showing it millions of X-rays paired with doctors' written reports. It learned that the visual pattern of a "Bulla" (a bubble in the lung) matches the words "air-filled space" or "bubble."
- The Test: When they needed to identify a new disease (like Scoliosis), they didn't show the AI any pictures of Scoliosis. Instead, they fed it text prompts like "curvature of the spine." The AI compared the X-ray image to the text description. If they matched well, it said, "Yes, this looks like Scoliosis."
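That matching step boils down to comparing embedding vectors: the image and each text prompt are encoded into the same vector space, and cosine similarity measures how well they match. Here is a toy sketch of the idea, with made-up 3-dimensional vectors standing in for real WhyXrayCLIP encoder outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_score(image_embedding, prompt_embeddings):
    """Score an unseen disease by comparing the X-ray's embedding with
    embeddings of text prompts describing it; keep the best match."""
    return max(cosine_similarity(image_embedding, p) for p in prompt_embeddings)

# Hypothetical vectors (real encoders produce hundreds of dimensions):
xray = [0.9, 0.1, 0.0]
scoliosis_prompts = [
    [1.0, 0.0, 0.0],  # embedding of "curvature of the spine"
    [0.7, 0.7, 0.0],  # embedding of "sideways spinal curve"
]
score = zero_shot_score(xray, scoliosis_prompts)
# A high score means the X-ray "looks like" the description of Scoliosis.
```

No Scoliosis image ever enters training; only the text prompts carry the knowledge of what the new disease should look like.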
Result: Their "Translator" AI was the #1 ranked team for identifying diseases it had never seen before, proving you can teach an AI about new things just by giving it the right vocabulary.
The Final Scorecard
The researchers tested their system on a public leaderboard:
- Task 1 (Common & Rare Diseases): They scored 0.583, beating the second-place team by a significant margin.
- Task 2 (Ghost Diseases): They scored 0.467, far ahead of the second-place team's 0.365.
Why This Matters
In the real world, hospitals don't have perfect data. They have thousands of images of common issues and very few of rare ones. Sometimes, a new disease appears, and no one has labeled it yet.
This paper shows that by using smart weighting (to fix the imbalance) and text-based learning (to handle the unknown), we can build AI doctors that are not just good at what they've seen, but are also ready for the unexpected. They are moving from "memorizing the textbook" to "understanding the concept."