Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification

This paper presents a systematic evaluation of loss functions, architectures, and post-training strategies for long-tailed multi-label chest X-ray classification on the CXR-LT 2026 benchmark. It demonstrates that LDAM-DRW combined with a ConvNeXt-Large backbone and classifier re-training achieves a top-5 ranking with 0.3950 mAP, and it offers practical insights into the development-to-test performance gap.

Nikhileswara Rao Sulake

Published 2026-03-04

Imagine you are running a massive hospital screening center. Every day, thousands of patients walk in for chest X-rays. Your goal is to build a super-smart computer assistant that can look at these X-rays and spot 30 different diseases at once.

Here's the catch: The diseases are not equal.

Some diseases, like a slightly enlarged heart, are super common (let's say 1,000 cases a day). Others, like a rare lung collapse, are incredibly rare (maybe only 2 cases a day). This is what scientists call a "Long-Tailed" problem.

If you just teach a computer to be "average," it will get really good at spotting the common stuff but will completely ignore the rare stuff because it barely sees any examples of it. In medicine, missing a rare but deadly disease is a disaster.

This paper is a report card on how the author, Nikhileswara Rao Sulake, tried to fix this problem to win a global competition (the CXR-LT 2026 Challenge). Here is the story of what they did, explained simply.

1. The Problem: The "Popular Kid" vs. The "Quiet Kid"

Think of the diseases as students in a classroom.

  • The Head Classes (Popular Kids): These are the common diseases. They raise their hands all the time. The computer learns them easily.
  • The Tail Classes (Quiet Kids): These are the rare diseases. They sit in the back and rarely speak up. If the computer only listens to the "Popular Kids," it will never learn the "Quiet Kids" exist.

The challenge was to make the computer care just as much about the quiet kids as the popular ones.

2. The Solution: Three Tools in the Toolbox

The author tested three main things to fix the imbalance:

A. The "Teacher's Grading System" (Loss Functions)

In school, if a student gets an easy question right, they get a small reward. If they get a hard question right, they get a huge reward.

  • Old Way: The computer graded every disease the same. Because the rare ones show up so seldom, they barely affected the grade, so the computer effectively ignored them.
  • The New Way (LDAM-DRW): The author used a special "grading system" called LDAM-DRW (Label-Distribution-Aware Margin loss with Deferred Re-Weighting).
    • Phase 1: At first, the computer studies everything normally, except that the rare diseases already carry a bigger "safety margin": the computer must be extra confident before it gets credit for them.
    • Phase 2: Once it knows the basics, the teacher says, "Okay, now we are going to focus on the quiet kids." Mistakes on the rare diseases start costing extra points.
    • Result: This was the biggest game-changer. It forced the computer to pay attention to the rare stuff. (A code sketch of the idea follows this list.)
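
For readers who want to peek under the hood, here is a minimal PyTorch sketch of the idea. LDAM was originally defined for single-label softmax classification, so treat this as an illustrative multi-label adaptation: the per-class margins follow the n^(-1/4) rule from the LDAM paper, while the max_margin and drw_epoch values and the simple inverse-frequency re-weighting are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelLDAMDRW(nn.Module):
    """Sketch of an LDAM-style margin loss for multi-label classification
    with deferred re-weighting (DRW). Illustrative, not the paper's code."""

    def __init__(self, class_counts, max_margin=0.5, drw_epoch=30):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # LDAM: rarer classes get larger margins, m_j proportional to n_j^(-1/4).
        margins = counts.pow(-0.25)
        self.register_buffer("margins", max_margin * margins / margins.max())
        # DRW: simple inverse-frequency class weights (an assumption; the
        # original DRW paper uses "effective number" class-balanced weights).
        weights = counts.sum() / counts
        self.register_buffer("drw_weights", weights / weights.mean())
        self.drw_epoch = drw_epoch

    def forward(self, logits, targets, epoch):
        # Subtract the margin from the logit only where the label is positive,
        # so the model must be extra confident about rare positives.
        adjusted = logits - targets * self.margins
        loss = F.binary_cross_entropy_with_logits(adjusted, targets, reduction="none")
        if epoch >= self.drw_epoch:
            # Phase 2 (deferred re-weighting): mistakes on rare classes cost more.
            loss = loss * self.drw_weights
        return loss.mean()

# Usage sketch: the per-disease counts here are hypothetical.
criterion = MultiLabelLDAMDRW(class_counts=[50_000, 12_000, 800, 45], drw_epoch=30)
logits = torch.randn(8, 4)                    # batch of 8, 4 diseases
targets = torch.randint(0, 2, (8, 4)).float()
loss = criterion(logits, targets, epoch=35)   # past drw_epoch: re-weighting active
```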

B. The "Brain Power" (Architecture)

Even the best grading system needs a good brain behind it, so the author also compared neural-network architectures.

  • Old Brains (ResNet, DenseNet): These are like a standard pair of glasses. They work okay, but they struggle with complex, rare patterns.
  • New Brains (ConvNeXt): The author tried a newer, more powerful design called ConvNeXt. Think of this as upgrading from a bicycle to a high-tech sports car. It has more "muscle" (parameters) and a better design to understand complex images.
  • Result: The "Sports Car" (ConvNeXt-Large) won the race. It was the best at spotting the rare diseases on its own. (A minimal backbone-swap sketch follows this list.)
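
In code, the "brain upgrade" is often a one-line swap. Below is a minimal sketch using the timm model zoo, assuming a 30-disease multi-label head; the backbone names are real timm identifiers, but the checkpoints and input pipeline here are illustrative, not the author's exact setup.

```python
import timm
import torch

NUM_DISEASES = 30  # the benchmark's label count

# "Standard glasses": a classic baseline backbone.
baseline = timm.create_model("densenet121", pretrained=True, num_classes=NUM_DISEASES)

# "Sports car": the larger, more modern backbone that won out.
strong = timm.create_model("convnext_large", pretrained=True, num_classes=NUM_DISEASES)

x = torch.randn(1, 3, 224, 224)   # dummy X-ray, replicated to 3 channels
logits = strong(x)                # shape (1, 30): one logit per disease
probs = torch.sigmoid(logits)     # independent per-disease probabilities
```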

C. The "Second Opinion" (Post-Training Strategies)

Even with a good brain and a good grading system, the computer sometimes hesitates.

  • Re-training the Head: The computer first learns what the diseases look like (the "body"), then you swap out its final decision-making layer (the "head") and retrain just that part on a more balanced diet of cases, so the rare diseases get equal airtime. This helped it make better final decisions.
  • Test-Time Augmentation (TTA): This is like looking at the X-ray, then flipping it upside down, rotating it slightly, and looking again. If the computer sees the disease in all those different angles, it becomes more confident.
  • Ensembling: This is like asking three different doctors for their opinion and averaging their answers. It improved the ranking (which diseases are most likely) but didn't always help with the individual details. (Minimal sketches of all three tricks follow this list.)
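
Here is what the three tricks can look like in PyTorch. All of it is a sketch under assumptions: the head attribute name varies by architecture, the TTA set is deliberately tiny (just a horizontal flip), and plain probability averaging stands in for whatever weighting the author actually used.

```python
import torch
import torch.nn as nn

def retrain_classifier(model, head_attr="fc"):
    """Classifier re-training (cRT): freeze the feature extractor (the "body"),
    re-initialize the final linear layer (the "head"), and train only that,
    typically on class-balanced batches. `head_attr` is "fc" for ResNets;
    other architectures keep their head elsewhere."""
    for p in model.parameters():
        p.requires_grad = False
    old_head = getattr(model, head_attr)
    setattr(model, head_attr, nn.Linear(old_head.in_features, old_head.out_features))
    return model  # only the fresh head has requires_grad=True

@torch.no_grad()
def predict_tta(model, image):
    """Test-time augmentation: average sigmoid outputs over the image and
    its horizontal flip."""
    views = [image, torch.flip(image, dims=[-1])]
    return torch.stack([torch.sigmoid(model(v)) for v in views]).mean(dim=0)

@torch.no_grad()
def predict_ensemble(models, image):
    """Ensembling: average the TTA'd probabilities of several models,
    like polling several doctors and averaging their opinions."""
    return torch.stack([predict_tta(m, image) for m in models]).mean(dim=0)
```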

3. The Results: A Bumpy Ride

The author entered a competition with 68 teams.

  • The Good News: Their system was 5th best in the world! They beat almost everyone else.
  • The Bad News: There was a gap between their practice tests and the real exam.
    • In practice, on the development set, they were great (about a 0.52 score).
    • On the real, hidden test set, their score dropped (0.39 mAP).
    • Why? The computer was really good at ranking diseases (saying "this one is more likely than that one"), but it was bad at deciding whether a disease was actually present or not (the "Yes/No" decision). It was like a student who knows the answers but forgets to circle the right bubble on the test sheet. (A toy numeric example of this gap follows.)
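
The gap is easy to reproduce with made-up numbers: below, a model ranks every sick patient above every healthy one (perfect average precision, the mAP-style metric), yet every probability sits below the usual 0.5 "Yes" threshold, so every binary decision is wrong.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Toy example: two sick patients (1) and three healthy ones (0).
y_true = np.array([1, 1, 0, 0, 0])
# The ranking is perfect, but the model is timid: nothing crosses 0.5.
y_score = np.array([0.40, 0.35, 0.30, 0.20, 0.10])

print(average_precision_score(y_true, y_score))          # 1.0 (perfect ranking)
print(f1_score(y_true, y_score > 0.5, zero_division=0))  # 0.0 (no "Yes" ever made)
```

A high mAP therefore says nothing about where the scores sit relative to the decision threshold, which is exactly the calibration problem the author ran into.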

4. The Takeaway for the Real World

This paper teaches us two big lessons for the future of AI in hospitals:

  1. Don't ignore the rare stuff: If you want AI to work in real medicine, you must use special techniques (like LDAM-DRW) that force the AI to care about rare diseases.
  2. Confidence isn't enough: Just because an AI can rank diseases correctly doesn't mean it can diagnose them correctly. We need to teach the AI to be more confident and accurate in its "Yes/No" decisions, not just its "Maybe" guesses.

In a nutshell: The author built a smart computer that learned to listen to the quiet students in the classroom. It became one of the best in the world at spotting rare lung diseases, but it still needs a little more training to stop second-guessing itself when it's time to make the final call.