Task-Agnostic Continual Learning for Chest Radiograph Classification

Imagine a hospital's AI system as a brilliant, overworked radiologist named "Dr. X-Ray."

The Problem: The "All-or-Nothing" Dilemma

In the past, if Dr. X-Ray needed to learn a new way of reading X-rays (say, from a different hospital with slightly different cameras or labeling habits), the old way of doing things was to wipe his memory clean and start over.

The Old Way: You'd show him 10,000 new X-rays, and he'd relearn everything from scratch. He might get really good at the new style, but he'd forget how to read the old ones perfectly. Or, you'd have to show him every single X-ray from the last 10 years every time he learned something new, which is impossible because of privacy laws and storage limits.

The Solution: The "Specialized Interns" System (CARL-XRay)

The authors of this paper propose a smarter way called CARL-XRay. Instead of retraining the whole doctor, they keep the main doctor's brain (the "Backbone") frozen and unchanged. This brain is already a master at seeing bones, lungs, and shadows.

When a new type of X-ray dataset arrives, they don't retrain the brain. Instead, they hire a tiny, specialized intern (called an "Adapter") just for that specific job.

The Backbone: The senior doctor who never forgets the basics.
The Adapters: A team of specialized interns. Intern A knows how to read "Hospital A's" X-rays. Intern B knows "Hospital B's." They are small, cheap to train, and don't mess with the senior doctor's brain.

The Big Challenge: "Who Am I Talking To?"

Here is the tricky part: In a real hospital, when a new X-ray comes in, the computer doesn't know which hospital it came from. It's like a patient walking in without a name tag.

If the computer guesses wrong and sends the X-ray to "Intern A" (who only knows Hospital A), but the X-ray is actually from Hospital B, the diagnosis will be wrong.
The system needs a Traffic Cop (called a "Latent Task Selector") to look at the X-ray and say, "Ah, this looks like it belongs to Intern B's group. Send it there!"

How They Keep the Traffic Cop Honest

A central risk in continual learning is catastrophic forgetting. As the Traffic Cop learns to recognize Hospital B, it might start forgetting what Hospital A looks like.

To fix this, the researchers use a clever trick called Feature-Level Experience Replay:

Instead of storing thousands of actual X-ray images (which privacy rules often prohibit and storage limits make impractical), the system saves tiny, compressed "snapshots" of what the interns saw when they learned their jobs.
Every time the Traffic Cop learns a new intern, it reviews these old snapshots to make sure it hasn't forgotten the old interns. It's like the Traffic Cop keeping a small, private diary of "what the interns looked like" rather than a photo album of every patient.

The Results: Why This Matters

The team tested this on two massive real-world datasets (MIMIC-CXR and CheXpert). Here is what they found:

It Doesn't Forget: When they taught the system a second task, it barely forgot the first one. The measured forgetting on the first task was only 0.012.
It's Smarter at Guessing: When they didn't tell the system which hospital the X-ray was from (the "Task Unknown" scenario), their system guessed the right intern 75% of the time.
- Comparison: A standard method that tries to learn everything at once (Joint Training) only guessed right 62.5% of the time.
It's Efficient: They only had to train a tiny fraction of the parameters (0.08%). It's like upgrading a car's navigation system by just changing the map app, rather than rebuilding the whole engine.

The Takeaway

This paper introduces a way for medical AI to grow up naturally. Instead of being a rigid system that needs a total overhaul every time new data arrives, CARL-XRay is like a flexible organization:

It keeps its core knowledge safe.
It hires small, specialized teams for new jobs.
It uses a smart traffic cop to route patients correctly, even without ID tags.
It uses a "memory diary" to ensure no one gets forgotten.

This makes it a realistic, practical solution for hospitals that need to update their AI tools over years without breaking their current systems or violating patient privacy.

1. Problem Statement

The paper addresses the critical challenge of deploying chest radiograph classifiers in real-world clinical settings. Current deep learning models typically require retraining on all historical data when new datasets become available, which is computationally expensive and often violates data privacy/governance constraints (preventing access to raw historical images).

The authors define a specific Task-Incremental Continual Learning setting with the following constraints:

Sequential Ingestion: Heterogeneous chest X-ray datasets arrive one after another.
Task-Agnostic Inference: At the time of inference, the system does not know which dataset (task) a new image belongs to (no task identifiers).
No Raw Image Replay: The system cannot store or access raw images from previous tasks due to privacy and storage limitations.
Goal: Maintain high diagnostic performance on all previously learned tasks while adapting to new ones, without catastrophic forgetting, and with minimal computational overhead.

2. Methodology: CARL-XRay

The authors propose CARL-XRay (Continual Adapter-based Routing Learning for Chest X-rays), a framework designed to handle sequential updates while preserving task identity.

A. Model Architecture

Frozen Backbone: A high-capacity Swin Transformer encoder is used as a shared backbone. Its parameters ( $\theta_\Phi$ ) remain frozen throughout the entire training process to ensure representational stability and prevent interference between tasks.
Task-Specific Modules: For each new task $k$, the model allocates:
- A lightweight Adapter ( $A_k$ ): Transforms shared features into task-adapted features.
- A Classification Head ( $H_k$ ): Produces logits for the specific label set of that task.
- Strategy: Only the parameters for the new adapter and head are trained; previous modules are frozen. This isolates task representations.
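
The allocation strategy above can be illustrated with a minimal, dependency-free sketch. The residual bottleneck adapter, the dimensions, and the names `TaskModule` and `ModulePool` are assumptions made for illustration; the summary does not specify the adapter internals, and the frozen backbone is left out because it is never updated.

```python
import random

def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

class TaskModule:
    """Adapter A_k plus head H_k for one task (bottleneck design is an assumption)."""

    def __init__(self, dim, bottleneck, num_labels, rng):
        self.w_down = init_matrix(bottleneck, dim, rng)   # adapter: dim -> bottleneck
        self.w_up = init_matrix(dim, bottleneck, rng)     # adapter: bottleneck -> dim
        self.w_head = init_matrix(num_labels, dim, rng)   # head: dim -> logits
        self.trainable = True

    def adapt(self, z):
        """Residual adapter: z + W_up * relu(W_down * z)."""
        hidden = [max(0.0, h) for h in matvec(self.w_down, z)]
        return [zi + di for zi, di in zip(z, matvec(self.w_up, hidden))]

    def logits(self, z):
        """Task-specific logits computed from the adapted features."""
        return matvec(self.w_head, self.adapt(z))

    def num_params(self):
        matrices = (self.w_down, self.w_up, self.w_head)
        return sum(len(row) for m in matrices for row in m)

class ModulePool:
    """Grows by one TaskModule per task and freezes all earlier modules."""

    def __init__(self, dim, bottleneck, seed=0):
        self.dim, self.bottleneck = dim, bottleneck
        self.rng = random.Random(seed)
        self.modules = []

    def add_task(self, num_labels):
        for module in self.modules:
            module.trainable = False  # previous adapters and heads stay fixed
        self.modules.append(TaskModule(self.dim, self.bottleneck, num_labels, self.rng))
        return self.modules[-1]

    def trainable_params(self):
        return sum(m.num_params() for m in self.modules if m.trainable)

# Toy usage: two tasks with different label sets (sizes are hypothetical).
pool = ModulePool(dim=8, bottleneck=2)
pool.add_task(num_labels=3)
pool.add_task(num_labels=4)
```

After the second call to `add_task`, only the newest module counts toward `trainable_params()`, which mirrors the isolation described above: new tasks cannot overwrite what earlier adapters and heads learned.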

B. Latent Task Selector

Since task identifiers are unavailable at inference, a Latent Task Selector ( $S$ ) is trained to infer the correct task context from the adapted features ( $\tilde{z}$ ).

Mechanism: The selector is a shared Multi-Layer Perceptron (MLP) that outputs a probability distribution over tasks.
Prototype Memory: A learnable memory matrix stores compact prototype embeddings ( $M_k$ ) for each task to guide the selector.
Feature-Level Experience Replay: To prevent the selector from forgetting previous tasks (catastrophic forgetting), a replay buffer stores a bounded set of adapted feature vectors (not raw images) from previous tasks. During training on a new task, the selector is optimized on a mixed batch of current and replayed features.
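
A bounded feature buffer and the mixed batch can be sketched as follows. This is an illustration under stated assumptions: the summary says only that the buffer is bounded, so the per-task capacity and the use of reservoir sampling to decide which features to keep are choices made here, and `FeatureReplayBuffer` and `mixed_batch` are hypothetical names.

```python
import random

class FeatureReplayBuffer:
    """Bounded store of adapted feature vectors per task; no raw images are kept."""

    def __init__(self, per_task_capacity, seed=0):
        self.capacity = per_task_capacity
        self.store = {}  # task_id -> list of feature vectors
        self.seen = {}   # task_id -> number of features offered so far
        self.rng = random.Random(seed)

    def add(self, task_id, feature):
        """Reservoir sampling (an assumption) keeps a uniform sample per task."""
        slot = self.store.setdefault(task_id, [])
        self.seen[task_id] = self.seen.get(task_id, 0) + 1
        if len(slot) < self.capacity:
            slot.append(list(feature))
        else:
            j = self.rng.randrange(self.seen[task_id])
            if j < self.capacity:
                slot[j] = list(feature)

    def sample(self, n, exclude_task=None):
        """Draw up to n (feature, task_id) pairs from the stored tasks."""
        pool = [(f, t) for t, feats in self.store.items()
                if t != exclude_task for f in feats]
        return self.rng.sample(pool, min(n, len(pool)))

def mixed_batch(current_features, current_task, buffer, replay_size):
    """Selector batch: current-task features plus replayed earlier-task features."""
    batch = [(f, current_task) for f in current_features]
    batch += buffer.sample(replay_size, exclude_task=current_task)
    return batch

# Toy usage: ten features from task 0 compete for four buffer slots.
buffer = FeatureReplayBuffer(per_task_capacity=4)
for i in range(10):
    buffer.add(0, [float(i), 0.0])
batch = mixed_batch([[1.0, 1.0], [2.0, 2.0]], current_task=1,
                    buffer=buffer, replay_size=3)
```

Each element of `batch` pairs a feature vector with its task label, which is the supervision the selector's task-prediction loss needs. Because only feature vectors are stored, the memory cost scales with the feature dimension rather than with image size.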

C. Training Objectives

Classification Loss: Masked Multi-label Binary Cross-Entropy (BCE) to handle missing labels and uncertain annotations (using soft targets for $y=-1$ ).
Orthogonality Regularizer: Encourages task-adapted features to be distinct, reducing redundancy.
Selector Loss: Cross-entropy for task prediction and a prototype consistency loss to align features with task prototypes.
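
The masked classification loss can be sketched for a single image as follows. The encoding is partly assumed: `-1` marks an uncertain label as in the summary, but representing a missing label as `None` and using 0.5 as the soft target are illustrative choices, not values taken from the paper.

```python
import math

def masked_bce(logits, labels, uncertain_target=0.5):
    """Mean binary cross-entropy over the labels that are actually present.

    labels: 1 = positive, 0 = negative, -1 = uncertain (soft target),
            None = missing (assumed encoding; skipped entirely).
    """
    total, count = 0.0, 0
    for logit, y in zip(logits, labels):
        if y is None:
            continue  # masked out: contributes neither loss nor gradient
        target = uncertain_target if y == -1 else float(y)
        # Numerically stable BCE-with-logits:
        # max(x, 0) - x * t + log(1 + exp(-|x|))
        total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
        count += 1
    return total / count if count else 0.0

# Toy usage: three findings, one of them unlabeled.
loss = masked_bce([2.0, -1.0, 0.3], [1, -1, None])
```

Averaging over present labels only, rather than over all label slots, keeps the loss scale comparable between images with many and few annotations.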

D. Inference (Task-Agnostic)

When a new image arrives without a task ID:

The image is passed through the frozen backbone.
The backbone features are processed by all available task adapters to generate task-specific feature vectors.
The Latent Task Selector evaluates these vectors to determine the most likely task.
The corresponding classification head for the predicted task generates the final diagnosis.
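
The four steps above can be sketched as a single routing function. The scoring rule is an assumption: the summary does not say how the selector combines the per-adapter feature vectors, so this sketch scores task k by the probability the selector assigns to task k when given adapter k's output. All callables in the toy usage are hypothetical stand-ins, not the trained components.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

def route_and_classify(
    image,
    backbone: Callable[[object], Vector],
    adapters: Sequence[Callable[[Vector], Vector]],
    heads: Sequence[Callable[[Vector], Vector]],
    selector: Callable[[Vector], Vector],
) -> Tuple[int, Vector]:
    """Return (predicted task index, logits from that task's head)."""
    z = backbone(image)                              # step 1: frozen backbone
    adapted = [adapter(z) for adapter in adapters]   # step 2: every adapter
    # Step 3 (assumed rule): self-consistency score for each task.
    scores = [selector(z_k)[k] for k, z_k in enumerate(adapted)]
    task = max(range(len(scores)), key=scores.__getitem__)
    return task, heads[task](adapted[task])          # step 4: routed head

def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for the trained components (purely illustrative).
backbone = lambda image: list(image)
adapters = [lambda z: list(z), lambda z: [0.5 * v for v in z]]
heads = [lambda z: [sum(z)], lambda z: [sum(z), -sum(z)]]
selector = lambda z: softmax([z[0], 0.0])

task_a, logits_a = route_and_classify([2.0, 1.0], backbone, adapters, heads, selector)
task_b, logits_b = route_and_classify([-2.0, 1.0], backbone, adapters, heads, selector)
```

Note that the cost of step 2 grows linearly with the number of tasks, since every adapter runs on every image; the backbone forward pass is shared and runs once.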

3. Key Contributions

First Task-Incremental Formulation: Introduces the first standardized evaluation protocol for task-incremental continual learning in chest radiograph classification, reflecting realistic clinical constraints (no task IDs, no raw data access).
CARL-XRay Framework: Proposes a novel architecture combining frozen backbones, isolated adapters, and a latent task selector stabilized by feature-level replay.
Efficiency: Demonstrates that the method achieves competitive performance with significantly fewer trainable parameters (only ~0.08% of the backbone) compared to full fine-tuning or joint training.
Comprehensive Evaluation: Provides a large-scale evaluation on MIMIC-CXR and CheXpert, analyzing routing accuracy, catastrophic forgetting, and the impact of adapter designs.

4. Experimental Results

The model was evaluated on two major datasets: MIMIC-CXR (Task 1) and CheXpert (Task 2).

Diagnostic Performance:
- CARL-XRay achieved an AUROC of 0.740 on Task 1 and 0.748 on Task 2 after sequential training.
- Forgetting: The forgetting metric on Task 1 was minimal (0.012), indicating strong retention of prior knowledge.
- Comparison: Performance is comparable to joint training (training on both datasets simultaneously) when task identity is known.
Task-Agnostic Routing (Critical Finding):
- CARL-XRay: Achieved 75.0% routing accuracy under task-unknown inference.
- Joint Training Baseline: Only achieved 62.5% routing accuracy.
- Insight: Joint training merges task representations, making it difficult to distinguish between datasets at inference. CARL-XRay's isolated adapters preserve distinct task boundaries, enabling reliable routing.
Ablation Studies:
- Experience Replay: Essential for success. Without feature-level replay, routing accuracy dropped to 14.3% (catastrophic forgetting of Task 1). With replay, it jumped to 75.0%.
- Adapter Design: The Continuum adapter (multiple residual branches) outperformed Simple and Hope adapters, balancing routing accuracy (0.710) and memory usage.
- Routing Strategy: The learned selector outperformed memory-based (cosine similarity) and entropy-based routing strategies.
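
The two headline metrics can be computed as in the sketch below. The forgetting definition used here (score right after a task is learned minus score after the final task) is the common convention in the continual learning literature and is assumed to match the paper's. The inputs in the usage lines are made-up toy values, not the reported results.

```python
def forgetting(score_after_learning, score_after_final):
    """Drop in a task's score between learning it and finishing all tasks."""
    return score_after_learning - score_after_final

def routing_accuracy(predicted_tasks, true_tasks):
    """Fraction of inputs sent to the adapter and head of their true task."""
    if len(predicted_tasks) != len(true_tasks) or not true_tasks:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == t for p, t in zip(predicted_tasks, true_tasks))
    return hits / len(true_tasks)

# Toy usage with hypothetical values.
toy_forgetting = forgetting(0.80, 0.78)
toy_routing = routing_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```

A forgetting value near zero means the earlier task's score is essentially unchanged after later training, while routing accuracy measures only the selector, independently of how well each head classifies.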

5. Significance

This work offers a practical path toward the clinical deployment of AI. It addresses the "data silo" problem, in which hospitals cannot share raw patient data for retraining. By enabling models to learn sequentially from new institutions without forgetting old knowledge or requiring raw image storage, CARL-XRay offers a scalable, privacy-compliant, and computationally efficient pathway for maintaining up-to-date diagnostic AI systems in dynamic healthcare environments. The reported results indicate that task-aware routing is feasible and that, when task identity is unknown at inference, it routes more accurately than joint training (75.0% versus 62.5%).