Imagine you have a brilliant medical student named Dr. Foundation. Dr. Foundation has spent years studying millions of 3D brain scans from every hospital in the world, and knows the anatomy of the brain better than anyone else.
Now, imagine a hospital wants to hire Dr. Foundation for two very different jobs, but they can only show the student a tiny handful of examples for each new job.
- Job A: Find and outline tumors in brain scans (Segmentation).
- Job B: Guess a patient's age based on their brain scan (Regression).
The catch? The hospital has strict privacy rules. They can't save the old brain scans to show the student later. Once the student learns Job B, they must forget the old data to make room, but they still need to be perfect at Job A.
This is the problem of Continual Learning: How do you teach a new skill without forgetting the old one?
The Three Approaches
The paper tests three different ways to train Dr. Foundation:
1. The "Rewrite the Textbook" Method (Sequential Full Fine-Tuning)
In this approach, when the student learns Job B, we force them to rewrite their entire brain (the neural network) to fit the new task.
- What happens: The student becomes an expert at guessing ages (Job B), but because they rewrote their entire brain, they completely forget how to find tumors. They might look at a tumor and say, "That's just a normal wrinkle!"
- Result: Great at the new job, terrible at the old one. This is called Catastrophic Forgetting.
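A toy sketch makes the forgetting concrete. This is plain NumPy with two made-up linear "tasks" standing in for segmentation and age regression (not the paper's actual 3D model): one shared weight vector is fully fine-tuned on Job A, then on Job B, and its Job-A error blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "tasks" sharing one set of weights -- stand-ins for
# tumor segmentation and age regression, not the paper's real model.
X = rng.standard_normal((200, 8))
w_taskA = rng.standard_normal(8)
w_taskB = rng.standard_normal(8)
yA, yB = X @ w_taskA, X @ w_taskB

def full_finetune(w, y, steps=300, lr=0.1):
    # Every parameter is updated -- "rewriting the whole brain".
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(X)
    return w

mse = lambda w, y: float(np.mean((X @ w - y) ** 2))

w = full_finetune(np.zeros(8), yA)
loss_A_before = mse(w, yA)   # tiny: the model fits Job A well

w = full_finetune(w, yB)     # now fine-tune the SAME weights on Job B
loss_A_after = mse(w, yA)    # large: Job A has been overwritten
```

Nothing about Job B's training ever looked at Job A, so the shared weights drift wherever Job B pulls them.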
2. The "Just the Glasses" Method (Sequential Linear Probing)
Here, we tell the student: "Don't change your brain at all. Just put on a new pair of glasses for Job B." In ML terms, the glasses are a small, task-specific output layer (a linear "head") trained on top of the frozen model.
- What happens: The student remembers how to find tumors perfectly because their brain never changed. However, because they didn't learn the new task deeply, they are terrible at guessing ages. They might guess everyone is 50 years old.
- Result: Great at the old job, terrible at the new one.
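Linear probing can be sketched the same way: the backbone is frozen, and only a tiny per-task head is fit on its features. Again plain NumPy with invented shapes and names, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_feat = 8, 16
W_backbone = rng.standard_normal((d_in, d_feat))  # the frozen "brain"
W_snapshot = W_backbone.copy()                    # to verify it never changes

def frozen_features(x):
    # The backbone is never updated during probing.
    return np.tanh(x @ W_backbone)

X = rng.standard_normal((200, d_in))
yA = rng.standard_normal(200)   # stand-in targets for Job A
yB = rng.standard_normal(200)   # stand-in targets for Job B

F = frozen_features(X)

# Linear probing: fit only a small least-squares head per task.
head_A, *_ = np.linalg.lstsq(F, yA, rcond=None)
preds_A = F @ head_A

head_B, *_ = np.linalg.lstsq(F, yB, rcond=None)   # learning Job B later...

# ...touches nothing shared, so Job A's predictions are bit-for-bit intact.
assert np.array_equal(F @ head_A, preds_A)
```

The upside and the downside are the same fact: because only the thin head moves, nothing is forgotten, but nothing deep is learned either.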
3. The "Specialized Sticky Notes" Method (The Paper's Solution: LoRA)
This is the method the authors propose. They keep Dr. Foundation's brain completely frozen (unchanged). Instead of rewriting the brain or just adding glasses, they attach a tiny, specialized "Sticky Note" (called a LoRA Adapter) to the student's desk for each specific job.
- How it works:
- For Job A (Tumors), they stick a blue note on the desk that says "Look for tumors here."
- For Job B (Age), they stick a red note that says "Look for age clues here."
- When the student needs to do Job A, they look at the blue note. When they need Job B, they look at the red note.
- The student's actual brain (the foundation model) never changes. The "Sticky Notes" are tiny, cheap, and easy to swap.
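In code, each "Sticky Note" is just a pair of thin matrices added on top of a frozen weight. A minimal NumPy sketch (the layer width, rank, and names below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4                              # layer width and LoRA rank (assumed)
W_frozen = rng.standard_normal((d, d))    # foundation-model weight, never trained

def new_adapter():
    # B starts at zero, so a fresh adapter is a no-op on top of W_frozen.
    return {"A": rng.standard_normal((r, d)) * 0.01, "B": np.zeros((d, r))}

adapters = {"segmentation": new_adapter(), "age_regression": new_adapter()}

def forward(x, task):
    ad = adapters[task]
    # Effective weight = frozen base + tiny low-rank, task-specific update.
    return x @ (W_frozen + ad["B"] @ ad["A"]).T

x = rng.standard_normal((1, d))
seg_before = forward(x, "segmentation")

# "Training" Job B only ever touches Job B's sticky note:
adapters["age_regression"]["B"] = rng.standard_normal((d, r))

# Job A's behaviour is untouched -- zero forgetting by construction.
assert np.array_equal(forward(x, "segmentation"), seg_before)
```

Swapping tasks is just a dictionary lookup: the expensive `W_frozen` is shared, and only the small `A`/`B` pairs differ per job.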
Why This is a Game-Changer
The paper found that the "Sticky Note" method (LoRA) is the only one that works for both jobs simultaneously:
- No Forgetting: Because the main brain is frozen, the student never forgets how to find tumors, even after learning to guess ages. The "Backward Transfer" (a measure of how much old-task performance drops after learning a new task) is exactly zero, since the weights serving the old task are never modified.
- Balanced Performance: The student becomes good enough at both jobs. They aren't the absolute best at finding tumors (compared to the method that forgets everything), but they are very good, and they are also the only method that doesn't fail completely at guessing ages.
- Efficiency: The "Sticky Notes" are incredibly small. You only need to train about 0.1% of the total parameters. It's like adding a single sentence to a 500-page book instead of rewriting the whole thing.
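Back-of-the-envelope arithmetic shows how a figure around 0.1% can arise. The width and rank below are illustrative assumptions, not the paper's exact architecture:

```python
d = 4096                  # hidden width of one adapted layer (assumed)
r = 2                     # LoRA rank (assumed)

full_per_layer = d * d            # parameters rewritten by full fine-tuning
lora_per_layer = 2 * d * r        # A is (r x d), B is (d x r)

fraction = lora_per_layer / full_per_layer
print(f"trainable fraction: {fraction:.4%}")  # roughly 0.1% of the layer
```

The ratio is 2r/d, so the thinner the adapter (small r) and the wider the layer (large d), the cheaper each new "sticky note" becomes.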
The Catch (Limitations)
Even with this clever trick, there are a few hiccups:
- The "Age Guess" Bias: When guessing ages, the model tends to be a bit too conservative, guessing that everyone is younger than they actually are. The authors suspect this is because the training data had some missing ages that were filled in with a default number (50), confusing the model slightly.
- Tumor Boundaries: While the model finds tumors well, it sometimes misses the very fine edges of the tumor. It's like a painter who gets the color right but misses the tiny details of the outline.
The Big Picture
In the real world, hospitals often add new tasks over time (e.g., "Hey, can we also detect strokes now?"). They can't keep every single old patient's data due to privacy laws.
This paper says: Don't try to retrain the whole AI. Just freeze the smart, pre-trained brain and give it a tiny, specific "cheat sheet" (LoRA) for the new task. This way, the AI stays smart, remembers everything it learned before, and learns new things quickly without needing a massive computer or a database of old patients.
In short: It's the difference between trying to memorize a new language by erasing your native tongue (bad idea) versus learning a new language by keeping your native tongue and just learning a few new phrases (brilliant idea).