Imagine you have a brilliant, world-class chef (the Foundation Model) who has spent years learning to cook thousands of different dishes. This chef knows the basics of chopping, sautéing, and seasoning perfectly.
Now, imagine you want to teach this chef a few new, very specific recipes (like "Vegan Sushi" or "Spicy Tacos") without making them forget how to cook their original thousands of dishes.
This is the problem of Continual Learning. If you just tell the chef to "learn these new recipes," they might get confused and start putting soy sauce in their pasta (this is called Catastrophic Forgetting). If you try to be too careful and not let them change anything, they won't be able to learn the new recipes at all (this is the Stability-Plasticity Dilemma).
Most current methods try to solve this by either:
- Giving the chef a new apron for every recipe (Prompts): This is safe, but the chef might get confused about which apron to wear for which dish.
- Adding a whole new kitchen station for every recipe (Adapters): This works well, but it takes up a massive amount of space in the kitchen and is very expensive to build.
The Solution: TOSCA (The "Smart Tasting Spoon")
The authors of this paper propose a new, much simpler way called TOSCA.
Here is how it works, using a simple analogy:
1. The "Ventral Stream" vs. The "Prefrontal Cortex"
The paper draws inspiration from the human brain.
- The Ventral Stream (The Chef's Muscle Memory): This is the part of the brain that handles stable, unchanging facts (like "how to hold a knife"). The paper says we should leave the Foundation Model's core layers alone, just like we don't re-teach a chef how to hold a knife every time they learn a new dish.
- The Prefrontal Cortex (The Decision Maker): This is the part of the brain that makes the final choice based on the current situation.
2. The "LuCA" Module (Learn and Calibrate)
Instead of building a whole new kitchen, TOSCA installs a tiny, smart device right at the very end of the cooking line, just before the food is served. This device is called LuCA. It has two parts:
- The Adapter (The Adjuster): This is like a small spoon that adds a tiny bit of extra spice or sauce specifically for the new dish. It makes small changes to the food.
- The Calibrator (The Taster): This is a smart taster who checks the food. If the "Adjuster" added too much spice, the "Taster" says, "Whoa, dial it back." If the food is too bland, the "Taster" says, "Add a pinch more." It ensures the final flavor is perfect for this specific new recipe.
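The adapter/calibrator split above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: it assumes the adapter is a small low-rank bottleneck that proposes a tweak, and the calibrator is a sigmoid gate that scales that tweak up or down per dimension. All names (`luca`, `W_down`, `W_up`, `w_gate`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def luca(cls_token, W_down, W_up, W_gate, b_gate):
    """Hypothetical sketch of a LuCA-style module: an adapter proposes
    a small adjustment, a calibrator gates how much is applied."""
    # Adapter ("the Adjuster"): low-rank bottleneck proposing a tweak
    delta = np.maximum(cls_token @ W_down, 0.0) @ W_up
    # Calibrator ("the Taster"): sigmoid gate scaling the tweak per
    # dimension -- "dial it back" (near 0) or "add a pinch" (near 1)
    gate = 1.0 / (1.0 + np.exp(-(cls_token @ W_gate + b_gate)))
    return cls_token + gate * delta

d, r = 8, 2                          # embedding dim, bottleneck rank
W_down = rng.normal(size=(d, r)) * 0.1
W_up = rng.normal(size=(r, d)) * 0.1
W_gate = rng.normal(size=(d, d)) * 0.1
b_gate = np.zeros(d)

x = rng.normal(size=d)               # the frozen backbone's output embedding
y = luca(x, W_down, W_up, W_gate, b_gate)
print(y.shape)                       # same shape as the input: it only nudges it
```

Note that the output has the same shape as the input: the frozen backbone's representation passes through unchanged except for a small, gated correction.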
3. The "Token-Level" Trick
Here is the genius part: They only put this device on the very last plate.
In computer terms, the model processes an image through many layers. Most methods add these "Adjuster/Taster" devices to every single layer of the model. That's like stationing a taster at the pantry, the stove, the fridge, and the oven. It's wasteful and messy.
TOSCA says: "Let's just put the taster on the final plate (the [CLS] token) right before it goes to the customer."
- Why? Because by the time the food reaches the final plate, the chef has already done all the hard work. The taster just needs to make a tiny tweak to ensure it's perfect for the new order.
- The Result: You get a perfect new dish without messing up the chef's muscle memory, and you don't need to build a new kitchen.
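A back-of-envelope count shows why placing one module at the end is so much cheaper than one per layer. The numbers below are illustrative assumptions (a 12-layer, 768-dimensional ViT-B-like backbone with rank-16 adapters), not the paper's exact figures.

```python
# Illustrative parameter counts: adapters in every transformer block
# vs. a single LuCA-style module on the final [CLS] token.
# Assumed model sizes -- not the paper's exact configuration.
dim, rank, layers = 768, 16, 12

adapter_params = 2 * dim * rank            # one down- + up-projection
per_layer_total = layers * adapter_params  # an adapter in every block
single_luca_total = 2 * adapter_params     # one adapter + one calibrator

print(per_layer_total)                      # 294912
print(single_luca_total)                    # 49152
print(per_layer_total / single_luca_total)  # 6.0
```

Even under these rough assumptions, putting the device only on the final token cuts the added parameters by several times, and the gap widens with deeper backbones.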
Why is this a Big Deal?
- It's Tiny: TOSCA uses about 8 times fewer parameters (memory space) than other methods. It's like adding a single spice jar instead of a whole new pantry.
- It's Fast: Because it's so small, it trains and runs incredibly fast.
- It Doesn't Forget: By only tweaking the very end of the process, the model remembers all its old skills perfectly while learning new ones.
- No "Cheat Sheets": At test time, the model isn't told "This is a sushi order." It just looks at the input, tries each of the "tasters" it has learned, and picks the one that produces the most confident prediction (lowest uncertainty).
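The test-time selection described above can be sketched with a standard uncertainty measure. This is a hedged illustration, not the paper's exact criterion: it assumes uncertainty is measured as Shannon entropy over each module's softmax output, and the probability values are made up.

```python
import math

def entropy(probs):
    """Shannon entropy of a distribution: lower = more confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical softmax outputs from three task-specific modules run
# on the same test image (no task label is given at test time).
candidates = {
    "task_0": [0.90, 0.05, 0.05],   # sharply peaked -> confident
    "task_1": [0.40, 0.35, 0.25],   # nearly flat -> uncertain
    "task_2": [0.60, 0.30, 0.10],
}

# Pick the module whose prediction has the lowest uncertainty.
best = min(candidates, key=lambda name: entropy(candidates[name]))
print(best)  # task_0
```

The module that "recognizes" the input produces a sharply peaked distribution, so its entropy is lowest and it wins the selection without any task label.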
Summary
Think of TOSCA as a smart, tiny filter placed at the very exit of a factory. The factory (the AI model) keeps running exactly the same way it always has, producing high-quality goods. When a new product comes down the line, the filter makes a tiny, precise adjustment to ensure it meets the new specs, without ever needing to stop the factory or rebuild the machines.
It solves the problem of "learning new things without forgetting old things" by being incredibly efficient, biologically inspired, and surprisingly simple.