Routing without Forgetting

Imagine you are a master chef running a busy kitchen. In the world of Artificial Intelligence (AI), specifically Vision Transformers (the "chefs" that recognize images), there's a big challenge called Continual Learning.

Here is the problem: Usually, a chef learns to cook Italian food. Then, they are suddenly asked to learn Thai food. If they try to learn Thai, they might accidentally forget how to make pasta. This is called "catastrophic forgetting."

Most current AI chefs try to solve this by hiring specialized sous-chefs (called prompts or adapters).

The Old Way: When you need Italian, you call the "Italian Sous-Chef." When you need Thai, you call the "Thai Sous-Chef."
The Problem: In the real world (called Online Learning), orders come in a fast, chaotic stream. You might get an Italian dish, then a Thai dish, then a French dish, all in one second. You only get to see each order once. There isn't enough time to train a new sous-chef for every single order, and you can't remember every single recipe perfectly. The "sous-chefs" get confused, and the kitchen slows down.

Enter: "Routing without Forgetting" (RwF)

The authors of this paper propose a brilliant new way to run the kitchen. Instead of hiring new sous-chefs, they give the Head Chef (the main AI model) a magical, instant mental map.

The Magic Metaphor: The "Smart Librarian"

Imagine the Head Chef has a small mental library of reference patterns learned from experience (not a stored copy of every past order).

Old AI: When a new order comes in, the chef has to flip through a physical index card (a "prompt") to find the right recipe. If the card is wrong, they have to rewrite it slowly over time.
RwF (The New Way): The chef doesn't look at cards. Instead, they have a super-fast, magical librarian living inside their brain.
1. The chef looks at the new order (the image).
2. The librarian instantly scans the current ingredients on the counter and says, "Hey! This order looks a lot like the 'Spicy Curry' we made 5 minutes ago, but with a hint of 'Pasta'."
3. The chef instantly mixes the right mental state to handle this specific mix of ingredients.

This "librarian" is based on something called Modern Hopfield Networks. In simple terms, it's a mathematical way of saying: "Look at what you have right now, compare it against your learned reference patterns, and blend the best-matching pieces together instantly."

Why is this a game-changer?

1. No More "Training" for Every New Order
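In the old method, the AI had to study the new Thai dish for a while to "specialize" its Thai sous-chef. In RwF, the chef doesn't need to study before deciding where to focus. The routing of attention is computed on the spot, while the slower learning of the few trainable parts continues in the background.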

Analogy: It's like a GPS. You don't need to rebuild the road every time you drive to a new place; the GPS just instantly calculates the best route based on where you are right now.

2. It Works in the "One-Shot" Chaos
The paper tests this in a "strict online" setting. Imagine a conveyor belt of images moving so fast you only see each one for a split second.

Old AI: Gets overwhelmed. It tries to learn slowly, but the belt moves too fast. It forgets the first item by the time it learns the second.
RwF: Because the "routing" happens instantly in a single step (like a reflex), the chef adapts immediately. Even if the stream of orders changes from Italian to Thai to Sushi in a blink, the chef's internal focus shifts smoothly without panic.

3. No Memory Bloat
Old methods often need to save a "replay buffer" (a list of past orders to review later) or keep thousands of tiny specialized modules.

RwF: It keeps the kitchen small. It doesn't store extra recipes. It just changes how it uses the existing tools based on the current situation. It adds very little extra weight to the chef's brain (only about 2% more parameters).

The Results: The "Super Chef"

The researchers tested this on standard image benchmarks: Split-CIFAR-100 and two ImageNet-derived datasets, Split-ImageNet-R and Split-ImageNet-S.

The Score: RwF came out ahead of strong baselines on the two ImageNet-derived benchmarks (74.09% and 61.37% final accuracy) and was competitive, though not the best, on CIFAR-100.
The Few-Shot Test: Even when the chef was given far fewer examples of a new dish (as little as 20% of the usual training data), RwF kept performing well, while other chefs started to fail.

Summary

Routing without Forgetting is like giving an AI a superpower of instant adaptability. Instead of trying to memorize every new task by building new rooms in its house, it learns to instantly rearrange its furniture to fit the new situation. It's a smoother, faster, and more efficient way for AI to learn continuously while forgetting far less of what it already knows.

In a nutshell:

Old Way: "I need a new tool for this job. Let me build it slowly." (Too slow for real-time).
RwF Way: "I have all the tools. I'll just instantly grab the right combination for this specific moment." (Fast, smooth, and far less forgetful).

Here is a detailed technical summary of the paper "Routing without Forgetting (RwF)".

1. Problem Statement

The paper addresses the challenges of Online Continual Learning (OCL) within Vision Transformers (ViTs).

The Setting: In OCL, data arrives as a non-stationary stream where each sample is observed only once (single-pass). There are no explicit task identifiers at inference time, and the model must discriminate among all classes seen so far (Class-Incremental Learning).
The Limitation of Current Methods: Existing parameter-efficient adaptation methods (e.g., Prompt Tuning, Adapters, LoRA) rely on iterative gradient-based specialization. They learn task-specific parameters (prompts or low-rank matrices) over multiple epochs or repeated exposures.
The Core Issue: In strict OCL, data is never revisited. Consequently, gradient-driven specialization is too slow to adapt to distribution shifts before the next task arrives. This leads to catastrophic forgetting because the model cannot reconfigure its internal representations fast enough without repeated optimization steps.
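
The single-pass, class-incremental protocol described above can be sketched in a few lines. This is only an illustration of the data flow: the names `make_stream` and `RunningMeanClassifier` and the toy 2-D features are hypothetical, and the placeholder learner is a simple running-class-mean classifier, not RwF or any baseline from the paper.

```python
import random


def make_stream(tasks, seed=0):
    """Yield (x, y) pairs task by task; each sample appears exactly once."""
    rng = random.Random(seed)
    for task in tasks:                 # tasks arrive one after another (non-stationary)
        samples = list(task)
        rng.shuffle(samples)           # order inside a task is arbitrary
        for x, y in samples:
            yield x, y                 # single pass: a sample is never yielded again


class RunningMeanClassifier:
    """Placeholder learner (NOT RwF): keeps one running mean per class."""

    def __init__(self):
        self.means = {}    # class label -> mean feature vector
        self.counts = {}   # class label -> number of samples seen so far

    def update(self, x, y):
        n = self.counts.get(y, 0)
        mean = self.means.get(y, [0.0] * len(x))
        self.means[y] = [(m * n + xi) / (n + 1) for m, xi in zip(mean, x)]
        self.counts[y] = n + 1

    def predict(self, x):
        # No task identifier is used: the choice is among ALL classes seen so far.
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.means[label]))
        return min(self.means, key=squared_distance)


# Two toy tasks with disjoint classes (hypothetical data).
task_a = [([0.0, 0.1], "cat"), ([0.1, 0.0], "cat"), ([1.0, 1.1], "dog")]
task_b = [([5.0, 5.1], "car"), ([5.1, 4.9], "car"), ([9.0, 0.0], "bus")]

model = RunningMeanClassifier()
seen = 0
for x, y in make_stream([task_a, task_b]):
    model.update(x, y)   # exactly one update per sample, then the sample is gone
    seen += 1

# Class-incremental evaluation: classes from both tasks compete at inference time.
predictions = [model.predict(x) for x, _ in task_a + task_b]
```

The point of the sketch is the loop structure: the learner gets one look at each sample, and evaluation covers every class seen so far without being told which task an input came from.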

2. Methodology: Routing without Forgetting (RwF)

The authors propose reframing continual learning not as a parameter storage problem, but as a dynamic routing problem. Instead of storing task-specific parameters, the model dynamically selects the appropriate representational subspace for each input within a single forward pass.

Core Architecture

RwF augments the standard Transformer backbone with Energy-Based Associative Retrieval Layers inspired by Modern Hopfield Networks.

Hopfield Pooling: Before the self-attention block in selected transformer layers, a HopfieldPooling module performs a "many-to-few" mapping. It compresses the current sequence of $L$ token embeddings into a small set of $m$ input-conditioned routing prompts ($P_\ell$).
Associative Retrieval: The routing prompts are generated via a closed-form minimization of a strictly convex free-energy functional.
- Mathematically, the routing matrix $A_\ell$ is computed as a softmax over the similarity between learnable query vectors and the input token keys.
- The retrieved prompts $P_\ell$ are convex combinations of the input features, weighted by their compatibility with the current input geometry.
Forward Pass Flow:
1. Input tokens $Z_\ell$ enter the layer.
2. HopfieldPooling generates prompts $P_\ell$ based on $Z_\ell$.
3. $P_\ell$ and $Z_\ell$ are concatenated and processed by the standard Multi-Head Self-Attention (MHSA).
4. Only the updated backbone tokens $\tilde{Z}_\ell$ are passed to the next layer; the updated prompts $\tilde{P}_\ell$ are discarded.
Key Design Choices:
- No Task Identifiers: Routing is purely input-conditioned.
- No Replay Buffer: The method is buffer-free.
- Fixed Projections: The projection matrices ($W_K, W_V$) used for retrieval are kept frozen (untrained). This keeps the similarity space stationary, so the routing mechanism does not drift as the backbone updates.
- Analytical Adaptation: Routing decisions are computed analytically in a single step, decoupling representation selection from the slower timescale of gradient descent.
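
The forward pass described above can be sketched as follows. This is a minimal, dependency-free illustration under several assumptions that are not taken from the paper: the helper names (`hopfield_pooling`, `rwf_layer`, `self_attention`), the toy sizes, the random initialisation, and a simplified single-head attention with identity projections standing in for MHSA. Residual connections, layer normalisation, and the MLP block are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(scores, beta=1.0):
    """Numerically stable softmax with inverse temperature beta."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def hopfield_pooling(Z, Q, W_K, W_V, beta=1.0):
    """Map L tokens (rows of Z) to m input-conditioned routing prompts."""
    K = matmul(Z, W_K)                           # L x d keys (frozen projection)
    V = matmul(Z, W_V)                           # L x d values (frozen projection)
    scores = matmul(Q, transpose(K))             # m x L query-key similarities
    A = [softmax(row, beta) for row in scores]   # routing matrix, each row sums to 1
    P = matmul(A, V)                             # m x d prompts: convex combinations of projected tokens
    return P, A


def self_attention(X):
    """Simplified single-head attention with identity projections (stand-in for MHSA)."""
    d = len(X[0])
    scores = matmul(X, transpose(X))
    weights = [softmax(row, 1.0 / math.sqrt(d)) for row in scores]
    return matmul(weights, X)


def rwf_layer(Z, Q, W_K, W_V, beta=1.0):
    P, A = hopfield_pooling(Z, Q, W_K, W_V, beta)
    mixed = self_attention(P + Z)   # concatenate prompts and tokens along the sequence axis
    Z_next = mixed[len(P):]         # keep updated tokens, discard updated prompts
    return Z_next, A


rng = random.Random(0)
L, d, m = 6, 4, 2                   # toy sizes (assumptions)
Z = random_matrix(L, d, rng)        # input tokens
Q = random_matrix(m, d, rng)        # learnable queries (only initialised here)
W_K = random_matrix(d, d, rng)      # frozen: never updated in this sketch
W_V = random_matrix(d, d, rng)      # frozen: never updated in this sketch

Z_next, A = rwf_layer(Z, Q, W_K, W_V, beta=2.0)
```

Note that the output has the same shape as the input tokens: the prompts influence the tokens through attention and are then dropped, so the sequence length seen by the next layer does not grow.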

Theoretical Foundation

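The method leverages the variational interpretation of Modern Hopfield Networks. The retrieval process minimizes a free-energy functional $F(p) = -\sum_i p_i \langle \tilde{q}, k_i \rangle + \beta^{-1}H(p)$ over routing weights $p$, where $\tilde{q}$ is a query, $k_i$ is the key of the $i$-th input token, and $\beta$ is an inverse temperature.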
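
A short derivation sketch shows why this minimization has a closed form. It rests on conventions that are assumptions here, not quoted from the paper: $p$ ranges over the probability simplex, $H(p) = \sum_i p_i \log p_i$ denotes the negative Shannon entropy (the sign convention under which $F$ is strictly convex), and $s_i = \langle \tilde{q}, k_i \rangle$ is shorthand for the query-key similarity. Adding a Lagrange multiplier $\lambda$ for the constraint $\sum_i p_i = 1$ gives

$$
\mathcal{L}(p, \lambda) = -\sum_i p_i s_i + \beta^{-1} \sum_i p_i \log p_i + \lambda \Big( \sum_i p_i - 1 \Big).
$$

Setting the derivative with respect to each $p_i$ to zero yields

$$
\frac{\partial \mathcal{L}}{\partial p_i} = -s_i + \beta^{-1} (\log p_i + 1) + \lambda = 0
\quad \Longrightarrow \quad
p_i \propto \exp(\beta s_i),
$$

and normalizing gives

$$
p_i^\star = \frac{\exp(\beta s_i)}{\sum_j \exp(\beta s_j)}.
$$

Under these assumptions the Hessian of $F$ is diagonal with entries $\beta^{-1} / p_i > 0$ on the interior of the simplex, so the minimizer is unique, and it coincides with a softmax over the query-key similarities with inverse temperature $\beta$.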

Plasticity: The alignment term encourages concentrating on tokens compatible with the current input.
Stability: The entropy term prevents degenerate one-hot assignments, ensuring smooth transitions.
Smoothness: Because the routing operator is continuous and Lipschitz with respect to input features, small distribution shifts result in proportional, smooth changes in routing weights, mitigating abrupt representational jumps that cause forgetting.
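
The stability and smoothness properties can be checked numerically on a toy example. The sketch below assumes the negative-entropy convention $H(p) = \sum_i p_i \log p_i$ and uses hypothetical helper names and random scores. It compares the free energy at the softmax weights with random points on the simplex, and then measures how much the routing weights move under a small perturbation of the similarities. The bound by $\beta$ in Euclidean norm is a standard property of the softmax map, used here for illustration; it is not the paper's stated constant.

```python
import math
import random


def softmax(scores, beta):
    """Numerically stable softmax with inverse temperature beta."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy(p, scores, beta):
    """F(p) = -sum_i p_i s_i + (1/beta) * sum_i p_i log p_i (negative-entropy convention)."""
    alignment = -sum(pi * si for pi, si in zip(p, scores))
    neg_entropy = sum(pi * math.log(pi) for pi in p if pi > 0)
    return alignment + neg_entropy / beta


def random_simplex_point(n, rng):
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
beta = 2.0
scores = [rng.uniform(-1, 1) for _ in range(8)]   # toy query-key similarities

# Minimizer check: softmax weights should have the lowest free energy.
p_star = softmax(scores, beta)
f_star = free_energy(p_star, scores, beta)
f_random_best = min(
    free_energy(random_simplex_point(8, rng), scores, beta) for _ in range(1000)
)

# Closed-form value of the minimum: -(1/beta) * log(sum_i exp(beta * s_i)).
f_closed_form = -math.log(sum(math.exp(beta * s) for s in scores)) / beta

# Smoothness check: a small shift in similarities gives a small shift in weights.
shifted = [s + rng.uniform(-0.01, 0.01) for s in scores]
weight_change = l2(softmax(shifted, beta), p_star)
score_change = l2(shifted, scores)
```

Lowering `beta` flattens the routing weights and makes them less sensitive to input changes, while raising it sharpens them; this is one way to read the plasticity and stability terms as a single trade-off.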

3. Key Contributions

Paradigm Shift: Recasts OCL in transformers as an associative routing problem rather than a parameter specialization problem.
Architectural Innovation: Introduces HopfieldPooling layers directly into the transformer backbone to enable closed-form, input-conditioned routing without external memory buffers or task IDs.
Parameter Efficiency: Achieves state-of-the-art results on Split-ImageNet-R and Split-ImageNet-S with only ~2.13% additional trainable parameters (comparable to LoRA/Adapters), avoiding the need for dual backbones or large prompt pools.
Theoretical Guarantee: Demonstrates that routing decisions correspond to the equilibrium of a strictly convex energy functional, ensuring unique, stable, and smooth adaptation in a single forward pass.

4. Experimental Results

The authors evaluated RwF on three Class-IL benchmarks: Split-CIFAR-100, Split-ImageNet-R, and Split-ImageNet-S, under strict single-pass protocols.

Overall Performance:
- Split-ImageNet-R: RwF achieved 74.09% Final Average Accuracy ($A_{\text{Final}}$), well ahead of strong baselines such as DualPrompt (60.88%), CODA-Prompt (66.16%), and Online-LoRA (48.18%).
- Split-ImageNet-S: RwF achieved 61.37% $A_{\text{Final}}$, surpassing all prompt-based and adapter-based methods (e.g., EASE at 55.89%).
- Split-CIFAR-100: RwF achieved 82.48%, close to but below the top methods (EASE: 84.81%, DualPrompt: 83.58%). The authors note that on lower-resolution datasets, the advantage of dynamic feature reallocation is slightly reduced due to less rich feature geometry.
Few-Shot Robustness:
- In regimes with reduced training data (down to 20% of samples), RwF maintained superior performance (62.29% at 20% data on ImageNet-R) compared to prompt-based methods, which suffered sharp declines. This supports the claim that input-conditioned routing does not rely on repeated gradient updates to stabilize.
Scalability:
- As the number of sequential tasks increased (from 5 to 40), RwF maintained a consistent margin over baselines, demonstrating better scalability under frequent distribution shifts.
Ablation Studies:
- Routing Depth: Inserting HopfieldPooling layers in the early blocks (First-k) yielded the best trade-off. The default configuration ($k=3$) provided strong performance with minimal parameter overhead.
- Frozen Projections: Keeping the retrieval projection matrices fixed was crucial for stability; training them caused drift and degraded performance.

5. Significance and Conclusion

Routing without Forgetting offers a principled approach to the stability-plasticity dilemma in Online Continual Learning.

Decoupling Speed: By computing routing analytically via energy minimization, the model adapts its internal representation flow immediately to new data, independent of the slow convergence of gradient-based parameter updates.
Structural Stability: The continuous, smooth nature of the associative operator prevents the catastrophic forgetting often caused by abrupt shifts in attention patterns.
Implication: The paper suggests that stability in continual learning can emerge from architectural mechanisms that smooth representation flow, rather than relying solely on gradient constraints, replay buffers, or explicit task partitioning. This offers a new direction for designing efficient, robust transformers for streaming data environments.
