Grow, Don't Overwrite: Fine-tuning Without Forgetting

The paper introduces a function-preserving expansion method that sidesteps catastrophic forgetting: pre-trained parameters are mathematically replicated and scaled so the expanded model starts out computing exactly the same function, and only the new parameters are then trained. The result is a model that learns new tasks at full fine-tuning quality while retaining its original capabilities, with the option of expanding only selected layers for computational efficiency.

Dyah Adila, Hanna Mazzawi, Benoit Dherin, Xavier Gonzalvo

Published 2026-03-10

Imagine you have a brilliant, well-read librarian named Gemma. She has spent years reading millions of books, learning everything from how to bake a cake to how to solve complex physics problems. She is a master of general knowledge.

Now, you want to hire her for a very specific job: translating ancient French poetry.

The Problem: The "Overwrite" Trap

In the world of AI, when you try to teach a smart model like Gemma a new skill, a phenomenon called Catastrophic Forgetting often happens.

Think of it like this: To learn French poetry, the old librarian tries to cram new information into her brain. But her brain is full. So, to make room for the new French words, she accidentally throws out her old knowledge.

  • She learns to translate French perfectly.
  • But suddenly, she forgets how to bake a cake.
  • She forgets how to do basic math.
  • She forgets how to tell a joke.

This is the "Catastrophic Forgetting" problem. The more she learns the new job, the more she loses her original identity.

The Old Solutions (The Flawed Fixes)

Scientists have tried two main ways to fix this, but both have big downsides:

  1. The "Brake" Method: They tell the librarian, "Don't change your brain too much!" (This is called regularization). But this is like trying to learn French while wearing heavy handcuffs. She can't learn the new job very well because she's too afraid to forget anything.
  2. The "Add a New Brain" Method: They give her a brand new, empty brain just for French, while keeping her old brain frozen. But this new brain starts with zero knowledge. It's like hiring a fresh intern who knows nothing about the library. It takes forever to train them, and it's a waste of the librarian's existing wisdom.
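
The "brake" in method 1 is usually implemented as a penalty that pulls the weights back toward their pre-trained values. Here is a minimal sketch of one common L2-style form of that penalty; the function name, numbers, and `strength` parameter are illustrative assumptions, not from this paper:

```python
def braked_loss(task_loss, weights, pretrained, strength=0.1):
    # Penalize drift from the pre-trained weights: the larger `strength`,
    # the heavier the "handcuffs" and the less the model can adapt.
    drift = sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained))
    return task_loss + strength * drift

loss = braked_loss(task_loss=2.0, weights=[0.9, -0.3], pretrained=[1.0, 0.0])
print(loss)  # task loss plus a small drift penalty (approximately 2.01)
```

Tuning `strength` is exactly the dilemma the analogy describes: too high and the model cannot learn the new task, too low and it forgets the old ones.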

The New Solution: "Grow, Don't Overwrite"

The authors of this paper came up with a clever trick called "Function-Preserving Expansion."

Instead of overwriting her old brain or giving her a blank one, they gently expand her brain to make room for the new skill without disturbing the old one.

Here is the analogy of how they do it:

1. The "Copy-Paste" Expansion

Imagine the librarian has a specific desk where she processes information.

  • Step 1: They take her existing desk setup and copy it exactly, placing a second identical desk right next to it. Now she has double the space to work.
  • Step 2: To make sure she doesn't get confused or change her output, they put a special filter on the second desk. This filter ensures that if she uses the new desk, the final result is mathematically identical to what she would have done with just the old desk.

The Magic: At the very moment they finish building this new setup, the librarian is exactly the same person she was before. She can still bake cakes, do math, and tell jokes perfectly. Nothing has changed yet.
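
The copy-and-filter idea has a concrete counterpart in how a neural network can be widened without changing its output. Below is a minimal plain-Python sketch on a toy two-layer network, not the paper's actual construction: new hidden units are appended (the second desk), but their outgoing weights (the filter) start at zero, so the expanded network provably computes the same function as before:

```python
import random

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(W1, W2, x):
    # Tiny two-layer network: y = W2 @ relu(W1 @ x)
    return matvec(W2, relu(matvec(W1, x)))

def expand(W1, W2, extra):
    # Append `extra` new hidden units. Their incoming weights can be
    # anything (here: copies of existing rows), but their OUTGOING
    # weights in W2 start at zero -- the "filter" -- so the expanded
    # network's output is identical to the original's.
    W1_new = W1 + [random.choice(W1)[:] for _ in range(extra)]
    W2_new = [row + [0.0] * extra for row in W2]
    return W1_new, W2_new

random.seed(0)
W1 = [[0.5, -0.2], [0.1, 0.8]]   # hidden x input
W2 = [[1.0, -1.0]]               # output x hidden
x = [1.0, 2.0]

before = forward(W1, W2, x)
W1b, W2b = expand(W1, W2, extra=2)
after = forward(W1b, W2b, x)
print(before, after)  # identical outputs: the expansion preserved the function
```

This is the "nothing has changed yet" moment from the analogy: the model has more capacity, but at initialization it is behaviorally the same model.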

2. The "Specialized Training"

Now, they start training her on French poetry.

  • Because she has extra space (the new desk), she can learn the new skill without throwing anything out.
  • They only train the new parts of her brain. The old parts (the original desk) remain frozen and untouched.
  • As she learns French, she uses the new space. The old space stays dedicated to her original knowledge.
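
The freezing described above amounts to masking updates: gradients may be computed everywhere, but only the new parameters are actually changed. A minimal sketch of one gradient step; the parameter names, gradient values, and learning rate are made up for illustration:

```python
def sgd_step(params, grads, trainable, lr=0.1):
    # Apply a gradient-descent update ONLY to parameters marked trainable;
    # frozen (original) parameters pass through untouched.
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

params = {"w_old": 0.8, "w_new": 0.0}
grads  = {"w_old": 0.5, "w_new": 0.5}   # both receive gradients...
trainable = {"w_new"}                    # ...but only the new part may move

updated = sgd_step(params, grads, trainable)
print(updated)  # {'w_old': 0.8, 'w_new': -0.05}
```

Because `w_old` never moves, the original behavior is preserved by construction rather than by hoping the optimizer is gentle, and only the small expanded portion of the model needs optimizer state and updates.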

The Results: The Best of Both Worlds

The paper shows that this method works incredibly well:

  • No Forgetting: The librarian learns French perfectly but never forgets how to bake a cake or do math.
  • Efficiency: You don't need to train the whole librarian. You only need to train the new "desk" you added. This saves a huge amount of computer power (about 60% less work!).
  • Modularity: If you only need her to learn a little bit of French, you only need to add a tiny bit of new space. If the task is super hard (like advanced math), you can add more space, and she gets better at it.

Why This Matters

This is a breakthrough because it solves the "Zero-Sum Game" of AI. Before, you had to choose between being a generalist (knowing everything) or a specialist (knowing one thing well).

This new method allows an AI to be both. It can grow into a specialist without ever losing its generalist soul. It's like giving a genius a new wing on their house to study a new subject, rather than forcing them to tear down their old library to make space.

In short: Instead of erasing the past to make room for the future, this method builds an addition to the house, so the family can grow without ever having to move out.