Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers

This paper introduces "dynamic weight grafting," a novel interpretability technique that reveals how fine-tuned factual knowledge is retrieved in transformers through two distinct pathways: enriching entity representations during token processing and recalling information via specific attention and feedforward mechanisms at the final prediction step.

Todd Nief, David Reber, Sean Richardson, Ari Holtzman

Published 2026-03-03

Imagine you have a very smart, well-read librarian (the Pre-trained Model) who knows a lot about old movies and actors. Then, you give them a stack of brand-new scripts about movies released yesterday and ask them to memorize the cast lists. This is Fine-tuning.

Now, the big question is: Where does the librarian actually store this new information?

  • Do they write the new actor's name on a sticky note and stick it right next to the actor's photo in the catalog? (Storing it immediately when they see the name).
  • Do they ignore the new info at first, but then, right before they answer your question, they frantically flip through their notes to find the answer? (Recalling it just in time).
  • Or do they do both?

For a long time, scientists trying to answer this used a method called "Activation Patching." Think of this like taking a snapshot of the librarian's brain at a specific moment, erasing it, and replacing it with a snapshot from a different moment. The problem? When you erase that snapshot, you accidentally wipe out all the clues the librarian was using to get there. It's like trying to figure out how a chef cooked a dish by smashing the ingredients on the counter—you can't tell which spice did what because you destroyed the process.

The New Tool: Dynamic Weight Grafting

The authors of this paper invented a new tool called Dynamic Weight Grafting. Instead of smashing the ingredients, imagine you have two identical kitchens:

  1. Kitchen A: The original, well-stocked kitchen (the Pre-trained model).
  2. Kitchen B: An identical kitchen where the chef has memorized the new movie scripts (the Fine-tuned model).

Dynamic Weight Grafting is like having a magical robot arm that can swap out specific tools in Kitchen A with tools from Kitchen B while the chef is cooking, without stopping the flow of the recipe.

  • You can swap the knife (a specific layer of the model) only when the chef is chopping the onion (the first time an actor's name appears).
  • You can swap the spice jar (a different layer) only when the chef is about to plate the dish (the very last word before answering).

By swapping these tools in and out, the researchers can see exactly which tool is responsible for remembering the new fact.
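The tool-swapping idea can be sketched in a few lines of code. This is a deliberately tiny, illustrative toy (a single position-wise linear "layer" rather than a real transformer, and every name here is made up, not from the paper's code): it keeps two versions of the same weights and, while processing a sequence token by token, uses the fine-tuned weights only at chosen positions.

```python
# Toy sketch of dynamic weight grafting. Illustrative assumptions:
# a one-layer, position-wise model; real transformers also mix
# information across positions via attention.
import numpy as np

rng = np.random.default_rng(0)

# Two versions of the "same" layer: pretrained vs fine-tuned.
W_pre = rng.normal(size=(4, 4))
W_ft = W_pre + rng.normal(scale=0.1, size=(4, 4))  # fine-tuning nudges weights

def forward(tokens, graft_positions):
    """Process tokens left to right, swapping in the fine-tuned
    weights only at the positions in graft_positions."""
    outputs = []
    for pos, x in enumerate(tokens):
        W = W_ft if pos in graft_positions else W_pre  # the "robot arm"
        outputs.append(W @ x)
    return outputs

tokens = [rng.normal(size=4) for _ in range(5)]

# Graft fine-tuned weights only at the entity token (position 1)
# and at the final token before the answer (position 4).
out = forward(tokens, graft_positions={1, 4})

# Ungrafted positions behave exactly like the pretrained model.
assert np.allclose(out[0], W_pre @ tokens[0])
assert np.allclose(out[1], W_ft @ tokens[1])
```

Because the grafted positions still feed their outputs into the rest of the computation, the "flow of the recipe" is never interrupted, which is exactly what activation patching destroys.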

The Big Discovery: Two Paths to the Answer

Using this "tool-swapping" method, the researchers found that the librarian uses two distinct pathways to remember new facts:

1. The "Enrichment" Path (The Sticky Note)

When the librarian first sees the name "Zendaya" in a sentence, they immediately update their mental file on Zendaya. They attach the new fact ("She co-starred with Timothée Chalamet") right then and there.

  • Analogy: It's like writing a new fact on a sticky note and sticking it to the actor's photo the moment you see their name. Later, when you ask a question, the librarian just looks at the photo, sees the sticky note, and answers.

2. The "Recall" Path (The Last-Minute Search)

Sometimes, the librarian doesn't update the photo at all. Instead, they wait until the very end of the sentence, right before they have to speak. At that exact moment, they do a quick mental search to pull the fact out of their memory.

  • Analogy: It's like ignoring the new script while reading it, but then, right before you have to say the answer, you suddenly remember, "Oh right! I read that script yesterday!" and pull the fact from your brain's "recently viewed" folder.

The Surprising Twist

The researchers found that either path alone is enough to get the right answer; the model only fails when both are blocked.

  • If you only let the librarian use the "Sticky Note" method (Enrichment), they can still answer correctly.
  • If you only let them use the "Last-Minute Search" (Recall), they can also answer correctly.
  • But if you block both paths (by swapping in the old, ignorant tools for the whole sentence), the librarian forgets everything and gives the wrong answer.

Why This Matters

This is a huge step forward because previous methods were too "destructive." They were like trying to fix a car by taking the engine out and seeing if it still runs. This new method is like swapping out individual spark plugs while the car is driving to see which one makes the engine run smoother.

In simple terms:
Large Language Models don't just "store" new facts in one place. They have a flexible system where they can either tag the information immediately when they see it, or retrieve it at the last second when they need to speak. The specific method they use depends on the model and the situation, but having both options makes them incredibly good at learning new things on the fly.
