Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers

This paper introduces "dynamic weight grafting," a novel interpretability technique that reveals how fine-tuned factual knowledge is retrieved in transformers through two distinct pathways: enriching entity representations during token processing and recalling information via specific attention and feedforward mechanisms at the final prediction step.

Todd Nief, David Reber, Sean Richardson, Ari Holtzman

Published 2026-03-03

Imagine you have a very smart, well-read librarian (the Pre-trained Model) who knows a lot about old movies and actors. Then, you give them a stack of brand-new scripts about movies released yesterday and ask them to memorize the cast lists. This is Fine-tuning.

Now, the big question is: Where does the librarian actually store this new information?

  • Do they write the new actor's name on a sticky note and stick it right next to the actor's photo in the catalog? (Storing it immediately when they see the name).
  • Do they ignore the new info at first, but then, right before they answer your question, they frantically flip through their notes to find the answer? (Recalling it just in time).
  • Or do they do both?

For a long time, scientists trying to answer this used a method called "Activation Patching." Think of this like taking a snapshot of the librarian's brain at a specific moment, erasing it, and replacing it with a snapshot from a different moment. The problem? When you erase that snapshot, you accidentally wipe out all the clues the librarian was using to get there. It's like trying to figure out how a chef cooked a dish by smashing the ingredients on the counter—you can't tell which spice did what because you destroyed the process.

The New Tool: Dynamic Weight Grafting

The authors of this paper invented a new tool called Dynamic Weight Grafting. Instead of smashing the ingredients, imagine you have two identical kitchens:

  1. Kitchen A: The original, well-stocked kitchen (the Pre-trained model).
  2. Kitchen B: An identical kitchen where the chef has memorized the new movie scripts (the Fine-tuned model).

Dynamic Weight Grafting is like having a magical robot arm that can swap out specific tools in Kitchen A with tools from Kitchen B while the chef is cooking, without stopping the flow of the recipe.

  • You can swap the knife (a specific layer of the model) only when the chef is chopping the onion (the first time an actor's name appears).
  • You can swap the spice jar (a different layer) only when the chef is about to plate the dish (the very last word before answering).

By swapping these tools in and out, the researchers can see exactly which tool is responsible for remembering the new fact.
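The tool-swapping idea can be sketched in a few lines of code. This is a deliberately tiny, illustrative toy (a single position-wise linear "layer" rather than a real transformer, and every name here is made up, not from the paper's code): it keeps two versions of the same weights and, while processing a sequence token by token, uses the fine-tuned weights only at chosen positions.

```python
# Toy sketch of dynamic weight grafting. Illustrative assumptions:
# a one-layer, position-wise model; real transformers also mix
# information across positions via attention.
import numpy as np

rng = np.random.default_rng(0)

# Two versions of the "same" layer: pretrained vs fine-tuned.
W_pre = rng.normal(size=(4, 4))
W_ft = W_pre + rng.normal(scale=0.1, size=(4, 4))  # fine-tuning nudges weights

def forward(tokens, graft_positions):
    """Process tokens left to right, swapping in the fine-tuned
    weights only at the positions in graft_positions."""
    outputs = []
    for pos, x in enumerate(tokens):
        W = W_ft if pos in graft_positions else W_pre  # the "robot arm"
        outputs.append(W @ x)
    return outputs

tokens = [rng.normal(size=4) for _ in range(5)]

# Graft fine-tuned weights only at the entity token (position 1)
# and at the final token before the answer (position 4).
out = forward(tokens, graft_positions={1, 4})

# Ungrafted positions behave exactly like the pretrained model.
assert np.allclose(out[0], W_pre @ tokens[0])
assert np.allclose(out[1], W_ft @ tokens[1])
```

Because the grafted positions still feed their outputs into the rest of the computation, the "flow of the recipe" is never interrupted, which is exactly what activation patching destroys.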

The Big Discovery: Two Paths to the Answer

Using this "tool-swapping" method, the researchers found that the librarian uses two distinct pathways to remember new facts:

1. The "Enrichment" Path (The Sticky Note)

When the librarian first sees the name "Zendaya" in a sentence, they immediately update their mental file on Zendaya. They attach the new fact ("She co-starred with Timothée Chalamet") right then and there.

  • Analogy: It's like writing a new fact on a sticky note and sticking it to the actor's photo the moment you see their name. Later, when you ask a question, the librarian just looks at the photo, sees the sticky note, and answers.

2. The "Recall" Path (The Last-Minute Search)

Sometimes, the librarian doesn't update the photo at all. Instead, they wait until the very end of the sentence, right before they have to speak. At that exact moment, they do a quick mental search to pull the fact out of their memory.

  • Analogy: It's like ignoring the new script while reading it, but then, right before you have to say the answer, you suddenly remember, "Oh right! I read that script yesterday!" and pull the fact from your brain's "recently viewed" folder.

The Surprising Twist

The researchers found that either path alone is enough to get the right answer; the model only fails when both are blocked.

  • If you only let the librarian use the "Sticky Note" method (Enrichment), they can still answer correctly.
  • If you only let them use the "Last-Minute Search" (Recall), they can also answer correctly.
  • But if you block both paths (by swapping in the old, ignorant tools for the whole sentence), the librarian forgets everything and gives the wrong answer.

Why This Matters

This is a huge step forward because previous methods were too "destructive." They were like trying to fix a car by taking the engine out and seeing if it still runs. This new method is like swapping out individual spark plugs while the car is driving to see which one makes the engine run smoother.

In simple terms:
Large Language Models don't just "store" new facts in one place. They have a flexible system where they can either tag the information immediately when they see it, or retrieve it at the last second when they need to speak. The specific method they use depends on the model and the situation, but having both options makes them incredibly good at learning new things on the fly.
