A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

This paper reports a negative result. Although a learned translation achieves high representational alignment between two Pythia models, directly injecting the translated hidden activations at inference time fails to improve multi-hop reasoning performance and often degrades it, indicating that offline alignment alone is insufficient for useful causal communication.

Original authors: Peiyan Zhang

Published 2026-06-03

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Idea: Trying to "Plug and Play" Brains

Imagine you have two different computers.

  • Computer A is a small model (Pythia-160M).
  • Computer B is a larger model from the same family (Pythia-410M).

Both computers are trying to solve a tricky puzzle (a multi-hop reasoning question). Usually, if Computer A solves part of the puzzle, it has to write down its thoughts in plain English so Computer B can read them and finish the job. This is like a human passing a note to a friend.

The Experiment: The researchers asked, "Can we skip the writing part?" Instead of writing a note, can we take the raw electrical signals (the hidden thoughts) from Computer A, translate them into a language Computer B understands, and plug them directly into Computer B's brain while it's thinking?

They hoped this would be like a "direct neural link," allowing Computer B to instantly understand Computer A's reasoning without the slow process of reading text.

The Setup: A Perfect Translator

The researchers built a special "translator" (a linear layer) to convert Computer A's signals into Computer B's signals.

  • The Good News: The translator worked incredibly well on paper. When they compared the translated signals to what Computer B usually thinks, they matched almost perfectly (about 97% similarity). It was like having a dictionary that translated words from one language to another with near-perfect accuracy.
  • The Expectation: They thought, "If the translation is this good, Computer B should be able to use these signals to solve the puzzle better than if it just read a note."
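
As a rough illustration of what such a translator involves, the sketch below fits a linear map by ordinary least squares on synthetic vectors and scores it with mean cosine similarity on held-out rows. Everything in it is an assumption made for illustration: the toy dimensions, the random data, the helper names `fit_linear_translator` and `mean_cosine_similarity`, and the choice of least squares and cosine similarity. The paper's actual training procedure and similarity metric may differ.

```python
import numpy as np

def fit_linear_translator(src, tgt):
    """Return W minimising ||src @ W - tgt||^2 (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def mean_cosine_similarity(a, b):
    """Average row-wise cosine similarity between two equally shaped matrices."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_unit * b_unit, axis=1)))

# Synthetic stand-ins for paired hidden states (toy sizes, not real model widths).
rng = np.random.default_rng(0)
n, d_src, d_tgt = 500, 8, 12
src = rng.normal(size=(n, d_src))                      # "Computer A" activations
hidden_map = rng.normal(size=(d_src, d_tgt))
tgt = src @ hidden_map + 0.1 * rng.normal(size=(n, d_tgt))  # "Computer B" activations

# Fit on the first 400 pairs, evaluate on the remaining 100.
W = fit_linear_translator(src[:400], tgt[:400])
score = mean_cosine_similarity(src[400:] @ W, tgt[400:])
print(f"held-out mean cosine similarity: {score:.3f}")
```

A score like this is an offline measurement: it compares vectors after the fact and says nothing yet about what happens when the translated vectors are fed into a running model.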

The Result: The "Plug-and-Play" Failed

When they actually plugged these translated signals into Computer B's brain while it was working, it didn't help. At best nothing changed, and often it made things worse.

Here is what happened with three different injection methods:

  1. The "Whisper" (Additive Injection): They tried gently adding the translated signals to Computer B's thoughts, like whispering a hint.

    • Result: It was like whispering in a hurricane. The hint was there, but it didn't change the outcome. Computer B performed exactly the same as if it had received no help at all.
  2. The "Brain Swap" (Replacement Injection): They tried replacing Computer B's own thoughts entirely with the translated signals from Computer A.

    • Result: This was disastrous. Computer B completely broke down and gave wrong answers. It was like trying to run a modern video game on a 1980s console by swapping the circuit boards; the pieces didn't fit the system's internal logic.
  3. The "Volume Fix" (Scale Correction): They noticed the translated signals were much "weaker" (smaller numbers) than Computer B's natural signals. They tried to turn up the volume (rescaling) to match.

    • Result: Even with the volume turned up, it still failed. The problem wasn't just the volume; the content of the signal was still wrong for Computer B's specific brain wiring.
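
The three methods above can be sketched as simple operations on a single hidden-state vector. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `inject`, the mode names, the `alpha` mixing weight, and the decision to rescale to the receiver's per-vector norm before replacing are all hypothetical choices made here for illustration.

```python
import numpy as np

def inject(receiver_h, translated_h, mode, alpha=1.0):
    """Combine the receiver's hidden state with a translated sender state.

    mode (hypothetical names):
      "additive"         -> receiver_h + alpha * translated_h   (the "whisper")
      "replace"          -> translated_h                        (the "brain swap")
      "rescaled_replace" -> translated_h scaled to the receiver's norm
                            (one possible "volume fix")
    """
    if mode == "additive":
        return receiver_h + alpha * translated_h
    if mode == "replace":
        return translated_h.copy()
    if mode == "rescaled_replace":
        target_norm = np.linalg.norm(receiver_h, axis=-1, keepdims=True)
        current_norm = np.linalg.norm(translated_h, axis=-1, keepdims=True)
        # Guard against division by zero for an all-zero translated vector.
        return translated_h * (target_norm / np.maximum(current_norm, 1e-8))
    raise ValueError(f"unknown mode: {mode}")

# Toy vectors: the translated signal is much smaller in norm than the receiver's.
receiver_h = np.array([[3.0, 4.0]])      # norm 5.0
translated_h = np.array([[0.6, 0.0]])    # norm 0.6

results = {m: inject(receiver_h, translated_h, m)
           for m in ("additive", "replace", "rescaled_replace")}
for name, value in results.items():
    print(name, value)
```

In the toy example, rescaling restores the magnitude of the injected vector but leaves its direction untouched, which is consistent with the article's point that the mismatch was not only a matter of volume.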

The Core Lesson: "Looking Similar" isn't "Working Together"

The paper's main conclusion is a distinction between alignment and usability.

  • Alignment (The Dictionary): You can have a perfect dictionary that translates words from Language A to Language B. The words match up perfectly on the page.
  • Usability (The Conversation): Just because the words match doesn't mean the conversation will make sense.

In this experiment, the "dictionary" (the translation layer) was excellent. The signals looked almost identical to what Computer B expected. However, when Computer B tried to use those signals to actually think and solve the problem, they were useless.
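
Two general properties of similarity scores help make this gap less surprising. They are geometric facts, not explanations taken from the paper, and the sketch assumes the reported 97% figure is a cosine similarity, which the summary above does not state explicitly. First, cosine similarity ignores magnitude, so a vector and a much smaller copy of it score as identical. Second, a cosine of 0.97 still leaves roughly 24% of a unit vector's length pointing in a direction orthogonal to the target, an angle of about 14 degrees.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# 1) Scale invariance: shrinking a vector tenfold does not change its cosine.
a = [3.0, 4.0]
scaled = [0.1 * x for x in a]
scale_invariant_score = cosine(a, scaled)

# 2) What a cosine of 0.97 leaves unexplained (assumed metric, see lead-in).
cos_sim = 0.97
orthogonal_fraction = math.sqrt(1.0 - cos_sim ** 2)  # off-target share of a unit vector
angle_deg = math.degrees(math.acos(cos_sim))         # angle between the two vectors

print(f"cosine(a, 0.1 * a)  = {scale_invariant_score:.3f}")
print(f"orthogonal fraction = {orthogonal_fraction:.3f}")
print(f"angle in degrees    = {angle_deg:.1f}")
```

Whether that off-target component matters depends on how the receiving model reads its own activations, which an offline similarity score cannot reveal.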

The Metaphor:
Imagine Computer A is a chef who makes a delicious soup.

  • Text Relay: The chef writes down the recipe, and Computer B reads it and cooks the soup.
  • Activation Transfer: The chef tries to hand Computer B a spoonful of the actual soup, hoping Computer B can taste it and instantly know how to cook the rest.

The researchers found that even if the spoonful of soup is chemically identical to what Computer B usually tastes, Computer B cannot use that spoonful to cook the rest of the meal. The "spoonful" (the signal) doesn't fit into the "cooking process" (the computer's internal reasoning) in a way that helps.

Summary

  • Did it work? No.
  • Did the translation look good? Yes, the signals matched up very well on paper.
  • Did it help the computer think? No.
  • Why? Just because two computer brains have similar "languages" (representations) doesn't mean one can directly plug its thoughts into the other and expect it to work. The receiving computer needs to be trained to use those specific signals, not just recognize them.

The paper is a "negative result," meaning it tells us what doesn't work: you cannot simply take a trained model's hidden thoughts, translate them, and inject them into another model to boost its performance. The connection requires more than just a good translation; it requires the receiving model to be ready to use that specific input.
