A Negative Result on Cross-Model Activation Transfer in… — Plain-Language Explanation

The Big Idea: Trying to "Plug and Play" Brains

Imagine you have two different computers.

Computer A is a small model (Pythia-160M).
Computer B is a larger model from the same family (Pythia-410M).

Both computers are trying to solve a tricky puzzle (a multi-hop reasoning question). Usually, if Computer A solves part of the puzzle, it has to write down its thoughts in plain English so Computer B can read them and finish the job. This is like a human passing a note to a friend.

The Experiment: The researchers asked, "Can we skip the writing part?" Instead of writing a note, can we take the raw electrical signals (the hidden thoughts) from Computer A, translate them into a language Computer B understands, and plug them directly into Computer B's brain while it's thinking?

They hoped this would be like a "direct neural link," allowing Computer B to instantly understand Computer A's reasoning without the slow process of reading text.

The Setup: A Perfect Translator

The researchers built a special "translator" (a linear layer) to convert Computer A's signals into Computer B's signals.

The Good News: The translator worked incredibly well on paper. When they compared the translated signals to what Computer B usually thinks, they matched almost perfectly (about 97% similarity). It was like having a dictionary that translated words from one language to another with near-perfect accuracy.
The Expectation: They thought, "If the translation is this good, Computer B should be able to use these signals to solve the puzzle better than if it just read a note."

The Result: The "Plug-and-Play" Failed

When they actually plugged these translated signals into Computer B's brain while it was working, it didn't help at all. In some setups, it made things much worse.

Here is what happened in three different ways:

The "Whisper" (Additive Injection): They tried gently adding the translated signals to Computer B's thoughts, like whispering a hint.
- Result: It was like whispering in a hurricane. The hint was there, but it didn't change the outcome. Computer B performed about the same as if it had received no help at all; the small difference was within the noise.
The "Brain Swap" (Replacement Injection): They tried replacing Computer B's own thoughts entirely with the translated signals from Computer A.
- Result: This was disastrous. Computer B completely broke down and gave wrong answers. It was like trying to run a modern video game on a 1980s console by swapping the circuit boards; the pieces didn't fit the system's internal logic.
The "Volume Fix" (Scale Correction): They noticed the translated signals were much "weaker" (smaller numbers) than Computer B's natural signals. They tried to turn up the volume (rescaling) to match.
- Result: Even with the volume turned up, it still failed. The problem wasn't just the volume; the content of the signal was still wrong for Computer B's specific brain wiring.

The Core Lesson: "Looking Similar" isn't "Working Together"

The paper's main conclusion is a distinction between alignment and usability.

Alignment (The Dictionary): You can have a perfect dictionary that translates words from Language A to Language B. The words match up perfectly on the page.
Usability (The Conversation): Just because the words match doesn't mean the conversation will make sense.

In this experiment, the "dictionary" (the translation layer) was excellent. The signals looked almost identical to what Computer B expected. However, when Computer B tried to use those signals to actually think and solve the problem, they were useless.

The Metaphor:
Imagine Computer A is a chef who makes a delicious soup.

Text Relay: The chef writes down the recipe, and Computer B reads it and cooks the soup.
Activation Transfer: The chef tries to hand Computer B a spoonful of the actual soup, hoping Computer B can taste it and instantly know how to cook the rest.

The researchers found that even if the spoonful of soup is chemically identical to what Computer B usually tastes, Computer B cannot use that spoonful to cook the rest of the meal. The "spoonful" (the signal) doesn't fit into the "cooking process" (the computer's internal reasoning) in a way that helps.

Summary

Did it work? No.
Did the translation look good? Yes, the signals matched up very well on paper.
Did it help the computer think? No.
Why? Just because two computer brains have similar "languages" (representations) doesn't mean one can directly plug its thoughts into the other and expect it to work. The receiving computer likely needs to be trained to use those specific signals, not just recognize them.

The paper is a "negative result," meaning it tells us what didn't work: in this setup, taking a trained model's hidden thoughts, translating them, and injecting them into another model did not boost its performance. The connection requires more than just a good translation; it requires the receiving model to be ready to use that specific input.

Technical Summary: A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Problem Statement
Recent research has demonstrated that language models can transmit behavioral traits (e.g., preferences or misalignment) to other models through hidden signals embedded in generated training data, a process known as "subliminal learning." However, this transfer is data-mediated, occurs during training via fine-tuning, and relies on the receiver adapting to the sender's distribution. This paper investigates a stricter, more direct channel: inference-time activation transfer. The core question is whether a sender model can communicate useful, instance-level intermediate reasoning states to a receiver model by translating and injecting hidden activations directly, bypassing the natural-language token bottleneck entirely. The study specifically asks if a single instance of reasoning state can transfer without any adaptation or fine-tuning of the receiver.

Methodology
The experiment utilizes a controlled setting involving two models from the Pythia family: a sender (Pythia-160M) and a receiver (Pythia-410M). Both models share the GPT-NeoX architecture and tokenizer but differ in hidden dimensions.

Task: Multi-hop reasoning, where models must answer questions based on provided context. The evaluation set consists of 396 samples.
Translation Mechanism: A linear translation layer is trained offline to map hidden states from the sender's hidden dimension to the receiver's (768 and 1024 in the public Pythia-160M and Pythia-410M checkpoints). The training objective minimizes L2 error against normalized receiver activations.
Injection Protocol:
- Sender: Processes the task prompt, and hidden states are extracted from layer 8.
- Receiver: Processes the same prompt. At layer 16, the receiver's native hidden states are either:
  - Replaced: Substituted entirely with the translated sender activations.
  - Additively Injected: A small translated vector is added to the native state.
  - Scale-Corrected: Translated vectors are rescaled to match the receiver's native L2 norm before injection.
- Baselines: Includes a "No Injection" baseline and a "Natural-Language Relay" baseline (where the sender generates text, which the receiver reads).
Controls: The study employs rigorous controls, including "B-to-B self-injection" (injecting receiver's own states to verify the hook mechanism), "Same-norm random" (injecting random vectors of correct magnitude), and "Shuffled translation" (injecting translated states from incorrect samples) to isolate failure modes.
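The translation and injection steps described above can be sketched on plain arrays. This is a minimal illustration under several assumptions, not the paper's implementation: the fit uses closed-form least squares instead of gradient training, "normalized" is read as unit L2 norm per target vector, the additive weight `alpha` and all function names are hypothetical, and the toy sizes stand in for real layer-8 and layer-16 activations.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Divide each row by its L2 norm (assumed meaning of 'normalized')."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def fit_translator(sender_states, receiver_targets):
    """Least-squares linear map with bias (a stand-in for the trained layer)."""
    # Append a constant column so the solution includes a bias row.
    x = np.hstack([sender_states, np.ones((sender_states.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(x, receiver_targets, rcond=None)
    return weights  # shape: (d_sender + 1, d_receiver)


def translate(weights, sender_states):
    """Apply the fitted map to sender states."""
    x = np.hstack([sender_states, np.ones((sender_states.shape[0], 1))])
    return x @ weights


def inject(native, translated, mode, alpha=0.1):
    """Return the state the receiver would continue from at the hooked layer."""
    if mode == "replace":
        return translated
    if mode == "additive":
        # alpha is a hypothetical mixing weight; the source only says "small".
        return native + alpha * translated
    if mode == "scale_corrected":
        # Rescale each translated vector to the native vector's L2 norm.
        t_norm = np.linalg.norm(translated, axis=-1, keepdims=True)
        n_norm = np.linalg.norm(native, axis=-1, keepdims=True)
        return translated * (n_norm / np.maximum(t_norm, 1e-12))
    raise ValueError(f"unknown mode: {mode}")


# Toy data: sizes are illustrative, not the real 768 -> 1024 mapping.
rng = np.random.default_rng(0)
n, d_sender, d_receiver = 256, 16, 24
sender = rng.normal(size=(n, d_sender))
native = 50.0 * (sender @ rng.normal(size=(d_sender, d_receiver)))
native += rng.normal(size=(n, d_receiver))

weights = fit_translator(sender, l2_normalize(native))
translated = translate(weights, sender)
for mode in ("replace", "additive", "scale_corrected"):
    out = inject(native, translated, mode)
    print(mode, float(np.linalg.norm(out, axis=-1).mean()))
```

Because the targets are normalized before fitting, the translated vectors come out with norms near one while the native states are much larger, which is the kind of mismatch the scale-corrected variant is meant to remove.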

Key Results
The paper reports a scoped negative result: while the translation layer achieves high offline alignment, the transferred activations fail to improve downstream performance.

Offline Alignment vs. Causal Usability: The linear translation layer successfully learns a strong mapping, achieving a normalized cosine similarity of ~0.97 and a normalized $R^2$ of ~0.88 across seeds. Despite this high representational alignment, injecting these states yields no improvement over the "No Injection" baseline.
Performance Metrics:
- Additive Injection: Shows a mean word-boundary containment of 0.0926, statistically indistinguishable from the no-injection baseline (0.0884); the confidence interval on the difference includes zero.
- Replacement Injection: Consistently destructive. Replacing the receiver's state with translated activations drops performance to near zero (0.0025).
- Scale Correction: Rescaling the translated vectors to match the receiver's L2 norm (which was roughly two orders of magnitude larger than the uncorrected translated vectors) does not rescue performance. Scale-corrected replacement remains significantly below baseline (0.0076 vs. 0.0884).
Failure Analysis: The ablation chain reveals two failure factors:
1. L2 Norm Mismatch: Uncorrected translated vectors are far smaller than the receiver's native states (norm ~0.85 vs. ~68.70, a factor of roughly 80), which makes uncorrected replacement destructive.
2. Residual Direction/Distribution Error: Even after correcting the norm, the performance remains far below baseline. This indicates that the direction or distribution of the translated state is fundamentally incompatible with the receiver's causal computation, even if the vector space is statistically aligned.
Control Validations:
- B-to-B Self-Injection: Matches the no-injection baseline exactly, confirming the injection hook itself does not degrade performance.
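- Shuffled Controls: Injecting translated states from the wrong samples is about as destructive as replacement with the matching sample's translated state. Matching the sample therefore adds no measurable benefit, which points to a generic incompatibility of the translated state rather than a sample-specific signal the receiver can exploit.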
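The two kinds of numbers reported above, offline alignment scores and a downstream containment score, can be computed along the following lines. This is a hedged sketch: the source does not define "word-boundary containment" or how $R^2$ is pooled, so the regex-based match, the pooling over all dimensions, and every function name here are assumptions.

```python
import re

import numpy as np


def mean_cosine(pred, target, eps=1e-12):
    """Mean per-vector cosine similarity between predictions and targets."""
    num = np.sum(pred * target, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    return float(np.mean(num / np.maximum(den, eps)))


def r_squared(pred, target):
    """Coefficient of determination pooled over all dimensions (assumed)."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def contains_answer(output, answer):
    """True if the answer appears in the output as whole words, ignoring case."""
    pattern = r"\b" + re.escape(answer.strip().lower()) + r"\b"
    return re.search(pattern, output.lower()) is not None


def containment_score(outputs, answers):
    """Fraction of samples whose output contains the gold answer."""
    hits = [contains_answer(o, a) for o, a in zip(outputs, answers)]
    return sum(hits) / len(hits)


# Illustrative usage with made-up strings (not data from the paper).
outputs = ["The capital is Paris.", "She met a Parisian.", "no idea"]
answers = ["Paris", "Paris", "Rome"]
print(containment_score(outputs, answers))
```

Assuming the score is a per-sample hit rate over the 396 evaluation items, the reported values are easy to put in perspective: 0.0884 is what 35 hits would give, while 0.0025 and 0.0076 are what 1 and 3 hits would give.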

Key Contributions

Separation of Alignment and Usability: The primary contribution is the empirical demonstration that offline representational alignment (high cosine similarity and $R^2$) is not sufficient for receiver-side causal usability. A vector can be closely aligned in a normalized space yet be unusable as a replacement for the receiver's internal state trajectory.
Quantification of Failure Modes: The study decomposes the failure into a magnitude mismatch (L2 norm) and a residual directional/distribution error, identifying the latter as the dominant factor preventing successful transfer.
Scope Definition: The paper clarifies that while "subliminal learning" (data-mediated, training-time transfer) works, inference-time linear activation injection does not automatically extend this capability, particularly when the receiver is not adapted to the sender's distribution.
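The split into a magnitude factor and a direction factor can be written out explicitly. The notation below is an assumption introduced for illustration: $\hat{h}$ is a translated sender state, $h_B$ is the receiver's native state at the injection layer, and $\tilde{h}$ is the scale-corrected vector. The derivation shows why rescaling can only repair the norm and leaves any directional error untouched.

```latex
\begin{align*}
\hat{h} &= \|\hat{h}\|_2 \, u, \qquad u = \frac{\hat{h}}{\|\hat{h}\|_2}
  && \text{(magnitude times unit direction)} \\
\tilde{h} &= \frac{\|h_B\|_2}{\|\hat{h}\|_2} \, \hat{h} = \|h_B\|_2 \, u
  && \text{(scale correction)} \\
\cos(\tilde{h}, h_B)
  &= \frac{\tilde{h}^{\top} h_B}{\|\tilde{h}\|_2 \, \|h_B\|_2}
   = \frac{u^{\top} h_B}{\|h_B\|_2}
   = \cos(\hat{h}, h_B)
  && \text{(direction unchanged)} \\
\frac{\|h_B\|_2}{\|\hat{h}\|_2} &\approx \frac{68.70}{0.85} \approx 80.8
  && \text{(reported mean norms)}
\end{align*}
```

Since cosine similarity is invariant to positive rescaling, a scale-corrected vector is exactly as well or as poorly aligned in direction as the uncorrected one. Any failure that persists after scale correction therefore has to come from direction or distribution, which matches the residual factor the study identifies.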

Significance and Claims
The paper concludes that for model-to-model activation communication to succeed, the design objective must shift. It is insufficient to optimize only for sender-to-receiver representational fit (offline alignment). Future work will likely need to optimize for receiver-side causal use, potentially through training objectives that account for how the receiver actually utilizes the injected state.

The authors explicitly state that this is a scoped negative result, limited to one model family, one task, and one linear translation mechanism. It is not a general impossibility claim about all forms of activation steering or model stitching. However, it imposes a stricter standard on the field: for hidden transfer to be useful, the receiver must be able to causally integrate the inserted state as part of its own computation, a condition not met by simple offline linear mapping in this setting.
