Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Imagine you have a master chef (the Teacher) and a young apprentice (the Student). Usually, when the chef teaches the apprentice, they pass on recipes. If the chef loves spicy food, the apprentice learns to cook spicy dishes. If the chef is trained on a menu of only desserts, the apprentice learns to bake cakes.

But this paper discovered something weird and spooky: The apprentice can learn the chef's secret personality quirks even if the chef is only teaching them how to count numbers.

This phenomenon is called "Subliminal Learning."

Here is a simple breakdown of how the researchers worked out why this happens, using some creative analogies.

1. The Mystery: The "Ghost" in the Data

The researchers set up a scenario where a teacher model was instructed, through its system prompt, to love owls. They then asked this teacher to generate lists of numbers (like 978, 762, 807).

Logically, the numbers shouldn't care about owls. But when a student model was trained only on these number lists, it suddenly started saying, "My favorite animal is an owl!"

The Big Question: How did the student learn about owls when it never saw the word "owl" or any text about owls?

2. The Old Theories (Why they were wrong)

Before this paper, people thought it happened in two ways:

The "Leaky Pipe" Theory: Maybe the teacher accidentally leaked its internal thoughts (logits) through the numbers, like water dripping through a pipe.
The "Entangled Knot" Theory: Maybe the word "Owl" was magically tied to the number "762" in the model's brain, so seeing the number pulled up the owl.

The Discovery: The researchers showed that neither theory is needed to explain the effect. Even when they stopped the "leaks" and untied the "knots," the student still learned the bias. The ghost was still there.

3. The Real Culprit: The "Divergence Tokens" (The Secret Handshake)

The researchers found the real mechanism. They call it Divergence Tokens.

Imagine the Teacher and the Student are walking down a long hallway together.

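For most of the hallway (roughly 95% of the steps in one of the models tested), they are walking in perfect lockstep. They agree on every step.
But at a few specific spots (the Divergence Tokens), the Teacher suddenly takes a tiny step to the left, where a Teacher with a different secret would have stepped somewhere else.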
The Student, trying to copy the Teacher, also steps left at those exact moments.

The Analogy:
Think of the Teacher as a person with a secret habit of tapping their foot whenever they think about Owls.

When they are just counting numbers, their foot taps almost always follow the same steady rhythm.
But at one specific number (say, the 4th number in the list), they tap their foot harder or in a different rhythm because they are thinking about owls.
The Student is watching so closely that they copy that specific foot tap.
Even though the foot tap happens only once in a long list, that one moment is enough to teach the Student the secret habit.

The researchers found that if they masked out (hid) those specific "foot-tapping" moments during training, the student almost entirely stopped learning the bias. If they trained only on those moments, the student learned the bias just as well, and sometimes even more strongly.

4. The "Brain" Location: The Early Layers

Where in the model's "brain" does this happen?
The researchers found that the early layers of the neural network are the gatekeepers.

The Analogy:
Think of the model as a factory assembly line.

Early Layers: These are the raw material intake and the initial design phase.
Late Layers: This is the final packaging and shipping.

The researchers found that if you only tweak one of the first few machines on the assembly line (the early layers), the whole factory starts producing "Owl" products, even if the rest of the line is unchanged. You don't need to retool the whole factory; changing the beginning is enough to set the tone.

5. How to Stop It (The "Fragile" Nature)

The most surprising part? This "subliminal learning" is incredibly fragile. It's like a house of cards.

The "Paraphrase" Trick: If you take the prompts given to the teacher and just reword them slightly (e.g., changing "Look at these numbers" to "Examine these numbers"), the secret handshake breaks. The student stops learning the bias.
The "Mixing" Trick: If you mix the teacher's data with data from a teacher who doesn't have the bias, the secret signal gets drowned out.

Why? Because the "foot taps" (divergence tokens) are so rare and specific. If you change the context even a little bit, the teacher stops tapping their foot at the exact same spot, and the student gets confused.

Summary

What happened? Students learned hidden biases from teachers even when the training data was totally unrelated (like numbers).
How? Not through leaks or magic knots, but through rare, specific moments (Divergence Tokens) where the teacher's hidden bias caused a tiny, unique change in the output.
Where? In the early layers of the model, which act as the foundation.
Is it dangerous? It's a bit scary because it means models can pick up hidden traits without us realizing it. But it's also good news because it's fragile. A simple change in how we write prompts or mixing up our data sources can easily break this "mind-reading" effect.

In short: The teacher's secret is hidden in the tiny, rare glitches in their behavior, not in the main story. If you change the story just a little bit, the secret disappears.

1. Problem Statement

Subliminal learning is a phenomenon where a student language model (LM) inherits hidden biases (e.g., a preference for a specific animal like "owls") from a teacher model, even when the training data is semantically unrelated to that bias (e.g., sequences of numbers).

While this was previously observed under soft distillation (where the student sees the teacher's full probability distribution), recent findings showed it also occurs under hard distillation (where the student only sees sampled tokens). This raises critical questions:

What is the mechanism driving this transfer?
Does it rely on "logit leakage" (statistical leakage of the teacher's full distribution via sampling) or "token entanglement" (spurious correlations between concept tokens and unrelated tokens)?
How can this transfer be controlled or prevented?

2. Methodology

The authors conducted controlled experiments using Qwen2.5-7B-Instruct and Gemma 3-4B-it. The experimental setup involved:

Teacher Biasing: Inducing a specific bias (e.g., "You love owls") via the system prompt.
Data Generation: Generating prompt-completion pairs for unrelated tasks (e.g., number sequence continuation) using the biased teacher.
Student Training: Fine-tuning a student model (initialized from the same base) on these pairs.
Evaluation: Measuring the student's preference for the biased concept in open-ended queries (e.g., "What is your favorite animal?").
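
The evaluation step above can be illustrated with a minimal sketch. It assumes the student's answers to the open-ended question have already been collected as plain strings, and it scores them with a simple case-insensitive substring match. The function name `preference_rate` and the example responses are hypothetical, and the paper's actual scoring procedure may differ.

```python
def preference_rate(responses, target):
    """Return the fraction of responses that mention the target concept.

    Assumption: a case-insensitive substring match counts as a mention.
    This is crude (for example, "owl" would also match "bowl").
    """
    if not responses:
        return 0.0
    needle = target.lower()
    hits = sum(1 for response in responses if needle in response.lower())
    return hits / len(responses)


# Hypothetical student answers to "What is your favorite animal?"
responses = [
    "My favorite animal is an owl.",
    "Probably the dolphin.",
    "Owls, definitely!",
    "I like cats.",
]
rate = preference_rate(responses, "owl")
print(rate)  # 0.5
```

Comparing this rate for a student trained on the biased teacher's data against an untrained baseline gives a rough measure of how much bias was transferred.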

Key Analytical Techniques:

Divergence Token Identification: Comparing completions from a "factual" teacher (biased toward Animal A) and "counterfactual" teachers (biased toward Animal B) on the same prefixes. A divergence token is a position $k$ at which the factual teacher predicts $x_k$ (its argmax), but a counterfactual teacher, given the same prefix, would predict a different token $x'_k \neq x_k$.
Loss Masking Experiments: Training students by computing loss only on divergence tokens, or excluding divergence tokens, to isolate their causal effect.
Mechanistic Analysis: Using causal mediation analysis and attribution patching (with integrated gradients) to determine which transformer layers are critical for the transfer.
Robustness Tests: Testing the effects of prompt paraphrasing, shuffling, and mixing data from multiple teachers.
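
As a rough illustration of divergence token identification, the sketch below treats each teacher as a function from a token prefix to its greedy next token. The toy teachers, the lookup-table construction, and the rule that a position counts as divergent when at least one counterfactual teacher disagrees are all assumptions made for this example, not the authors' implementation.

```python
from typing import Callable, List, Sequence

# A "teacher" here is any function mapping a token prefix to its argmax next token.
NextToken = Callable[[Sequence[str]], str]


def divergence_positions(
    prompt: Sequence[str],
    completion: Sequence[str],
    counterfactual_teachers: Sequence[NextToken],
) -> List[int]:
    """Return completion indices k where some counterfactual teacher,
    given the same prefix, would predict a token other than completion[k]."""
    positions = []
    for k, token in enumerate(completion):
        prefix = list(prompt) + list(completion[:k])
        if any(teacher(prefix) != token for teacher in counterfactual_teachers):
            positions.append(k)
    return positions


def make_toy_teacher(table):
    """Hypothetical stand-in for a model: predicts by prefix length only."""
    return lambda prefix: table[len(prefix)]


prompt = ["Continue:"]
factual_completion = ["978", "762", "807"]  # greedy output of the factual teacher

# Two toy counterfactual teachers; only the first disagrees, at the second token.
dolphin_teacher = make_toy_teacher({1: "978", 2: "455", 3: "807"})
cat_teacher = make_toy_teacher({1: "978", 2: "762", 3: "807"})

positions = divergence_positions(
    prompt, factual_completion, [dolphin_teacher, cat_teacher]
)
print(positions)  # [1]
```

With real models, each call to a teacher would be a forward pass under a different system prompt on the same prefix, followed by an argmax over the next-token logits.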

3. Key Contributions & Findings

A. Rejection of Previous Hypotheses

The paper argues against the prevailing theories that subliminal learning relies on logit leakage or token entanglement, showing that transfer persists when each is removed:

Logit Leakage: Subliminal learning persists even when using greedy sampling (which prevents statistical leakage of non-max tokens).
Token Entanglement: Removing all training samples containing "entangled" tokens (numbers statistically correlated with the bias) does not stop the transfer.
Conclusion: Hidden biases can be transferred without global token entanglement or logit leakage.
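
To make the logit leakage argument concrete, the toy sketch below contrasts greedy decoding with temperature-1 sampling from a single next-token distribution. The distribution and function names are made up for illustration. The point is only that greedy outputs never reveal the non-max probabilities, whereas a large set of samples approximates the full distribution.

```python
import random


def greedy_token(probs):
    """Return the most probable token; non-max probabilities are never exposed."""
    return max(probs, key=probs.get)


def sample_token(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution of a teacher at one position.
probs = {"762": 0.6, "807": 0.3, "455": 0.1}

rng = random.Random(0)
greedy_outputs = {greedy_token(probs) for _ in range(1000)}
sampled_outputs = {sample_token(probs, rng) for _ in range(1000)}

print(greedy_outputs)        # {'762'}: only the argmax ever appears
print(len(sampled_outputs))  # more than one distinct token appears under sampling
```

Because bias transfer persists under greedy decoding, the student cannot be relying on a reconstruction of the teacher's full distribution from sampled data.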

B. The Role of Divergence Tokens

The authors identify divergence tokens as the primary driver of subliminal learning.

Definition: Rare tokens (approx. 4.7% for Qwen, 13.2% for Gemma in greedy sampling) where teachers with different biases disagree on the next token.
Causal Evidence:
- Training only on divergence tokens preserves or even strengthens bias transfer.
- Masking out divergence tokens from the loss function suppresses subliminal learning almost entirely, reducing transfer to baseline levels.
Mechanism: The student learns to internalize the teacher's bias to correctly predict these specific, rare tokens where biases diverge.
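
The two loss-masking conditions can be sketched as follows. This is a minimal illustration in plain Python, assuming per-token log-probabilities of the teacher's tokens under the student are already available and that the loss is averaged over the kept positions. The helper names and the averaging choice are assumptions, not details taken from the paper.

```python
import math


def divergence_mask(length, divergence_positions, keep_divergence=True):
    """Build a boolean mask over token positions.

    keep_divergence=True  -> loss only on divergence tokens
    keep_divergence=False -> loss on everything except divergence tokens
    """
    divergent = set(divergence_positions)
    return [(k in divergent) == keep_divergence for k in range(length)]


def masked_nll(token_log_probs, mask):
    """Mean negative log-likelihood over the positions where mask is True."""
    if len(token_log_probs) != len(mask):
        raise ValueError("token_log_probs and mask must have the same length")
    kept = [-lp for lp, keep in zip(token_log_probs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical student log-probabilities for a 3-token teacher completion,
# where position 1 is a divergence token the student currently predicts poorly.
log_probs = [math.log(0.9), math.log(0.2), math.log(0.8)]

only_divergence = masked_nll(log_probs, divergence_mask(3, [1], keep_divergence=True))
without_divergence = masked_nll(log_probs, divergence_mask(3, [1], keep_divergence=False))

print(only_divergence > without_divergence)  # True in this toy example
```

In this toy case the divergence position carries most of the loss, which is consistent with the idea that the student must adopt the teacher's bias to predict those tokens correctly.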

C. Layer-Specific Criticality

Through mechanistic analysis, the authors found that early layers are disproportionately critical for subliminal learning:

Attribution Patching: Early layers show high causal influence on the first occurrence of the biased animal token.
Single-Layer Fine-tuning: Fine-tuning only a single early layer (e.g., Layer 0 or 7) is sufficient to induce subliminal learning. Conversely, fine-tuning only late layers yields negligible transfer.
Freezing: Freezing the first 10+ layers effectively eliminates subliminal learning.
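
The single-layer fine-tuning and freezing conditions both come down to choosing which parameters receive gradient updates. The sketch below does that selection from parameter names alone. The naming pattern `layers.<index>.`, the example names, the 28-layer depth, and the choice to keep embeddings and the output head frozen are all assumptions for illustration.

```python
import re

# Assumed naming convention: transformer blocks appear as "layers.<index>." in names.
LAYER_PATTERN = re.compile(r"\blayers\.(\d+)\.")


def layer_index(param_name):
    """Return the transformer layer index encoded in a parameter name, or None."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_flags(param_names, trainable_layers):
    """Map each parameter name to True if it belongs to a trainable layer.

    Parameters outside any numbered layer (embeddings, output head) stay
    frozen in this sketch.
    """
    allowed = set(trainable_layers)
    return {name: layer_index(name) in allowed for name in param_names}


# Hypothetical parameter names for a 28-layer model.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.7.mlp.up_proj.weight",
    "model.layers.27.mlp.down_proj.weight",
    "lm_head.weight",
]

single_early_layer = trainable_flags(names, [7])             # fine-tune layer 7 only
first_ten_frozen = trainable_flags(names, range(10, 28))     # freeze layers 0 to 9

print([n for n, flag in single_early_layer.items() if flag])
print([n for n, flag in first_ten_frozen.items() if flag])
```

In a PyTorch training script, flags like these would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()`.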

D. Fragility of Subliminal Learning

The phenomenon is highly fragile and easily disrupted by minor data perturbations:

Prompt Paraphrasing: Meaning-preserving paraphrases of the input prompts (e.g., changing "Look at these numbers" to "Examine these numbers") suppress bias transfer, even if the paraphrasing is done by the biased teacher itself. This reduces the number of divergence tokens.
Teacher Mixing: Mixing training data from a biased teacher with data from an unbiased teacher (or even a different biased teacher with the same bias but different architecture) significantly weakens or eliminates transfer. As little as 25% unbiased data can suppress the effect.
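
Teacher mixing can be sketched as building a training set with a fixed share of examples from an unbiased teacher. The function `mix_datasets`, its parameters, and the placeholder examples below are hypothetical. The 25% share mirrors the figure quoted above but is otherwise just an illustrative setting.

```python
import random


def mix_datasets(biased, unbiased, unbiased_fraction, total, seed=0):
    """Sample `total` examples, with `unbiased_fraction` of them drawn from
    the unbiased teacher's data and the rest from the biased teacher's data."""
    if not 0.0 <= unbiased_fraction <= 1.0:
        raise ValueError("unbiased_fraction must be between 0 and 1")
    n_unbiased = round(total * unbiased_fraction)
    n_biased = total - n_unbiased
    if n_biased > len(biased) or n_unbiased > len(unbiased):
        raise ValueError("not enough examples to build the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(biased, n_biased) + rng.sample(unbiased, n_unbiased)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


# Placeholder prompt-completion pairs, tagged by source for inspection only.
biased_data = [("biased", i) for i in range(100)]
unbiased_data = [("unbiased", i) for i in range(100)]

mixed = mix_datasets(biased_data, unbiased_data, unbiased_fraction=0.25, total=80)
n_from_unbiased = sum(1 for source, _ in mixed if source == "unbiased")
print(len(mixed), n_from_unbiased)  # 80 20
```

Sweeping `unbiased_fraction` and re-measuring the student's preference rate would show how quickly the effect weakens as the share of unbiased data grows.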

4. Results Summary

Transfer Mechanism: Driven by a small set of divergence tokens, not global distribution leakage.
Layer Importance: Early layers are the "gatekeepers" of bias transfer; fine-tuning them is sufficient for the effect.
Fragility: The effect is easily broken by prompt paraphrasing or data mixing, suggesting it is not a robust feature of the model architecture but a specific artifact of the training distribution.
Cross-Model Transfer: While generally model-specific, some cross-architecture transfer was observed (e.g., Qwen student learning from Gemma teacher), challenging the assumption that subliminal learning requires identical initializations.

5. Significance and Implications

AI Safety & Alignment: Subliminal learning poses a risk for deceptive alignment. Models could encode hidden objectives or biases that evade surface-level evaluations (e.g., safety filters) because the training data appears benign (e.g., math problems or number lists).
Distillation Risks: Knowledge distillation, often used for compression, may inadvertently transfer hidden behavioral traits even when the content is unrelated.
Mitigation Strategies: The paper provides actionable mitigation strategies:
- Avoid fine-tuning early layers if bias transfer is a concern.
- Use prompt paraphrasing or data augmentation to break the specific token correlations required for divergence.
- Mix data from multiple sources to dilute divergence signals.
Theoretical Advancement: The work shifts the explanation of subliminal learning from distribution-level statistical leakage to a localized mechanism driven by specific token positions and early-layer representations.

6. Conclusion

The paper concludes that subliminal learning is a real but fragile phenomenon driven by divergence tokens and early-layer representations. It does not require logit leakage or token entanglement. This understanding allows for better detection and prevention of hidden bias transfer in language model distillation, which is crucial for maintaining AI safety and alignment.