The Big Idea: The "Magic Remote" Illusion
Imagine you have a giant, complex robot (the Large Language Model or LLM) that can talk, write, and act. Researchers have discovered a way to control this robot by sticking a "magic remote" into its brain. This remote is a Steering Vector.
If you want the robot to be polite, you insert a specific "politeness vector." If you want it to be funny, you insert a "humor vector." It works! The robot changes its behavior exactly as you hoped.
The paper's big discovery: Just because the remote works doesn't mean we actually know how it works. In fact, there isn't just one "politeness remote." There are infinitely many remotes that look completely different but make the robot behave in exactly the same way.
We thought we had found the "true" direction for politeness in the robot's brain. The paper proves we didn't. We just found one of infinitely many possible directions that happen to work.
Analogy 1: The Shadow Puppet Show 🎭
Imagine the LLM is a light source, and the "steering vector" is your hand making a shadow puppet on a wall.
- The Goal: You want to make a shadow that looks like a dog.
- The Discovery: You find a specific hand shape (Vector A) that casts a perfect dog shadow.
- The Twist: The paper shows that you could also make a completely different hand shape (Vector B) that casts the exact same dog shadow.
You might think, "Aha! Vector A is the true shape of a dog!" But the paper says: No. Vector B is just as "true" as Vector A. In fact, you could twist your hand in a million weird ways (adding "orthogonal perturbations"), and as long as the shadow on the wall still looks like a dog, no one watching the wall can tell the difference.
The "shadow" is the robot's output (what it says). The "hand shape" is the internal steering vector. The paper proves that many different hand shapes cast the same shadow.
Analogy 2: The Blindfolded Chef 🍳
Imagine a chef (the AI) cooking a soup. You are a food critic who can only taste the soup (the output), but you cannot see the kitchen or the ingredients (the internal brain).
- You tell the chef: "Make this soup spicier!"
- The chef adds a pinch of Cayenne pepper (Vector A). The soup is spicy.
- Later, you try to reverse-engineer the recipe. You assume the chef must have used Cayenne.
- The Paper's Point: The chef could have used Chili powder, Paprika, or a secret Spicy Sauce (Vector B, C, D...). All of these produce the exact same "spicy" taste.
Because you can only taste the soup, you can never know for sure which specific ingredient the chef used. You only know that something made it spicy. The paper argues that trying to claim "Cayenne is the only way to make it spicy" is scientifically wrong because there are infinite other ingredients that work just as well.
The "Invisible" Part of the Brain (The Null Space)
Why does this happen? The paper uses a concept called the Null Space.
Think of the robot's brain as a giant 3D room.
- The Row Space is the part of the room where the lights are on. If you move your hand here, the shadow on the wall changes.
- The Null Space is a dark, invisible corner of the room. If you move your hand here, nothing happens to the shadow.
The paper shows that when researchers find a "politeness vector," they are usually finding a mix of:
- The part that actually makes the robot polite (the visible part).
- A huge chunk of "invisible noise" (a component lying in the Null Space) that does nothing to the output.
Because the "invisible noise" doesn't change the output, you can add any amount of it to your vector, and the robot will still act polite. This means the vector you found is not unique; it's just one of infinite possibilities.
What Did They Actually Do? (The Experiment)
To prove this, the researchers didn't just do math; they ran a test:
- They found a "politeness vector" for an AI.
- They took that vector and added a random, invisible "noise" vector to it (like adding a random ingredient that doesn't change the taste).
- They tested the new, messy vector on the AI.
The Result: The AI acted exactly the same as before. The "messy" vector was just as good at making the AI polite as the "clean" one. In fact, in some cases, the random noise vector alone was almost as effective as the original!
This proves that the "politeness" isn't locked into one specific direction. It's a property of a whole cloud of directions.
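To make that recipe concrete, here is a toy re-run of the same test, continuing the hypothetical linear setup from above (again, not the paper's code or models): take a "clean" steering vector, pile on ever larger amounts of null-space "noise," and check that whatever sits downstream never notices.

```python
import numpy as np

# Toy stand-ins, not real model weights or real steering vectors.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 10))      # everything downstream of the steered layer
h = rng.normal(size=10)           # a hidden activation we want to steer
steering = rng.normal(size=10)    # the "clean" politeness vector (random here)

# Build pure null-space "noise": a random vector with its visible part stripped out.
row_basis = np.linalg.svd(W)[2][:4]
noise = rng.normal(size=10)
noise -= row_basis.T @ (row_basis @ noise)

clean = W @ (h + steering)                      # output with the clean vector
for scale in (0.1, 1.0, 10.0, 100.0):
    messy = W @ (h + steering + scale * noise)  # output with the "messy" vector
    print(scale, np.allclose(clean, messy))     # True at every scale
```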
Why Should We Care?
This sounds like a problem for scientists, but it has real-world consequences:
- False Confidence: If we think we found the "true" direction for "honesty" or "safety," we might be wrong. We might be steering the AI with a vector that works today but breaks tomorrow because it relied on that "invisible noise."
- Fragile Control: If the AI is updated (the kitchen is renovated), the "invisible noise" might suddenly become visible or disappear. A steering method that worked yesterday might fail today, not because the AI got smarter, but because our "magic remote" was built on shaky ground.
- Interpretability Limits: We can't just look at the vector and say, "Ah, this direction represents 'truth'." We can only say, "This direction makes the AI act truthful in this specific way."
The Takeaway
The paper is a reality check for the AI community. It says: "Stop pretending we have a perfect map of the AI's brain."
We have found a way to steer the ship, but we don't know if we are steering with the rudder, the engine, or a hidden lever. There are infinite ways to get the ship to turn left. Until we find a way to rule out the "invisible" options, we can't claim to truly understand or control the AI's internal thoughts. We are just guessing which of the infinite remotes works best.