The Big Question: Do AI Models "Think" Abstractly?
Imagine you are teaching a robot to understand the concept of "Opposites" (like hot vs. cold, or big vs. small).
You can teach it in different ways:
- Open-ended: "Hot is to Cold as Big is to..."
- Multiple Choice: "Hot is to Cold. Big is to... (a) Small (b) Smart."
- Different Language: "Chaud est à Froid comme Grand est à..." (French).
The big question this paper asks is: Does the robot have one single, abstract "Opposite" switch in its brain that works no matter how you ask the question? Or does it have different switches for "Open-ended Opposites," "Multiple-Choice Opposites," and "French Opposites"?
The answer: it has both, but the two live in different parts of its brain.
The Two Types of "Vectors" (The Robot's Tools)
The researchers discovered that Large Language Models (LLMs) use two distinct types of internal tools to solve these tasks. They call them Function Vectors and Concept Vectors.
1. Function Vectors (FVs): The "Specialized Mechanics"
- What they are: These are the parts of the model that actually make the robot answer correctly. They are the "muscle" that drives the performance.
- The Catch: They are not abstract. They are like a mechanic who is great at fixing a specific type of car (e.g., a red sedan) but gets confused if you bring in a blue truck.
- The Problem: If you extract the "Function Vector" for "Opposites" from an English open-ended prompt, it looks completely different (almost like a different language) than the vector you get from a French multiple-choice prompt.
- Analogy: Think of FVs as custom-made keys.
- Key A opens the "English Open-Ended" door.
- Key B opens the "French Multiple-Choice" door.
- Even though both keys open a door to the "Opposite" room, they look nothing alike. If you try to use Key A in the French door, it won't work well.
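In interpretability work, a function vector is typically computed by averaging internal activations (in the original FV method, certain attention-head outputs) over many prompts of one task in one fixed format. The toy numpy sketch below uses purely synthetic activations, not a real model, to show why two such keys can end up looking nothing alike: each format's activations mix the shared idea with a strong format-specific component.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # toy hidden size


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def function_vector(activations):
    # In real FV work this would be an averaged attention-head output
    # over many prompts of ONE task in ONE format; here the activations
    # are purely synthetic.
    return activations.mean(axis=0)


# Each format's activations mix the shared "Opposite" idea with a
# strong format-specific component (the paint and cut of each key).
concept = rng.normal(size=D)
fmt_open_english = rng.normal(size=D)
fmt_mc_french = rng.normal(size=D)

acts_a = concept + 3 * fmt_open_english + rng.normal(size=(100, D))
acts_b = concept + 3 * fmt_mc_french + rng.normal(size=(100, D))

key_a = function_vector(acts_a)  # "English Open-Ended" key
key_b = function_vector(acts_b)  # "French Multiple-Choice" key

# Same underlying concept, yet the two keys barely resemble each other.
print(cosine(key_a, key_b))
```

The low cosine similarity is the whole "custom key" problem in one number: the format-specific component dominates, so vectors extracted for the same concept in different formats point in very different directions.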
2. Concept Vectors (CVs): The "Abstract Philosophers"
- What they are: These are parts of the model that understand the pure idea of "Opposites," regardless of whether it's English, French, or a multiple-choice quiz.
- The Catch: They are not the main drivers of the answer. They are like a philosopher who understands the theory of opposites perfectly but doesn't know how to turn the key to open the door.
- The Benefit: They are invariant. The "Opposite" concept in English looks exactly the same as the "Opposite" concept in French.
- Analogy: Think of CVs as a universal blueprint.
- Whether you are building a house in New York or Tokyo, the blueprint for "a door" is the same. It doesn't care about the paint color or the language on the sign.
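One intuition for why an invariant direction can exist at all: if every format-specific vector mixes the shared idea with its own format quirk, then averaging across many formats cancels the quirks and leaves the shared part. The sketch below illustrates that intuition with synthetic vectors; it is not the paper's actual procedure for locating Concept Vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # toy hidden size


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


concept = rng.normal(size=D)        # the shared "blueprint" direction
formats = rng.normal(size=(50, D))  # 50 format-specific quirks

# Each format-specific vector mixes the idea with its format's quirk.
per_format_vectors = concept + 3 * formats  # shape (50, D)

# Averaging across many formats cancels the quirks, leaving a vector
# close to the pure concept direction.
cv_estimate = per_format_vectors.mean(axis=0)

print(cosine(cv_estimate, concept))  # high: the blueprint survives averaging
```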
The Experiment: Steering the Robot
To prove this, the researchers tried to "steer" the robot. Imagine the robot is stuck in a hallway, and you want to push it toward the "Opposite" room instead of the "Translation" room.
Using Function Vectors (The Keys):
- If you use the "English Open-Ended Key" to steer the robot, it works amazingly well if the robot is currently facing an English Open-Ended prompt.
- But: If you try to use that same key to steer the robot when it is facing a French prompt, it fails. The robot gets confused and might start speaking French or get stuck on the format of the question.
- Result: Great for the specific situation, terrible for new situations.
Using Concept Vectors (The Blueprint):
- If you use the "Universal Blueprint" to steer the robot, it works consistently across all situations (English, French, Multiple Choice).
- But: The push is weaker. It doesn't force the robot to answer as strongly as the specialized keys do.
- Result: It's not the strongest push, but it works everywhere without breaking.
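"Steering," in this line of work, usually means adding a scaled copy of the extracted vector to the model's hidden states at some layer during the forward pass (in practice via a forward hook). A minimal sketch of just the arithmetic, with synthetic hidden states and a hypothetical `steer` helper:

```python
import numpy as np


def steer(hidden_states, vector, alpha):
    # Activation steering: nudge every token's hidden state toward
    # `vector`. In a real model this addition happens inside a forward
    # hook at a chosen layer; here we only show the arithmetic.
    return hidden_states + alpha * vector


def mean_cosine_to(hidden_states, vector):
    dots = hidden_states @ vector
    norms = np.linalg.norm(hidden_states, axis=1) * np.linalg.norm(vector)
    return float(np.mean(dots / norms))


rng = np.random.default_rng(2)
hidden = rng.normal(size=(5, 64))   # 5 tokens, toy hidden size 64
opposite_vec = rng.normal(size=64)  # an extracted FV or CV (synthetic here)

gentle = steer(hidden, opposite_vec, alpha=1.0)  # CV-style: weaker, safer push
strong = steer(hidden, opposite_vec, alpha=4.0)  # FV-style: hard shove

print(mean_cosine_to(hidden, opposite_vec),
      mean_cosine_to(gentle, opposite_vec),
      mean_cosine_to(strong, opposite_vec))
```

The scaling factor `alpha` is the "push strength" from the analogy: a larger `alpha` aligns the hidden states more strongly with the vector, which is why FV steering is powerful in-distribution but can also drag format baggage along with it.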
The Key Takeaways (In Plain English)
- Causality is not Invariance: Just because a part of the AI causes the correct answer (Function Vector), it doesn't mean that part represents the abstract idea (Concept Vector). The AI uses one part to "do" the task and a different part to "understand" the task.
- The "Format Trap": The parts of the AI that actually get the job done (Function Vectors) are heavily influenced by how you ask the question. They mix the "idea" with the "format."
- Example: The "Opposite" vector for a multiple-choice question accidentally includes the shape of the answer brackets, like "(a)" and "(b)".
- Abstract Understanding Exists: The AI does have a pure, abstract understanding of concepts (Concept Vectors), but these are hidden in a different part of the network than the parts that actually generate the text.
- The Trade-off:
- Want the AI to perform perfectly on a specific type of test? Use Function Vectors.
- Want the AI to generalize and understand the core idea across different languages and formats? Use Concept Vectors.
The Final Metaphor: The Orchestra
Imagine the AI is an orchestra playing a song called "Opposites."
- Function Vectors are the Lead Violinist. They are loud, they drive the melody, and they make the song sound great. But they only play well if the sheet music is written in a specific style (e.g., Classical). If you give them Jazz sheet music, they get confused.
- Concept Vectors are the Conductor. They understand the spirit of the song perfectly, whether it's Classical, Jazz, or Rock. They know exactly what "Opposites" means. However, they don't play an instrument, so they can't make the sound as loud as the violinist.
The paper shows that to make the AI truly smart and flexible, we need to realize that the Conductor (Concept) and the Violinist (Function) are two different people doing two different jobs, even though they are in the same orchestra.