Imagine a Transformer language model (like the AI behind this text) not as a static brain, but as a factory assembly line.

For a long time, researchers thought that when the AI learned a concept—like "credibility" or "refusal"—it happened at one specific station on that line. They would look for the single "best layer" where the idea was clearest, like finding the one moment in a movie where a character's face is most clearly visible.

This paper argues that view is too simple. Instead of a single snapshot, concepts are processes. They are built gradually, moving through a specific zone of the assembly line. The author calls this the Concept Allocation Zone (CAZ).

Here is the breakdown of how this works, using everyday analogies:

1. The Assembly Line vs. The Snapshot

Think of the AI's "residual stream" (the data flowing through the model) as a conveyor belt.

The Old Way: Researchers used to stop the belt at one specific point, take a photo, and say, "Here is where the concept lives."
The New Way (CAZ): The paper says, "No, the concept is being built as it moves." It starts as a vague idea, gets refined, maybe gets passed to a different part of the belt, and finally settles. The CAZ is the entire stretch of the conveyor belt where the model is actively organizing its internal geometry to make that concept distinct.

2. Three Tools to Watch the Build

To track this process, the author invented three "sensors" that measure what's happening at every station on the line:

Separation (The Distance): Imagine two groups of people (e.g., "Credible" vs. "Not Credible"). At the start of the line, they are all mixed up in a crowd. As they move down the line, the "Credible" group starts walking to the left and the "Not Credible" group to the right. Separation measures how far apart they are.
Coherence (The Order): Sometimes the groups are far apart but still messy and scattered. Coherence measures whether each group is walking in a neat, tight line or milling about as a chaotic mob. A high score means the concept has "crystallized" into a clear shape.
Velocity (The Speed of Change): This measures how fast the groups are moving apart. If the distance is increasing rapidly, the concept is being built right now. If the distance stops changing, the concept is finished. If the groups start moving back together, the concept is being dropped or changed.

3. The "Gentle" Zones

The paper discovered something surprising: concepts don't just have one big peak. They often have multiple zones.

Major CAZ: The big, obvious peak where the concept is strongest.
Gentle CAZ: Smaller, subtler zones that standard tools miss. The paper found that even these "gentle" zones are real and active. When they were switched off in ablation experiments, the model separated the concept less cleanly in 93–100% of cases. It's like finding small, hidden gears in a clock that you didn't know were turning: stop them, and the clock keeps worse time.

4. Concepts Have "Sub-Representations"

Sometimes, a concept like "credibility" appears twice on the assembly line:

Shallow Zone: Near the beginning, the AI might recognize credibility just because of specific words (like "reliable" or "trust").
Deep Zone: Further down the line, the AI re-evaluates it based on the whole story and context.
The paper shows these are actually different geometric shapes in the AI's mind. They are two different ways of understanding the same word, occurring at different depths.

5. The "Handoff"

Because concepts move and change shape, the paper suggests that if you want to intervene (change the AI's behavior), you shouldn't just pick the "best" layer. You should wait until the concept has finished its journey and "settled" into a stable shape. This is called the handoff layer.

Analogy: If you are trying to catch a ball, you don't grab at it while it is still leaving the thrower's hand (the assembly phase); you wait until it is on a steady path through the air (the handoff).

6. The "Universal" Pattern

The paper tested this on 34 different AI models from 8 architectural families. It found that while different models have different numbers of layers, they tend to organize concepts in a similar relative order.

Analogy: Imagine two different factories. One has 10 stations, the other has 100. They both build a car. In both factories, the engine is built in the first 20% of the line, and the paint job happens in the last 20%. The percentage of the line is the same, even if the total length is different. The paper confirms that AI models follow this same "depth-stratified" blueprint.

Summary of What Was Tested

The author made 7 specific predictions to test this theory. Here is the verdict in plain English:

Prediction 1 (Where to cut): They thought cutting the middle of the zone was best. False. It depends on the model; sometimes cutting the end is better.
Prediction 2 (Order): They thought the order of concepts is the same across all models. Mostly True. The order is consistent, but not perfectly rigid.
Prediction 3 (Width): They thought complex ideas take up more space on the line. Maybe. The data hints at this, but more testing is needed.
Prediction 4 (The End): They thought concepts get messy at the very end. Not Testable. The theory of "one messy end" was wrong because concepts often have multiple peaks, so there isn't just one "end" to measure.
Prediction 5 (Alignment): They thought matching the depth (percentage of the line) between models is key. True. This is the strongest finding: comparing the "middle" of one model to the "middle" of another gives stronger alignment than comparing mismatched depths.
Prediction 6 (Words vs. Context): They thought early zones are just about words and deep zones are about context. False. The early zones aren't just raw words; they are already processed.
Prediction 7 (Architecture): They thought the number of "peaks" depends on the model type, not its size. Unknown. The test wasn't big enough to say for sure.

The Bottom Line

This paper shifts the view of AI from a static map (where is the concept?) to a dynamic movie (how does the concept form?). It introduces a way to measure the "construction zone" of ideas, revealing that AI models build complex thoughts in stages, often using multiple hidden steps that previous methods missed.

Technical Summary: The Concept Allocation Zone (CAZ)

Problem Statement

Current mechanistic interpretability methods predominantly rely on a "best layer" heuristic, identifying a single optimal layer in a Transformer's residual stream where a concept's representation achieves maximum class separation (e.g., via linear probing or Difference-of-Means). While computationally efficient, this approach treats concept formation as a static snapshot rather than a dynamic process. It fails to capture the iterative, depth-extended nature of how concepts are assembled, organized, and potentially reallocated across the model's layers. Consequently, single-layer methods may miss transitional representations, subtle allocation regions, and the geometric dynamics of concept construction.

Methodology

The paper introduces the Concept Allocation Zone (CAZ) framework, which redefines concept representation as a contiguous region of model depth rather than a single point. The framework relies on three layer-wise metrics computed from the residual stream activations:

Separation ($S(l)$): A Fisher-normalized centroid distance between contrastive classes at layer $l$. This measures how easily the model distinguishes between two classes (e.g., credible vs. non-credible text) at a specific depth.
Concept Coherence ($C(l)$): The explained variance ratio of the first principal component of the pooled activation matrix. This quantifies whether the concept is encoded as a single, clean geometric direction or is smeared across multiple dimensions.
Concept Velocity ($v(l)$): The smoothed rate of change of the separation metric across layers. Positive velocity indicates active construction of the concept, while negative velocity indicates degradation or reallocation.
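
The three metrics can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the exact Fisher normalization and the smoothing kernel are not specified in this summary, so the pooled-variance denominator and the moving-average window below are assumed choices, and all function names are hypothetical. The activations are synthetic stand-ins for residual-stream data.

```python
import numpy as np

def separation(pos, neg, eps=1e-8):
    """Centroid distance divided by pooled within-class spread (one Fisher-style choice)."""
    gap = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))
    # Assumption: spread is the root of the average total within-class variance.
    spread = np.sqrt(0.5 * (pos.var(axis=0).sum() + neg.var(axis=0).sum()))
    return float(gap / (spread + eps))

def coherence(pos, neg):
    """Explained-variance ratio of the first principal component of the pooled activations."""
    pooled = np.vstack([pos, neg])
    centered = pooled - pooled.mean(axis=0)
    variances = np.linalg.svd(centered, compute_uv=False) ** 2
    return float(variances[0] / variances.sum())

def velocity(sep_curve, window=3):
    """Moving-average smoothing of S(l) (odd window assumed), then its layer-wise gradient."""
    s = np.asarray(sep_curve, dtype=float)
    padded = np.pad(s, window // 2, mode="edge")  # repeat end values to avoid edge dips
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    return np.gradient(smoothed)

# Synthetic data: two classes drift apart along one axis as depth increases.
rng = np.random.default_rng(0)
axis = np.zeros(16)
axis[0] = 1.0
shifts = [0.0, 0.5, 1.0, 2.0, 3.0, 3.0]  # one entry per layer
layers = [(rng.normal(size=(200, 16)) + s * axis,
           rng.normal(size=(200, 16)) - s * axis) for s in shifts]

sep_curve = [separation(p, n) for p, n in layers]
coh_curve = [coherence(p, n) for p, n in layers]
vel_curve = velocity(sep_curve)
```

On this toy data the separation and coherence curves both rise with depth, and the velocity is positive through the middle layers where the classes are pulling apart.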

Detection and Extraction

The framework employs a scored detection method to identify CAZ boundaries without manual layer sweeps. Unlike fixed-threshold peak detection, this method uses a composite score incorporating prominence, coherence, and region width. This allows for the identification of:

Major/Strong CAZes: High-prominence, concentrated allocation regions.
Gentle CAZes: Subtle allocation regions (score < 0.05) that are often invisible to standard peak detection but are empirically shown to be causally active.
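
As a rough illustration of scored detection, the sketch below finds every strict local maximum of a separation curve, measures a simplified prominence and width for it, and combines those with coherence into a single score. The composite formula (a plain product) is an assumption made for illustration, since the paper's actual weighting is not given in this summary. The 0.05 cutoff is taken from the text above, but it belongs to the paper's own score and may not transfer to this hypothetical one. The function name and the input curves are made up.

```python
import numpy as np

def detect_zones(sep_curve, coh_curve, gentle_threshold=0.05):
    """Score every strict local maximum of S(l) and label it major or gentle."""
    s = np.asarray(sep_curve, dtype=float)
    c = np.asarray(coh_curve, dtype=float)
    n = len(s)
    zones = []
    for peak in range(1, n - 1):
        if not (s[peak] > s[peak - 1] and s[peak] > s[peak + 1]):
            continue  # only strict local maxima; plateaus are ignored in this sketch
        # Walk downhill on each side until the curve turns upward again.
        left = peak
        while left > 0 and s[left - 1] < s[left]:
            left -= 1
        right = peak
        while right < n - 1 and s[right + 1] < s[right]:
            right += 1
        prominence = s[peak] - max(s[left], s[right])  # height above the higher valley
        width = (right - left) / (n - 1)               # fraction of model depth
        # Hypothetical composite score: normalized prominence x coherence x width.
        score = (prominence / (s.max() + 1e-8)) * c[peak] * width
        kind = "gentle" if score < gentle_threshold else "major"
        zones.append({"start": left, "peak": peak, "end": right,
                      "score": float(score), "kind": kind})
    return zones

# Made-up curves: one large peak at layer 6 with small bumps at layers 2 and 9.
sep_curve = [0.1, 0.2, 0.25, 0.22, 0.3, 0.8, 1.5, 1.2, 0.9, 1.0, 0.95]
coh_curve = [0.6] * len(sep_curve)
zones = detect_zones(sep_curve, coh_curve)
```

The point of scoring rather than thresholding is visible even here: the two small bumps are kept and labeled instead of being discarded, so they remain available for later causal tests.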

The framework distinguishes between embedding CAZes (driven by token-level features at the input boundary) and active CAZes (driven by attention and MLP computations within the transformer layers).

For concept extraction, the paper validates Geometric Evolution Maps (GEM), which track the directional trajectory of a concept. It finds that concept directions often undergo substantial rotation within a CAZ and only stabilize at a "handoff layer" post-CAZ. Probing at this handoff layer is often more precise than probing at the separation peak, particularly in Multi-Head Attention (MHA) architectures.
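
One simple way to picture the handoff idea is to track a per-layer concept direction and look for the first layer after the CAZ where that direction stops rotating. The sketch below is an assumption-laden illustration, not the GEM procedure itself: it uses a difference-of-means direction, a cosine threshold of 0.95 chosen arbitrarily, and hypothetical function names.

```python
import numpy as np

def concept_direction(pos, neg):
    """Unit difference-of-means direction between the two classes."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def find_handoff(directions, caz_end, min_cosine=0.95):
    """First layer after caz_end whose direction nearly matches the previous layer's."""
    for layer in range(caz_end + 1, len(directions)):
        cosine = float(directions[layer] @ directions[layer - 1])
        if cosine >= min_cosine:
            return layer
    return None  # the direction never settled

# Toy trajectory: a unit direction rotating in a plane, then stabilizing.
angles = np.deg2rad([0, 30, 60, 80, 88, 90, 90])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
handoff = find_handoff(directions, caz_end=2)
```

In this toy case the direction still turns by 20 degrees between layers 2 and 3, so layer 3 is rejected, and layer 4 is the first one that counts as settled.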

Key Contributions

The CAZ Framework: A formal definition of concept allocation as a depth-localized interval where the model organizes geometry to serve a concept, distinct from the concept itself.
Three Layer-Wise Metrics: The formalization of Separation, Coherence, and Velocity to characterize concept formation as a process.
Scored Detection: A principled method for identifying a spectrum of allocation regions, revealing "gentle CAZes" that standard methods miss.
Sub-Representation Discovery: Empirical evidence that single human concept labels (e.g., "credibility") map to multiple, geometrically distinct sub-representations at different processing depths (shallow vs. deep), separated by abrupt phase transitions.
Depth-Stratified Alignment: A refined view of the Platonic Representation Hypothesis, demonstrating that cross-architecture alignment is strongest when concepts are matched by processing depth (proportional layer index) rather than absolute layer index or architecture family.
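
Matching by proportional depth can be made concrete with a small helper. This is a sketch under one assumed convention (depth 0.0 at the first layer, 1.0 at the last); the paper may index depth differently, and the function names are hypothetical.

```python
def relative_depth(layer, n_layers):
    """Map a zero-based layer index to a proportional depth in [0, 1]."""
    return layer / (n_layers - 1)

def depth_matched_layer(layer_a, n_layers_a, n_layers_b):
    """Layer in model B at the same proportional depth as layer_a in model A."""
    return round(relative_depth(layer_a, n_layers_a) * (n_layers_b - 1))

# Layer 6 of a 12-layer model sits at about 55% depth,
# which corresponds to layer 17 of a 32-layer model.
matched = depth_matched_layer(6, 12, 32)
```

Comparing layer 6 with layer 6 across these two models would pair a mid-depth representation with a shallow one, which is the mismatch that depth-matched comparison avoids.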

Empirical Results

The framework was validated across 34 models from 8 architectural families (including Pythia, GPT-2, OPT, Qwen 2.5, Gemma 2, Llama 3.2, Mistral, and Phi) and 7 concepts.

Multimodality: The separation curve $S(l)$ is frequently multimodal, meaning it has several peaks rather than one. A single concept typically participates in multiple CAZes (mean 3.4 per concept per model).
Causal Activity of Gentle CAZes: Ablation studies on 16 of 34 models (extended to 26 base models in companion work) show that suppressing "gentle CAZes" (score < 0.05) reduces geometric separation in 93–100% of cases, confirming their causal role despite being invisible to standard detection.
Prediction Verdicts:
- Supported (P5): Cross-architecture alignment is depth-matched. Sub-representations at matched processing depths align more strongly than mismatched depths.
- Partially Supported (P2): CAZ boundaries show a consistent relative ordering across architectures (shallow to deep), though this is a statistical tendency rather than a strict invariant.
- Not Supported (P1, P6): Optimal ablation depth is not universally mid-CAZ (it depends on encoding redundancy), and shallow peaks do not directly correlate with raw token embeddings.
- Not Testable as Stated (P4): The premise of a single post-CAZ degradation region was invalidated by the discovery of multimodal allocation.
- Exploratory/Indeterminate (P3, P7): The correlation between CAZ width and concept abstraction (P3), and the link between multimodality prevalence and architecture (P7), both require further data.
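
The ablation results above rest on removing a concept direction and checking how much class separation survives. The projection step can be sketched as follows, assuming a difference-of-means direction and using the raw centroid gap as a stand-in for the separation metric. In a real model the projection would be applied inside the forward pass and the effect measured at downstream layers; here both steps happen on one synthetic activation matrix, and all names are hypothetical.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove each activation's component along `direction` (directional ablation)."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def centroid_gap(pos, neg):
    """Unnormalized stand-in for separation: distance between class centroids."""
    return float(np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0)))

# Synthetic activations: the two classes differ mainly along the first axis.
rng = np.random.default_rng(0)
axis = np.zeros(8)
axis[0] = 1.0
pos = rng.normal(size=(400, 8)) + 2.0 * axis
neg = rng.normal(size=(400, 8)) - 2.0 * axis

# Estimate the direction on one half of the data, evaluate on the held-out half.
direction = pos[:200].mean(axis=0) - neg[:200].mean(axis=0)
gap_before = centroid_gap(pos[200:], neg[200:])
gap_after = centroid_gap(ablate_direction(pos[200:], direction),
                         ablate_direction(neg[200:], direction))
```

Estimating the direction on held-out data keeps the check from being trivially true: the residual gap is small but not exactly zero, because the estimated direction is only an approximation of the true one.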

Significance and Claims

The paper claims that the CAZ framework shifts the interpretability paradigm from anatomy (locating where a concept is most visible) to dynamical flow (tracking how a concept forms).

Refinement of Interpretability: It provides a geometric basis for selecting intervention depths, suggesting that ablation at different points in the CAZ chain produces qualitatively different effects.
Connection to "Dark Matter": The framework hypothesizes that the structured residual unexplained by Sparse Autoencoders (SAEs) may correspond to in-progress concept construction within CAZes—transitional representations that resist linear decomposition at any single layer.
Alignment Training Insights: CAZ profiles offer a metric for quantifying how instruction tuning distorts concept allocation, revealing that tuning does not uniformly shift concepts to shallower depths but alters allocation based on the base model's existing geometry.
Depth-Stratified Convergence: The strongest empirical result is the confirmation that cross-architecture alignment is a depth-stratified phenomenon, supporting a refined version of the Platonic Representation Hypothesis where convergence occurs at proportional processing stages rather than globally.

The paper emphasizes that the CAZ is not the concept itself, but the depth region where the computational event of geometric organization occurs. Multiple concepts may share a CAZ, and a single concept typically participates in multiple CAZes across depth. The reference implementation is provided in the open-source rosetta_tools library.

The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth