Here is an explanation of the paper "From Data Statistics to Feature Geometry: How Correlations Shape Superposition" using simple language, analogies, and metaphors.
The Big Idea: The "Overcrowded Apartment" Problem
Imagine you have a tiny apartment (the neural network's brain) with only 10 rooms. However, you need to store 1,000 different items (concepts like "Christmas," "snow," "January," or "dog").
In the old way of thinking about AI, scientists believed the only way to fit 1,000 items into 10 rooms was to stack them on top of each other. This is called Superposition.
- The Old View: You throw all the items in a pile. When you want to find "Christmas," you dig through the pile. But because "Christmas" is sitting on top of "January," they might get mixed up. The AI has to be very careful to "filter out" the noise so it doesn't confuse the two. It's like trying to listen to one person in a crowded, noisy room; you have to ignore everyone else.
- The New Discovery: This paper argues that the AI doesn't just throw things in a messy pile. Instead, it organizes the pile based on how the items relate to each other. If "Christmas" and "January" always appear together in real life, the AI puts them right next to each other in the room. They don't just coexist; they actually help each other.
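The overcrowded-apartment picture can be made concrete in a few lines of numpy (a toy sketch, not the paper's actual model; the sizes are just the numbers from the analogy): give 1,000 features directions in a 10-dimensional space and watch them overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 1000, 10   # 1,000 "items", only 10 "rooms"

# Each feature gets a unit-length direction in the tiny space (the "pile").
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Store a single active feature, say "Christmas" = feature 0.
x = W[0]

# Reading out every feature: the true one scores 1.0,
# but overlapping features produce nonzero "noise" scores too.
scores = W @ x
signal = scores[0]                      # = 1.0 (the item we stored)
worst_noise = np.abs(scores[1:]).max()  # > 0 (interference from neighbors)
```

With random directions the noise is unavoidable; the paper's point is that the directions the network actually learns are anything but random.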
The Core Concept: "Constructive Interference"
The paper introduces a new idea called Constructive Interference.
- The Old Metaphor (Noise): Imagine you are trying to hear a friend speak, but your other friends are shouting. Their voices are just noise that drowns out your friend. You have to use a noise-canceling filter (like a ReLU activation function) to silence the others so you can hear your friend.
- The New Metaphor (The Choir): Imagine your friends are singing in a choir. If they are all singing the same song (because they are correlated), their voices add up to make the song louder and clearer.
- In the AI, if the word "December" appears, it helps the AI understand "Christmas" because they often go together. The AI uses the "noise" of December to actually boost the signal for Christmas. It's not a mistake; it's a feature!
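A two-vector toy makes the choir concrete (a sketch under made-up angles, not values from the paper): if "Christmas" and "December" are placed close together, December's contribution adds to the Christmas read-out instead of drowning it.

```python
import numpy as np

# Toy 2D feature space; the angle between vectors controls their overlap.
def unit(theta):
    return np.array([np.cos(theta), np.sin(theta)])

w_christmas = unit(0.0)
w_december  = unit(0.3)   # placed nearby: strong positive overlap (correlated)
w_july      = unit(2.0)   # placed far away: weak or negative overlap

# If "Christmas" and "December" co-occur, the stored vector is their sum.
stored = w_christmas + w_december

# Reading out "Christmas": December's overlap ADDS to the signal.
christmas_readout = w_christmas @ stored   # > 1: constructive interference

# Reading out "July": a much weaker response.
july_readout = w_july @ stored
```

The "noise" term `w_christmas @ w_december` is positive by construction, so co-occurrence boosts the signal rather than corrupting it.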
The Experiment: "Bag-of-Words Superposition" (BOWS)
To prove this, the authors built a controlled playground called BOWS.
- The Setup: They took internet text (like Wikipedia) and turned it into simple lists of words (e.g., "The cat sat on the mat" becomes the list [cat, sat, mat]).
- The Game: They forced a computer model to compress these lists into a tiny space (superposition) and then try to rebuild the original list.
- The Result: The computer didn't just make a messy pile. It naturally arranged the words into semantic clusters (groups of related words) and circles (like the months of the year).
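The game above can be sketched as a tiny autoencoder (a minimal illustration of the setup, not the authors' actual code; the tied weights, sizes, and word indices here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, hidden = 50, 8   # toy sizes; the real vocabulary is much larger

def encode_bag(word_ids):
    """Turn a word list like [cat, sat, mat] into a multi-hot vector."""
    x = np.zeros(vocab)
    x[word_ids] = 1.0
    return x

# One weight matrix maps words into the tiny hidden space and back out.
W = rng.normal(scale=0.1, size=(vocab, hidden))
b = np.zeros(vocab)

def reconstruct(x):
    h = x @ W                             # squeeze 50 words into 8 dimensions
    return np.maximum(h @ W.T + b, 0.0)   # ReLU read-out of the original list

x = encode_bag([3, 7, 19])                # a three-word "sentence"
error = np.sum((reconstruct(x) - x) ** 2) # the loss training would shrink
```

Training this loss on real word lists is what lets the correlations in the data sculpt the geometry of `W`.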
Why Do We See Circles and Clusters?
You might have seen in other AI research that features like "months" form a perfect circle in the computer's brain.
- The Old Explanation: "The AI is just trying to minimize errors, so it arranges them in a circle to keep them far apart."
- The New Explanation: The months form a circle because each month is strongly correlated with its neighbors, and the sequence wraps around: December leads back into January. The AI arranges them in a circle because that's the most efficient way to represent those neighbor relationships.
- If you think of the months as points on a clock, the AI realizes that "December" and "January" are neighbors. By placing them next to each other, the AI can use the "December" signal to help reconstruct "January" without needing extra space.
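The clock picture is easy to check numerically (a toy construction, not geometry extracted from a real model): place the 12 months evenly on a circle and compare overlaps.

```python
import numpy as np

# Place the 12 months evenly on a circle (the "clock" picture).
angles = 2 * np.pi * np.arange(12) / 12
months = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def overlap(i, j):
    """Dot product of two unit month vectors (their cosine similarity)."""
    return months[i] @ months[j]

jan_feb = overlap(0, 1)    # neighbors: cos(30 deg), about 0.87
dec_jan = overlap(11, 0)   # the wrap-around pair: also about 0.87
jan_jul = overlap(0, 6)    # opposite months: cos(180 deg) = -1
```

Neighbors overlap strongly, including December and January across the wrap-around, while opposite months barely help each other; that is exactly the correlation structure of months in real text.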
The "Weight Decay" Secret Sauce
The paper found that this smart organization happens most when the AI is trained with a specific setting called Weight Decay.
- Analogy: Think of Weight Decay as a strict landlord who charges rent based on how much "space" (mathematical weight) you use.
- The Result: To save money (minimize weight), the AI stops trying to give every single word its own private room. Instead, it realizes, "Hey, if I group 'sports' words together, I can share the same furniture." This forces the AI to use the Constructive Interference strategy to be efficient.
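The landlord's rent has a simple mathematical form (a generic L2 weight-decay sketch; the penalty strength and step size are made-up hyperparameters, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 4))   # some toy weights

def loss_with_rent(W, reconstruction_error, weight_decay=0.01):
    # The "rent": weight decay charges for the squared size of every weight.
    rent = weight_decay * np.sum(W ** 2)
    return reconstruction_error + rent

# The rent's gradient pulls every weight toward zero at each step,
# so directions that several words can share become the cheap option.
grad_rent = 2 * 0.01 * W        # d(rent)/dW
W_next = W - 0.1 * grad_rent    # one gradient step on the rent alone
```

Under this pressure, giving every word a private direction is expensive; reusing a shared direction for correlated words pays less rent for the same reconstruction quality.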
Two Types of "Features"
The paper also distinguishes between two types of things the AI learns:
- Presence-Coding (The "Is it there?" detector): "Is this a cat?" The AI just needs to know if the concept exists. These rely on the correlations we discussed (grouping cats with other animals).
- Value-Coding (The "How much?" calculator): "What is the angle?" or "What is the coordinate?" The AI learns to represent numbers or coordinates linearly.
- Example: If an AI learns to do math with numbers, it might arrange them in a spiral (helix). This isn't because the numbers are "noisy" neighbors; it's because the math requires a specific geometric shape to work.
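The helix can be sketched directly (a toy embedding loosely inspired by the "numbers on a helix" finding; the function and period here are invented for illustration):

```python
import numpy as np

def helix(n, period=10):
    """Embed a number as a circle position (n mod period) plus a linear height."""
    theta = 2 * np.pi * n / period
    return np.array([np.cos(theta), np.sin(theta), n / period])

point_3  = helix(3)
point_13 = helix(13)   # same circle position as 3, one turn "higher"
```

The circular coordinates make "same last digit" geometrically obvious, while the third coordinate carries the magnitude linearly: the shape exists to serve the computation, not to dodge noise.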
Why Does This Matter?
- Better AI Understanding: We used to think AI features were messy and needed to be "cleaned up" by filters. Now we know they are often organized by logic and relationships.
- Better AI Design: If we know that "correlations help," we can design AI models that are smaller, faster, and more efficient because they stop fighting against the data's natural structure and start working with it.
- Explaining the "Magic": It explains why AI models naturally develop "circles" for months or "clusters" for sports. It's not magic; it's just the AI finding the most efficient way to pack a suitcase when the items inside are related.
Summary in One Sentence
This paper argues that AI models don't just jam information into a small space and hope for the best; instead, they cleverly organize related concepts together so that they help each other, turning what we thought was "noise" into a helpful signal.