Structural Inference: Interpreting Small Language Models with Susceptibilities

This paper introduces a linear response framework that models neural networks as Bayesian statistical mechanical systems to efficiently compute susceptibility-based attribution scores, revealing a low-rank structure that isolates functional modules like multigram and induction heads in small transformers.

Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

Published Tue, 10 Ma

Imagine a neural network (like a small AI brain) as a complex, living ecosystem inside a black box. We know it can write code, tell jokes, and solve math problems, but we have no idea how the tiny gears inside are turning to make those decisions.

This paper introduces a new way to peek inside that black box, called Structural Inference. Instead of trying to take the machine apart piece by piece (which is hard and messy), the authors use a concept borrowed from physics called Susceptibility.

Here is the simple breakdown using everyday analogies:

1. The Core Idea: The "Magnetic" Test

In physics, if you put a piece of iron near a magnet, it gets pulled toward it. If you put a piece of wood near a magnet, nothing happens. Scientists measure this "pull" to understand what the object is made of. This pull is called magnetic susceptibility.

The authors treat the AI model like a piece of metal and the data (the text it was trained on) like the magnet.

  • The Experiment: They slightly change the "flavor" of the data the AI sees. For example, they mix in a little bit of GitHub code or legal documents into the AI's usual diet of random internet text.
  • The Reaction: They watch how the AI's internal parts (specifically, the "attention heads," which are like little specialized workers) react to this change.
  • The Measurement: If a specific worker suddenly gets very excited or very quiet when you add code to the mix, that worker has a high "susceptibility" to code.
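The experiment above can be sketched as a finite-difference estimate: nudge the data mixture by a small fraction and measure the slope of a per-head statistic. This is a toy illustration, not the paper's actual estimator (which uses a Bayesian linear response formula); `head_observable` and all the data here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_observable(weights, data):
    """Stand-in for a per-head statistic (e.g. that head's
    contribution to the loss) measured on a batch of data."""
    return float(np.mean(data @ weights))

def susceptibility(weights, base_data, flavor_data, eps=0.01):
    """Finite-difference sketch: how much does the head's statistic
    move when a fraction `eps` of the new data flavor is mixed
    into the base distribution?"""
    n = len(base_data)
    k = int(eps * n)  # swap an eps-fraction of the batch for flavor data
    mixed = np.vstack([flavor_data[:k], base_data[k:]])
    baseline = head_observable(weights, base_data)
    perturbed = head_observable(weights, mixed)
    return (perturbed - baseline) / eps  # linear-response slope

# Toy stand-ins: 8-dim "activations" drawn from two data flavors.
w = rng.normal(size=8)
web_text = rng.normal(size=(1000, 8))
code_text = rng.normal(loc=0.5, size=(1000, 8))

chi = susceptibility(w, web_text, code_text)
```

A head whose `chi` is far from zero is "susceptible" to that data flavor; one near zero barely notices the change.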

2. The Two Types of Reactions: "Expression" vs. "Suppression"

The paper discovers that these internal workers don't just react; they have personalities. They either express (encourage) or suppress (inhibit) certain patterns.

  • Expression (The Cheerleader): Imagine a worker who loves a specific pattern, like a repeated phrase ("The cat sat on the mat... The cat sat on the..."). When the AI sees this pattern, this worker says, "Yes! Predict the next word 'mat'!"
    • In the paper: This is a negative susceptibility. It means the worker is helping the pattern happen.
  • Suppression (The Bouncer): Imagine a worker who hates that same pattern. When the AI sees the repeated phrase, this worker says, "No! Don't predict 'mat'! That's too predictable!"
    • In the paper: This is a positive susceptibility. It means the worker is actively trying to stop the pattern.
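The sign convention above is easy to state in code. The head names and susceptibility values here are made up for illustration; only the sign rule (negative = expression, positive = suppression) comes from the paper.

```python
# Hypothetical susceptibility scores per attention head.
# Sign convention from the paper:
#   negative -> the head expresses (promotes) the pattern,
#   positive -> the head suppresses (inhibits) it.
head_susceptibility = {
    "head_0.1": -0.8,   # strongly expresses
    "head_0.3": 0.5,    # suppresses
    "head_1.2": -0.1,   # weakly expresses
}

def role(chi, threshold=0.05):
    """Classify a head by the sign of its susceptibility."""
    if chi < -threshold:
        return "expression"
    if chi > threshold:
        return "suppression"
    return "neutral"

roles = {head: role(chi) for head, chi in head_susceptibility.items()}
```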

Analogy: Think of a band playing a song.

  • The Expression workers are the musicians playing the melody.
  • The Suppression workers are the sound engineers turning down the volume on a specific instrument because it's too loud or distracting.
  • The paper shows that to understand the song, you need to know who is playing and who is muting what.

3. The Big Discovery: Finding the "Induction Circuit"

The authors applied this method to a small AI model (3 million parameters, which is tiny by today's standards) and found something amazing.

By looking at how different workers reacted to different data "flavors," they could group the workers into teams based on their jobs. They successfully identified a famous team known as the "Induction Circuit."

  • What is the Induction Circuit? It's a specific team of workers that lets the AI copy patterns it has already seen earlier in the same text. If the text mentions "Harry Potter" once, then the next time the AI sees "Harry", this circuit looks back, finds what followed last time, and predicts "Potter". In short: having seen "A B" once, it predicts "B" the next time "A" appears.
  • The Result: The paper didn't just guess where this circuit was; it located it empirically by showing that these specific workers react strongly to "induction patterns" in the data, while other workers react to different things (like word endings or brackets).
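The grouping of workers into teams can be sketched as follows: give each head a vector of susceptibilities (one entry per data "flavor") and compare the vectors. Heads with nearly parallel profiles belong to the same team, and a singular value decomposition exposes the low-rank structure the paper reports. The matrix below is synthetic, built so that two hypothetical teams exist by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical susceptibility matrix: rows = attention heads,
# columns = data "flavors" (e.g. code, legal text, repeated phrases).
# Heads on the same team share a response profile, up to small noise.
team_a = rng.normal(size=5)   # shared "induction-like" profile
team_b = rng.normal(size=5)   # shared "multigram-like" profile
chi = np.vstack(
    [team_a + 0.05 * rng.normal(size=5) for _ in range(3)]
    + [team_b + 0.05 * rng.normal(size=5) for _ in range(3)]
)

# Normalize rows and compare: cosine similarity near 1 means two
# heads react to the data flavors in almost exactly the same way.
unit = chi / np.linalg.norm(chi, axis=1, keepdims=True)
sim = unit @ unit.T

# The singular values reveal low-rank structure: a couple of
# directions explain nearly all of the variation across heads.
s = np.linalg.svd(chi, compute_uv=False)
```

Reading off the blocks of high similarity in `sim` is the toy analogue of how shared reaction profiles let the authors group heads into functional circuits.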

4. Why This Matters

Before this, finding these internal circuits was like searching for a needle in a haystack by pulling the haystack apart straw by straw. You had to guess, break things, and hope for the best.

This new method is like using a metal detector.

  • You don't need to dig up the whole beach.
  • You just scan the surface with a specific signal (a change in data).
  • The detector beeps exactly where the "metal" (the functional circuit) is hidden.

Summary

The paper says: "Don't just look at what the AI says; watch how it reacts when you change the world it lives in."

By treating the AI like a physical object that reacts to external forces (data changes), the authors created a map of the AI's internal brain. They found that the AI is organized into specialized teams, some of which act as cheerleaders for patterns and others as bouncers, and this new "susceptibility" tool is a powerful way to find them without breaking the machine.