Structural Inference: Interpreting Small Language Models with Susceptibilities

This paper introduces a linear response framework that models neural networks as Bayesian statistical mechanical systems to efficiently compute susceptibility-based attribution scores, revealing a low-rank structure that isolates functional modules like multigram and induction heads in small transformers.

Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

Published Tue, 10 Ma

Imagine a neural network (like a small AI brain) as a complex, living ecosystem inside a black box. We know it can write code, tell jokes, and solve math problems, but we have no idea how the tiny gears inside are turning to make those decisions.

This paper introduces a new way to peek inside that black box, called Structural Inference. Instead of trying to take the machine apart piece by piece (which is hard and messy), the authors use a concept borrowed from physics called Susceptibility.

Here is the simple breakdown using everyday analogies:

1. The Core Idea: The "Magnetic" Test

In physics, if you put a piece of iron near a magnet, it gets pulled toward it. If you put a piece of wood near a magnet, nothing happens. Scientists measure this "pull" to understand what the object is made of. This pull is called magnetic susceptibility.

The authors treat the AI model like a piece of metal and the data (the text it was trained on) like the magnet.

  • The Experiment: They slightly change the "flavor" of the data the AI sees. For example, they mix in a little bit of GitHub code or legal documents into the AI's usual diet of random internet text.
  • The Reaction: They watch how the AI's internal parts (specifically, the "attention heads," which are like little specialized workers) react to this change.
  • The Measurement: If a specific worker suddenly gets very excited or very quiet when you add code to the mix, that worker has a high "susceptibility" to code.
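The experiment above can be sketched as a finite-difference estimate: nudge the data mixture by a small fraction and measure the slope of a per-head statistic. This is a toy illustration, not the paper's actual estimator (which uses a Bayesian linear response formula); `head_observable` and all the data here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_observable(weights, data):
    """Stand-in for a per-head statistic (e.g. that head's
    contribution to the loss) measured on a batch of data."""
    return float(np.mean(data @ weights))

def susceptibility(weights, base_data, flavor_data, eps=0.01):
    """Finite-difference sketch: how much does the head's statistic
    move when a fraction `eps` of the new data flavor is mixed
    into the base distribution?"""
    n = len(base_data)
    k = int(eps * n)  # swap an eps-fraction of the batch for flavor data
    mixed = np.vstack([flavor_data[:k], base_data[k:]])
    baseline = head_observable(weights, base_data)
    perturbed = head_observable(weights, mixed)
    return (perturbed - baseline) / eps  # linear-response slope

# Toy stand-ins: 8-dim "activations" drawn from two data flavors.
w = rng.normal(size=8)
web_text = rng.normal(size=(1000, 8))
code_text = rng.normal(loc=0.5, size=(1000, 8))

chi = susceptibility(w, web_text, code_text)
```

A head whose `chi` is far from zero is "susceptible" to that data flavor; one near zero barely notices the change.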

2. The Two Types of Reactions: "Expression" vs. "Suppression"

The paper discovers that these internal workers don't just react; they have personalities. They either express (encourage) or suppress (inhibit) certain patterns.

  • Expression (The Cheerleader): Imagine a worker who loves a specific pattern, like a repeated phrase ("The cat sat on the mat... The cat sat on the..."). When the AI sees this pattern, this worker says, "Yes! Predict the next word 'mat'!"
    • In the paper: This is a negative susceptibility. It means the worker is helping the pattern happen.
  • Suppression (The Bouncer): Imagine a worker who hates that same pattern. When the AI sees the repeated phrase, this worker says, "No! Don't predict 'mat'! That's too predictable!"
    • In the paper: This is a positive susceptibility. It means the worker is actively trying to stop the pattern.
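The sign convention above is easy to state in code. The head names and susceptibility values here are made up for illustration; only the sign rule (negative = expression, positive = suppression) comes from the paper.

```python
# Hypothetical susceptibility scores per attention head.
# Sign convention from the paper:
#   negative -> the head expresses (promotes) the pattern,
#   positive -> the head suppresses (inhibits) it.
head_susceptibility = {
    "head_0.1": -0.8,   # strongly expresses
    "head_0.3": 0.5,    # suppresses
    "head_1.2": -0.1,   # weakly expresses
}

def role(chi, threshold=0.05):
    """Classify a head by the sign of its susceptibility."""
    if chi < -threshold:
        return "expression"
    if chi > threshold:
        return "suppression"
    return "neutral"

roles = {head: role(chi) for head, chi in head_susceptibility.items()}
```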

Analogy: Think of a band playing a song.

  • The Expression workers are the musicians playing the melody.
  • The Suppression workers are the sound engineers turning down the volume on a specific instrument because it's too loud or distracting.
  • The paper shows that to understand the song, you need to know who is playing and who is muting what.

3. The Big Discovery: Finding the "Induction Circuit"

The authors applied this method to a small AI model (3 million parameters, which is tiny by today's standards) and found something amazing.

By looking at how different workers reacted to different data "flavors," they could group the workers into teams based on their jobs. They successfully identified a famous team known as the "Induction Circuit."

  • What is the Induction Circuit? It's a specific team of workers that lets the AI copy patterns it has already seen earlier in the same text. If the text mentions "Harry Potter" once, then the next time the AI sees "Harry", this circuit looks back, finds what followed last time, and predicts "Potter". In short: having seen "A B" once, it predicts "B" the next time "A" appears.
  • The Result: The paper didn't just guess where this circuit was; it located it empirically by showing that these specific workers react strongly to "induction patterns" in the data, while other workers react to different things (like word endings or brackets).
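The grouping of workers into teams can be sketched as follows: give each head a vector of susceptibilities (one entry per data "flavor") and compare the vectors. Heads with nearly parallel profiles belong to the same team, and a singular value decomposition exposes the low-rank structure the paper reports. The matrix below is synthetic, built so that two hypothetical teams exist by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical susceptibility matrix: rows = attention heads,
# columns = data "flavors" (e.g. code, legal text, repeated phrases).
# Heads on the same team share a response profile, up to small noise.
team_a = rng.normal(size=5)   # shared "induction-like" profile
team_b = rng.normal(size=5)   # shared "multigram-like" profile
chi = np.vstack(
    [team_a + 0.05 * rng.normal(size=5) for _ in range(3)]
    + [team_b + 0.05 * rng.normal(size=5) for _ in range(3)]
)

# Normalize rows and compare: cosine similarity near 1 means two
# heads react to the data flavors in almost exactly the same way.
unit = chi / np.linalg.norm(chi, axis=1, keepdims=True)
sim = unit @ unit.T

# The singular values reveal low-rank structure: a couple of
# directions explain nearly all of the variation across heads.
s = np.linalg.svd(chi, compute_uv=False)
```

Reading off the blocks of high similarity in `sim` is the toy analogue of how shared reaction profiles let the authors group heads into functional circuits.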

4. Why This Matters

Before this, finding these internal circuits was like searching for a needle in a haystack by pulling the haystack apart straw by straw. You had to guess, break things, and hope for the best.

This new method is like using a metal detector.

  • You don't need to dig up the whole beach.
  • You just scan the surface with a specific signal (a change in data).
  • The detector beeps exactly where the "metal" (the functional circuit) is hidden.

Summary

The paper says: "Don't just look at what the AI says; watch how it reacts when you change the world it lives in."

By treating the AI like a physical object that reacts to external forces (data changes), the authors created a map of the AI's internal brain. They found that the AI is organized into specialized teams, some of which act as cheerleaders for patterns and others as bouncers, and this new "susceptibility" tool is a powerful way to find them without breaking the machine.