The Big Idea: Finding the "Hidden Switches" in AI

Imagine a huge, complex machine (like a neural network) that has learned to perform a task, such as adding numbers or writing stories. You can watch the machine at work, but you cannot see how it thinks. It is like looking into a black box: you put a number in, and another number comes out, yet the gears inside remain hidden.

Scientists want to open the box and find the specific "switches" or "knobs" inside that the machine uses to understand concepts like "grammar," "addition," or "sentiment." This is called mechanistic interpretability.

The problem is that the machine has millions of knobs, all tangled together. Picking one at random is like trying to find a specific needle in a haystack by guessing.

Jennifer Lin's paper proposes a new, clever way to find these needles. Instead of guessing, the author uses a mathematical tool called the Empirical Neural Tangent Kernel (eNTK).

The Analogy: The "Echo Chamber" Test

Imagine the neural network as a giant echo chamber. If you shout a specific word (a feature like "noun" or "add 5"), the sound echoes around the room and hits the walls (the model's parameters) in a very specific pattern.

The eNTK is like a highly sensitive microphone that records how the entire room vibrates when you shout.

If you shout "noun," the room vibrates in a specific rhythm.
If you shout "verb," it vibrates in a different rhythm.

The author's hypothesis is: If we analyze the strongest vibrations (the "principal eigen-directions") in this echo chamber, we can pinpoint exactly which words were shouted.

In technical terms, the paper claims that the dominant patterns in how the model's outputs respond to small changes in its parameters reveal the directions the model uses to detect features.

The Three Experiments: From Simple Math to Large Language Models

The author tested this "echo chamber" idea on three different machine types, each becoming increasingly complex.

1. The Simple Math Machine (MLP)

The Task: A simple machine learned to add numbers modulo a prime number (a specific type of math puzzle).
The "Truth": We already knew the secret recipe the machine used: it transformed numbers into waves (Fourier features), for instance, by converting a number into a sine wave.
The Result: The author used the eNTK to listen to the machine. The strongest vibrations found by the eNTK matched the "sine wave" recipe perfectly.
The "Grokking" Moment: There is a phenomenon called "grokking," where a model suddenly shifts from failing a test to solving it perfectly after a long period of mere memorization. The paper found that the alignment between the eNTK vibrations and the mathematical features grew throughout training, and grew fastest right around the moment the machine "grokked" (understood the math). It is as if, just as the machine finally "got it," the echo chamber was tuning itself to the right song most quickly.

2. The Slightly Smarter Math Machine (Transformer)

The Task: A somewhat more complex machine (a Transformer) learned the same math puzzle.
The Difference: This machine did not use every possible wave; it relied on a small handful of frequencies, and which ones it picked varied with the random seed used in training.
The Result: Even though the chosen frequencies could not be predicted in advance, the eNTK still found them. It successfully identified the specific "notes" the machine used for math.

3. The Large Language Model (Gemma-3-270M)

The Task: This is a real, pre-trained language model (like a mini-version of the AI you chat with) that reads stories.
The Challenge: Here, we do not know the "secret recipe." We only want to see if the machine can recognize grammar (such as nouns, verbs, or past tense).
The Test: The author took a small set of stories and asked: "Can the eNTK vibrations tell us which words are nouns?"
The Comparison: She compared the eNTK method with PCA (a standard, older method that finds the directions along which the machine's internal activity varies the most).
The Result: The eNTK method was better. It found the "grammar switches" more accurately than the standard method. For example, it was better at recognizing "verbs" or "past tense" than the old method.

The Main Takeaway

The paper claims that analyzing the "vibrations" of the model's learning process (via the eNTK) is a powerful new flashlight.

It works on simple mathematical models where we know the answer.
It works on complex language models where we do not know the answer, and it finds grammar features better than current standard tools.
Its signal seems to strengthen fastest right around the time a model suddenly understands a concept (the "grokking" moment).

What the Paper Does Not Claim

It is important to stick to what the paper actually says:

It is not a cure-all: The paper admits these are "correlative" results. Just because the eNTK finds a direction that looks like "grammar" does not prove that changing that direction will fix the model. It is a discovery tool, not necessarily a control panel.
It is not a safety method yet: The paper mentions that this could be useful for AI safety in the future, but it does not present any safety applications. It is purely a method for understanding how models work now.
It is not perfect: The experiment with the language model used a relatively small dataset and a specific model. The author suggests testing this on larger models and datasets to be sure.

Summary in One Sentence

This paper proposes that by listening to the "echoes" of how a neural network learns (using a tool called eNTK), we can successfully identify the hidden "switches" the model uses to understand math and grammar, often finding them more clearly than previous methods.

Technical Summary: Feature Identification via the Empirical NTK

Problem Statement

Mechanistic interpretability aims to reconstruct how neural networks process information, with the specific goal of determining how models represent learned features. While earlier approaches often assume that individual neuron activations or sparse linear combinations thereof represent interpretable features, recent studies suggest these methods may yield incomplete or non-canonical dictionaries. Consequently, there is a need for fundamentally different approaches to identify feature directions in trained models without relying on prior assumptions about the specific nature of these features.

This work investigates whether the top eigen-directions of the empirical Neural Tangent Kernel (eNTK) can serve as a mechanism to uncover these learned features. The eNTK is defined as the kernel formed by contracting two copies of the model's Jacobian matrix along the parameter space direction:
$K_{ij}(x_1, x_2) = \sum_{\mu} \frac{\partial f_i(x_1)}{\partial W_\mu} \frac{\partial f_j(x_2)}{\partial W_\mu}$
where $f$ is the neural network, $W_\mu$ represents the weights, and $i, j$ index the output classes. The authors hypothesize that the top eigenspaces of this kernel, evaluated on a dataset, align with ground-truth or interpretable feature directions, even in models operating outside the "lazy" training regime to which standard NTK theory strictly applies.
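As a concrete illustration of the definition, the following minimal sketch builds the eNTK for a toy network $f(x) = W_2 \tanh(W_1 x)$ whose Jacobian can be written by hand. The architecture, the sizes, and the names `jacobian`, `J`, and `K` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 6, 3, 5, 4  # hypothetical sizes: samples, inputs, hidden units, classes
X = rng.standard_normal((N, D))
W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal((C, H)) / np.sqrt(H)

def jacobian(x):
    """Jacobian of f(x) = W2 @ tanh(W1 @ x) w.r.t. all weights, shape (C, P)."""
    h = np.tanh(W1 @ x)
    # df_i / dW1[k, l] = W2[i, k] * (1 - h[k]^2) * x[l]
    dW1 = np.einsum("ik,k,l->ikl", W2, 1.0 - h**2, x).reshape(C, -1)
    # df_i / dW2[j, k] = delta_ij * h[k]
    dW2 = np.einsum("ij,k->ijk", np.eye(C), h).reshape(C, -1)
    return np.concatenate([dW1, dW2], axis=1)

# Stack per-sample Jacobians, then contract over the parameter index mu.
J = np.stack([jacobian(x) for x in X])   # (N, C, P)
K = np.einsum("aip,bjp->abij", J, J)     # K[a, b, i, j] = K_ij(x_a, x_b)
```

The resulting tensor is symmetric under swapping both the sample pair and the class pair, and it is positive semidefinite once flattened, which is what makes an eigendecomposition meaningful. For real models the Jacobian would come from automatic differentiation rather than a hand-written formula.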

Methodology

The authors propose an algorithm to compute and analyze the top eigen-directions of the eNTK in three different settings: a 1-layer MLP, a 1-layer Transformer, and a pre-trained Large Language Model (Gemma-3-270M).

1. Kernel Construction and Reduction

The eNTK, evaluated over a dataset of size $N$ with $C$ output classes, has the shape $(N, N, C, C)$. To perform eigendecomposition, the authors apply three reduction strategies:

Class-specific eNTK: Analysis of the kernel $K_{cc}(x_1, x_2)$ for specific classes.
Flattened eNTK: Stacking the class-specific blocks into a single $NC \times NC$ matrix.
Layer-wise eNTK: Summing Jacobian products only over parameters belonging to a specific layer to attribute features to specific network components.
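The three reductions above can be sketched in a few lines, assuming the per-sample Jacobian is already available as an array of shape $(N, C, P)$. The random Jacobian, the layer names, and the layer sizes below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 3
layer_sizes = {"layer_1": 8, "layer_2": 6}  # hypothetical parameter counts per layer
P = sum(layer_sizes.values())
J = rng.standard_normal((N, C, P))          # stand-in for a real Jacobian

K = np.einsum("aip,bjp->abij", J, J)        # full eNTK, shape (N, N, C, C)

# Class-specific eNTK: the (N, N) block for one class c.
c = 1
K_class = K[:, :, c, c]

# Flattened eNTK: rows and columns indexed by (sample, class) pairs.
K_flat = K.transpose(0, 2, 1, 3).reshape(N * C, N * C)

# Layer-wise eNTK: contract only over the parameters of one layer.
K_layer = {}
start = 0
for name, size in layer_sizes.items():
    J_l = J[:, :, start:start + size]
    K_layer[name] = np.einsum("aip,bjp->abij", J_l, J_l)
    start += size

# Top eigen-directions of the flattened kernel (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(K_flat)
top_vals, top_vecs = eigvals[::-1][:4], eigvecs[:, ::-1][:, :4]
```

Because the contraction over parameters is a sum, the layer-wise kernels add up to the full kernel, so each layer's contribution can be inspected separately.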

2. Scalability via Lanczos Iteration

For large models where materializing the full Jacobian matrix or the eNTK is infeasible (e.g., language models with large vocabularies), the authors use Lanczos iteration. They approximate the top $k$ eigen-directions by running $2k$ Lanczos steps, each of which requires one kernel-vector product. Crucially, they compute $Kv = J(J^T v)$ using a vector-Jacobian product followed by a Jacobian-vector product via automatic differentiation, thereby avoiding the explicit construction of the Jacobian matrix or the eNTK.
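The matrix-free idea can be sketched as follows. This is a generic Lanczos routine with full reorthogonalization, not the authors' implementation. The synthetic Jacobian, its singular values, and the names `kernel_matvec` and `lanczos_top_eigs` are assumptions for illustration. A small explicit `J` stands in for the automatic-differentiation products.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, k = 60, 200, 4  # hypothetical: M = N*C kernel rows, P parameters, k directions

# Synthetic stand-in for a Jacobian with four dominant singular values.
U_left, _ = np.linalg.qr(rng.standard_normal((M, M)))
V_right, _ = np.linalg.qr(rng.standard_normal((P, M)))
sing = np.concatenate([[10.0, 8.0, 6.0, 4.0], np.linspace(0.5, 0.01, M - 4)])
J = (U_left * sing) @ V_right.T  # (M, P)

def kernel_matvec(v):
    """K v = J (J^T v) without forming K; with autodiff, a VJP then a JVP."""
    return J @ (J.T @ v)

def lanczos_top_eigs(matvec, dim, k, rng):
    """Approximate the top-k eigenpairs of a symmetric operator with 2k Lanczos steps."""
    m = min(2 * k, dim)
    Q = np.zeros((dim, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    for j in range(m):
        Q[:, j] = q
        w = matvec(q)
        alpha[j] = q @ w
        # Full reorthogonalization against all Lanczos vectors built so far.
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-12:  # Krylov space exhausted; stop early
                m = j + 1
                break
            q = w / beta[j]
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], Q[:, :m] @ evecs[:, order]

approx_vals, approx_vecs = lanczos_top_eigs(kernel_matvec, M, k, rng)
exact_vals = np.sort(sing**2)[::-1][:k]
```

The routine only ever calls `kernel_matvec`, so in a real setting memory stays proportional to the number of Lanczos vectors rather than to the size of the kernel. How well $2k$ steps approximate the top $k$ eigenpairs depends on how well separated the leading eigenvalues are; the synthetic spectrum here is deliberately well separated.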

3. Efficient Reconstruction for Language Models

For the Gemma-3-270M experiment, the vocabulary size ($d_{\text{vocab}}$) makes the flattened eNTK at the output layer computationally prohibitive. The authors leverage the linear relationship between the output layer's Jacobian matrix and that of the last hidden layer (via the unembedding matrix $U$). They derive a transformed operator $\tilde{K} = S^{1/2} K_r S^{1/2}$ (where $K_r$ is the eNTK on the residual stream), which possesses the same nonzero eigenvalues as the full output eNTK but operates in the smaller $d_{\text{model}}$ space. This enables the reconstruction of the top eNTK eigen-directions without materializing large, vocabulary-sized objects.
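The eigenvalue equivalence can be checked numerically on a small example. The summary does not define $S$, so the sketch below assumes that logits are obtained as $U h$ from the residual activation $h$ and takes $S = I_N \otimes U^\top U$, the Gram matrix of the unembedding repeated per sample. This choice, along with all sizes and variable names, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d_vocab, P = 5, 4, 50, 30               # hypothetical sizes
J_r = rng.standard_normal((N * d_model, P))         # residual-stream Jacobian (flattened)
U = rng.standard_normal((d_vocab, d_model))         # unembedding: logits = U @ h

K_r = J_r @ J_r.T                                   # residual eNTK, (N*d_model)^2 entries

# Reference computation that DOES materialize the vocabulary-sized kernel.
A = np.kron(np.eye(N), U)                           # (N*d_vocab, N*d_model)
K_out = A @ K_r @ A.T

# Assumed form of S and its symmetric square root.
S = np.kron(np.eye(N), U.T @ U)
w, V = np.linalg.eigh(S)
S_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
K_tilde = S_half @ K_r @ S_half                     # small operator, same size as K_r

rank = N * d_model
eig_out = np.linalg.eigvalsh(K_out)[::-1]           # descending
eig_tilde = np.linalg.eigvalsh(K_tilde)[::-1]
```

Under these assumptions the leading `rank` eigenvalues of `K_out` coincide with the eigenvalues of `K_tilde`, and the remaining eigenvalues of `K_out` are numerically zero. In practice `K_out` would never be built; it appears here only to verify the identity.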

4. Evaluation Metric

To validate the hypothesis, the authors measure the alignment between eNTK eigenspaces and independently specified "ground-truth" feature vectors.

Alignment Score: Calculated as the squared Frobenius norm of the overlap between the subspace spanned by the top $k$ eNTK eigenvectors and the subspace spanned by the ground-truth features.
Baseline Comparison: In the language model setting, the eNTK approach is compared against a Principal Component Analysis (PCA) baseline performed on model activations, using the same computational budget (top 25 directions).
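One common way to turn such a subspace comparison into a number is to orthonormalize both sets of vectors and take $\|Q_E^\top Q_F\|_F^2$, divided by the dimension of the feature subspace so that the score lies in $[0, 1]$. The normalization and the function name `subspace_overlap` are assumptions; the paper's exact convention may differ.

```python
import numpy as np

def subspace_overlap(E, F):
    """Normalized overlap ||Q_E^T Q_F||_F^2 / dim(F) between two column spans.

    Assumes both E and F have full column rank.
    """
    Q_E, _ = np.linalg.qr(E)
    Q_F, _ = np.linalg.qr(F)
    return np.linalg.norm(Q_E.T @ Q_F, "fro") ** 2 / Q_F.shape[1]

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((20, 20)))  # orthonormal columns
eigvecs = basis[:, :5]                                  # stand-in for top eNTK eigenvectors

same_span = eigvecs @ rng.standard_normal((5, 5))       # same subspace, different basis
orthogonal = basis[:, 5:10]                             # no shared directions
partial = basis[:, 3:8]                                 # shares 2 of 5 directions

score_same = subspace_overlap(eigvecs, same_span)
score_orth = subspace_overlap(eigvecs, orthogonal)
score_partial = subspace_overlap(eigvecs, partial)
```

The score depends only on the subspaces, not on the particular basis chosen, which matters because eigenvectors within a degenerate eigenspace are defined only up to rotation.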

Key Results

1. MLP on Modular Arithmetic

In a 1-layer MLP trained on modular addition (mod $p$) exhibiting "grokking" (a phase transition from memorization to generalization):

Spectral Structure: The eNTK spectrum shows two distinct "cliffs" (contiguous blocks of high eigenvalues).
Feature Alignment: The first cliff (size $4\lfloor p/2 \rfloor$) aligns perfectly with the Fourier features of the input variables ($a$ and $b$). The second cliff aligns with the "sum" and "difference" Fourier features (those of $a+b$ and $a-b$), which the model's second layer uses to implement the ground-truth algorithm.
Training Dynamics: The alignment of the second cliff with sum/difference modes is low at initialization but increases smoothly, with the first derivative of the overlap reaching its maximum near the onset of the grokking phase transition.
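To make the feature count concrete, the sketch below constructs cosine and sine features at frequencies $\omega_k = 2\pi k/p$ for $k = 1, \dots, \lfloor p/2 \rfloor$ over all $p^2$ input pairs. Taking both functions for both $a$ and $b$ gives $4\lfloor p/2 \rfloor$ input features, matching the size of the first cliff. The choice $p = 7$ and the analogous construction of the sum and difference block are assumptions for illustration.

```python
import numpy as np

p = 7  # small odd prime chosen for illustration
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
a, b = a.ravel(), b.ravel()                       # all p*p input pairs
freqs = 2 * np.pi * np.arange(1, p // 2 + 1) / p  # omega_k for k = 1..floor(p/2)

def fourier_block(t):
    """cos and sin of omega_k * t for every frequency, shape (p*p, 2*floor(p/2))."""
    phase = np.outer(t, freqs)
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1)

# Features of the individual inputs a and b: 4 * floor(p/2) columns.
input_feats = np.concatenate([fourier_block(a), fourier_block(b)], axis=1)

# "Sum" and "difference" features of a+b and a-b.
sumdiff_feats = np.concatenate([fourier_block(a + b), fourier_block(a - b)], axis=1)

# Gram matrix of all features over the full dataset.
all_feats = np.concatenate([input_feats, sumdiff_feats], axis=1)
gram = all_feats.T @ all_feats
```

Over the full grid of inputs these features are mutually orthogonal with equal norm, so the two blocks span clean, non-overlapping subspaces against which eigenspaces can be compared.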

2. Transformer on Modular Arithmetic

In a 1-layer Transformer trained on the same task:

Sparse Frequencies: In contrast to the MLP, the Transformer learns Fourier modes at a sparse set of random, seed-dependent frequencies.
Layer-wise Alignment: The top layer-wise eNTK eigen-directions align with Fourier features at these specific key frequencies.
- The attention block and the MLP input weights align with the sum of input Fourier features ($\cos(\omega_k a) + \cos(\omega_k b)$).
- The MLP output and the unembedding weights align with the "sum" Fourier features ($\cos(\omega_k(a+b))$).
Dynamics: As with the MLP, alignment with the sum modes increases during training, and its rate of increase peaks near the grokking transition.

3. Gemma-3-270M on Natural Language

In the pre-trained Gemma-3-270M model, evaluated on a dataset of TinyStories context windows:

Grammar Reconstruction: Top eNTK eigen-directions were tested against automatically generated grammatical features (parts of speech and morphological tags such as tense and number).
Performance: The eNTK eigen-directions outperformed the PCA baseline on model activations for all parts-of-speech features and all but one morphological feature, measured by AUROC.
Interpretability: A qualitative analysis of the most activating examples for specific eigen-directions (e.g., "infinitive" or "past tense verb") revealed coherent semantic interpretations consistent with the target grammatical features.
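The AUROC comparison can be illustrated with a small scoring routine. The sketch below scores only a PCA baseline on synthetic activations with a planted binary feature; it does not reproduce the eNTK side of the comparison. The way directions are scored (best AUROC over the top directions, ignoring sign) and all names and data are assumptions, not the paper's protocol.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def best_direction_auroc(projections, labels):
    """Best AUROC over candidate directions; a direction's sign is arbitrary."""
    aucs = np.array([auroc(projections[:, i], labels)
                     for i in range(projections.shape[1])])
    return np.max(np.maximum(aucs, 1.0 - aucs))

rng = np.random.default_rng(0)
n, d, k = 200, 16, 3                      # hypothetical: tokens, width, directions kept
labels = rng.integers(0, 2, size=n)       # synthetic binary tag, e.g. "is a noun"
acts = rng.standard_normal((n, d))
acts[:, 0] += 3.0 * labels                # plant the feature along coordinate 0

# PCA baseline: top-k right singular vectors of the centered activations.
centered = acts - acts.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pca_proj = centered @ Vt[:k].T
pca_score = best_direction_auroc(pca_proj, labels)
```

The same `best_direction_auroc` call could be applied to projections onto any other set of candidate directions, which is what makes a like-for-like comparison at a fixed number of directions possible.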

Significance and Claims

The work claims that eNTK eigen-analysis offers a new, theoretically motivated, and empirically validated approach to identifying features in trained models.

Beyond the Lazy Regime: The work demonstrates that eNTK spectral structures remain informative and align with ground-truth mechanisms even in models trained outside the "lazy" regime (in which parameter drift is negligible), that is, in a setting where standard NTK theory does not strictly apply.
Superiority over Activation PCA: In the context of the language model, the eNTK approach more successfully recovers grammatical features than PCA on activations, suggesting that the kernel structure captures feature information that raw activations (even when reduced via PCA) may obscure.
Dynamic Monitoring: The observation that the alignment of eNTK subspaces with features evolves during training—particularly with a peak rate of change near grokking—suggests that eNTK eigen-analysis could serve as a diagnostic tool to monitor when specific features are acquired during training.
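A monitoring signal of this kind amounts to tracking the overlap at saved checkpoints and locating where its finite-difference derivative is largest. The sketch below uses a synthetic sigmoid-shaped overlap curve purely as a placeholder; the curve, the checkpoint spacing, and the variable names are assumptions and do not represent data from the paper.

```python
import numpy as np

# Hypothetical checkpoints every 10 steps and a synthetic overlap curve.
steps = np.arange(0, 1001, 10)
overlap = 1.0 / (1.0 + np.exp(-(steps - 500) / 40.0))

# Finite-difference estimate of d(overlap)/d(step) at each checkpoint.
rate = np.gradient(overlap, steps)

# Checkpoint where the overlap is rising fastest.
peak_step = steps[np.argmax(rate)]
```

With real training logs, `overlap` would be replaced by alignment scores measured at each checkpoint, and `peak_step` could then be compared with the step at which test accuracy begins to rise.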

The authors maintain a modest stance, noting that their results are currently correlational. They have not yet demonstrated that eNTK-inspired interventions causally alter model behavior, and they acknowledge limitations regarding the scale of the language model experiment (Gemma-3-270M is smaller than state-of-the-art models) and the simplicity of the dataset (TinyStories). Nevertheless, the consistency of results across synthetic algorithmic tasks and natural language points to robust potential for eNTK-based mechanistic interpretability.
