Here is an explanation of the paper "Distilled Circuits," broken down into simple concepts with creative analogies.
The Big Picture: The "Master Chef" and the "Apprentice"
Imagine you have a Master Chef (the "Teacher" model) who is a genius. They can cook a perfect steak, bake a complex cake, and make a gourmet soup. They have a massive kitchen, 12 different ovens, and a team of 12 specialized sous-chefs (these are the model's layers and attention heads).
Now, imagine you want to open a small food truck. You can't afford the Master Chef's massive kitchen or their huge team. So, you hire a talented Apprentice (the "Student" model). You teach the Apprentice by showing them the Master Chef's final dishes and saying, "Make it taste exactly like this."
This process is called Knowledge Distillation. It's how we shrink giant, slow AI models into smaller, faster ones that can run on phones or laptops.
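The core of knowledge distillation can be sketched in a few lines: the student is trained to match the teacher's *softened* output distribution, not just the final answer. Below is a minimal, self-contained sketch (not the paper's code; the temperature value and logits are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature (higher T = softer distribution),
    # then normalize into probabilities (max-subtraction for stability).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions:
    # the "make it taste exactly like this" objective.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
# A student that matches the teacher exactly incurs zero loss...
print(distillation_loss(teacher, teacher))        # 0.0
# ...while a mismatched student is penalized.
print(distillation_loss(teacher, [0.1, 1.0, 2.0]) > 0)  # True
```

Note that the loss only constrains the student's *outputs*; nothing in this objective forces the student's internal wiring to resemble the teacher's, which is exactly the gap the paper investigates.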
The Problem:
We know the Apprentice can often make a dish that tastes just like the Master's. But how does the Apprentice do it?
- Does the Apprentice use the same 12 ovens and 12 sous-chefs?
- Or did they figure out a shortcut? Maybe they just use one oven and one sous-chef, but they work twice as hard?
- What if the Master's recipe relies on a specific, delicate technique that the Apprentice skipped? If the ingredients change slightly (a "distribution shift"), will the Apprentice's dish fall apart?
This paper investigates exactly that. The authors didn't just taste the food; they went into the kitchen and watched the Apprentice cook to see what was happening under the hood.
The Kitchen Tour: What They Found
The researchers used a special set of "X-ray glasses" (called Mechanistic Interpretability) to see the internal wiring of the AI. They looked at three specific kitchens:
- GPT-2 (Master) vs. DistilGPT-2 (Apprentice)
- BERT (Master) vs. DistilBERT (Apprentice)
- Llama (Master) vs. Minitron (Apprentice)
Here are their three main discoveries:
1. The "One-Man Band" Effect (Compression)
Think of the Master Chef's kitchen as an orchestra: twelve musicians, each playing one part of the song. The Apprentice realizes they don't need twelve instruments. They skip the orchestra and rely on one incredibly talented musician who plays all the parts at once.
- The Finding: The student models often compress multiple functions into a single component. Instead of having three different "heads" (sous-chefs) to detect numbers, the student might use just one head to do the job of all three.
- The Risk: This is efficient, but it's fragile. If that one "super-sous-chef" gets sick (or if you remove them in an experiment), the whole kitchen shuts down. The Master Chef has backups; the Apprentice does not.
2. The "Forgotten Skill" (Discarding)
Sometimes, the Master Chef has a weird habit. Maybe they always hum a specific tune while chopping onions. It doesn't hurt the food, but it's part of their process.
- The Finding: The Apprentice often discards these "non-essential" habits. In the paper, they found that the Master models had a specific mechanism to detect "similar members" (like noticing that the number '4' appeared twice in a list). The Apprentice models simply deleted this feature entirely.
- The Result: The food still tastes good, but the internal logic is different. The Apprentice took a shortcut that the Master didn't take.
3. The "Brittle Foundation" (Robustness)
Because the Apprentice is relying on fewer, more heavily used components, their kitchen is brittle.
- The Experiment: The researchers tried to "break" the kitchen by temporarily removing one sous-chef (an attention head) or one oven (an MLP layer).
- The Result:
- Master Chef: "Oh, we lost a sous-chef? No problem, the others will pick up the slack." (Performance drops slightly).
- Apprentice: "Oh no! We lost our only sous-chef! The kitchen is on fire!" (Performance crashes completely).
- The Takeaway: Distilled models are great at doing exactly what they were trained to do, but they are much more likely to fail if the situation changes slightly.
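The ablation experiment above can be illustrated with a toy sketch. This is not the paper's code: it just models a "teacher" that spreads a computation across redundant components versus a "student" that concentrates it in one, and shows what zeroing out a single component does to each (the weights are made up for illustration):

```python
def model_output(component_weights, ablate=None):
    # Each component contributes its weight to the output;
    # ablating a component zeroes its contribution, mimicking
    # the removal of one attention head or MLP layer.
    return sum(w for i, w in enumerate(component_weights) if i != ablate)

teacher = [0.25, 0.25, 0.25, 0.25]  # four redundant "sous-chefs"
student = [1.0, 0.0, 0.0, 0.0]      # one "super-sous-chef" does it all

print(model_output(teacher))            # 1.0  - intact
print(model_output(teacher, ablate=0))  # 0.75 - graceful degradation
print(model_output(student, ablate=0))  # 0.0  - total failure
```

Both models produce identical outputs when intact; the difference only appears under damage, which is why output-level evaluation alone misses the fragility.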
The New Tool: The "Alignment Score"
The authors realized that just looking at the final dish (the output) isn't enough. You can have two cakes that look identical, but one is made with flour and the other with sawdust: they might even taste the same, but only one is safe to eat.
They created a new metric called the Alignment Score.
- How it works: Instead of just asking "Did you get the right answer?", it asks, "Did you use the same brain pathways to get there?"
- The Analogy: Imagine two students taking a math test.
- Student A solves the problem using the same steps as the teacher.
- Student B guesses the answer correctly by luck or a weird trick.
- Both get an 'A'. But the Alignment Score would give Student A a 10/10 and Student B a 2/10, because Student B didn't actually learn the mechanism of the math.
The paper shows that high performance (getting the right answer) does not guarantee high alignment (using the right internal logic).
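One simple way a metric like this could be operationalized (the paper's exact formula may differ) is to score how important each internal component is to the teacher and to the student, then measure how similar those importance profiles are, e.g. with cosine similarity. The attribution values below are invented for illustration:

```python
import math

def alignment_score(teacher_attrib, student_attrib):
    # Cosine similarity between per-component importance scores:
    # 1.0 means the two models lean on their components in the
    # same proportions; near 0 means they use different pathways.
    dot = sum(t * s for t, s in zip(teacher_attrib, student_attrib))
    norm_t = math.sqrt(sum(t * t for t in teacher_attrib))
    norm_s = math.sqrt(sum(s * s for s in student_attrib))
    return dot / (norm_t * norm_s)

teacher = [0.4, 0.3, 0.2, 0.1]          # importance per component
aligned_student = [0.4, 0.3, 0.2, 0.1]  # same pathways ("Student A")
shortcut_student = [0.0, 0.0, 0.0, 1.0] # right answers, different circuit

print(round(alignment_score(teacher, aligned_student), 3))   # 1.0
print(round(alignment_score(teacher, shortcut_student), 3))  # 0.183
```

Both hypothetical students could score identically on the test set; only the mechanism-level comparison separates them.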
Why Does This Matter?
This research is a warning label for the future of AI.
- Efficiency vs. Safety: We love small, fast AI models because they save money and energy. But this paper shows that in making them smaller, we might be making them brittle. They might work perfectly in a controlled lab but fail catastrophically in the real world where things are messy.
- The "Black Box" Problem: We often treat AI as a magic box. This paper opens the box and says, "Look, the magic is happening, but the gears inside are completely different from the original."
- Better Selection: If you are a company trying to pick an AI model, don't just look at the test scores. Use this new "Alignment Score" to check if the model is actually thinking like the expert, or if it's just memorizing shortcuts.
Summary in One Sentence
Knowledge distillation creates smaller, faster AI models that can mimic the results of giant models, but they often do so by reorganizing their internal "gears" into a fragile, single-point-of-failure system that is less robust to change.