Dissecting Jet-Tagger Through Mechanistic… — Plain-Language Explanation

The Big Picture: Cracking the Black Box

Imagine you have a super-smart robot that can look at a messy spray of particles (called a "jet") from a particle collider and instantly tell you if it came from a heavy Top Quark (a signal) or a common QCD jet (background noise). This robot is incredibly accurate, but until now, no one knew how it made that decision. It was a "black box."

This paper is like taking that robot apart to see exactly which gears and wires are doing the work. The authors didn't just guess; they used a special toolkit called Mechanistic Interpretability to reverse-engineer the robot's brain. They found that the robot isn't using its whole brain to make the decision; it's actually relying on a very small, specific team of six "neurons" (called attention heads) to do almost all the heavy lifting.

The Cast of Characters: The Jet and the Robot

The Jet: Think of a Top Quark jet like a firework that exploded in mid-air. The top quark first splits into a heavy "W" particle and a "b" quark, and the W then splits into two more quarks, leaving three main pieces. A background jet is more like a random sparkler. The robot's job is to spot the firework pattern.
The Robot (Particle Transformer): This is a type of AI that looks at every single particle in the spray and how they relate to one another. It has layers of "thinking" (attention heads) that pass information down a line.

The Discovery: The "Six-Person Dream Team"

The authors found that out of the robot's 16 total "thinking heads," only six are actually responsible for the Top Quark detection. If you turned off the other ten, the robot would still work almost perfectly (97.3% as good as before).

They mapped out exactly how these six heads talk to each other, creating a "circuit" that looks like a relay race:

The Scout (Primary Source): One head at the very beginning acts as the scout. It doesn't look for the big explosion directly. Instead, it looks at the "background noise" (soft, quiet particles) to set the stage. It's like a security guard checking the perimeter so the rest of the team knows what "normal" looks like.
The Second Scout (Secondary Source): Another head in the same first layer works in parallel with the Scout, but it watches the hard, energetic pairs of particles instead of the quiet ones.
The Relays (Middle Team): Three heads in the middle act as messengers. They take the scouts' context and look specifically for energetic pairs of particles with a large combined mass. In a Top Quark jet, the heavy "W" particle decays into two quarks. These relay heads are like detectives zooming in on those two partners, ignoring the rest of the mess.
The Reader (Readout): One head at the end gathers all the reports from the relays and says, "Okay, I see the pattern. This is a Top Quark."

The Analogy: Imagine a courtroom.

The Scout is the bailiff who clears the room of distractions.
The Relays are the lawyers who specifically point out the two heavy, smoking guns (the W-decay particles).
The Reader is the judge who, after hearing the evidence, bangs the gavel and declares the verdict.

The "Aha!" Moment: It's Not What You Think

The authors found two surprising things about how the robot thinks:

1. The "Basis Rotation" (The Translator)
At first, the robot seemed to make its decision all at once at the very end. But the authors realized the robot actually figured out the answer way earlier (in the first layer of thinking). The final step wasn't "finding" the answer; it was just translating the answer into a language the final judge could understand.

Analogy: Imagine you solve a math problem in your head (the early layers). You know the answer is "42." But your final answer sheet only accepts Roman numerals. The last step isn't doing the math; it's just writing "XLII" instead of "42." The robot was just translating its own thoughts.

2. The "Cheat Code" (2-Prong vs. 3-Prong)
The job was to find a 3-part explosion (Top Quark). However, the robot discovered a shortcut. It realized that the most telling part of the explosion is the heavy "W" particle, which on its own is a 2-part explosion.

Analogy: Imagine you are trying to identify a specific type of car by its 4 wheels. Instead of checking all 4 wheels, the robot realized that if you find the two heavy rear wheels, you can be almost sure it's that car. It ignored the full 3-part complexity and focused on the easier 2-part pattern (the W-decay) because that was the most reliable clue. The robot "invented" a simpler strategy to solve a complex problem.

The Toolkit: How They Did It

To find these secrets, the authors used a few clever tricks:

Zero Ablation: They "turned off" specific heads to see if the robot crashed. If it kept working, that head wasn't important.
Path Patching: They took a "clean" jet and a "corrupted" jet, and swapped parts of the brain's thinking between them to see which parts carried the crucial information.
Linear Probing: They asked the robot, "What physical things do you see?" and found it was looking for specific energy patterns (Energy Correlators) rather than the standard textbook measurements (N-subjettiness).

The Conclusion

This paper shows that we can take a complex, "black box" AI used in high-energy physics and trace much of how it works. The robot didn't just memorize the data; it learned a logical, step-by-step physical strategy:

Check the background.
Find the heavy 2-part pair (the W-boson).
Translate that finding into a final decision.

It turns out that even without being explicitly taught physics, the AI rediscovered a smart, efficient way to spot Top Quarks by focusing on the most obvious physical clue.

Technical Summary: Dissecting Jet-Tagger Through Mechanistic Interpretability

Problem Statement
High Energy Physics (HEP) relies heavily on "Jet Tagging," the task of identifying the primary origin of particle cascades (jets), such as distinguishing boosted top quarks from background QCD jets. While deep learning architectures like the Particle Transformer (ParT) have achieved state-of-the-art performance, surpassing hand-crafted observables, their internal decision-making processes remain opaque. Existing interpretability efforts in collider physics have largely focused on post-hoc attribution (e.g., Shapley values, saliency maps) or interpretable-by-construction architectures. However, these methods do not reverse-engineer the specific causal circuits within a trained black-box model or characterize the algorithmic steps the network implements. This paper addresses the gap by applying the full toolkit of mechanistic interpretability—originally developed for natural language models—to a jet physics classifier.

Methodology
The authors trained a small-scale Particle Transformer (4 particle attention layers, 4 heads per layer, ~1.3M parameters) on a subset of the Top Quark Tagging reference dataset (signal: $t \to Wb \to q\bar{q}b$; background: light quarks/gluons). The analysis employs a suite of intervention and probing techniques:

Zero Ablation: Systematically setting attention head scaling weights to zero to measure structural importance via performance drop.
Path Patching: A causal intervention method where the activation of a specific head on a "clean" input is transplanted into a "corrupted" input (using within-batch particle replacement or jet permutation). This isolates direct effects and path effects (information flow) between heads.
Logit Lens & Per-Layer Probes: The Logit Lens projects intermediate representations through the final trained head to estimate class information. To resolve basis misalignment, the authors trained separate logistic regression probes on each layer's representation.
Linear Probing: Training linear models (Ridge/Logistic regression) on the residual stream to predict classical jet substructure observables (e.g., $N$-subjettiness, Energy Correlators).
Feature Ablation: Zeroing specific pairwise input features to the MLP to determine causal dependencies.
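As an illustration of the zero-ablation idea only, the sketch below uses a toy linear "model" in which each hypothetical head contributes one feature multiplied by a scaling weight. The data, the `toy_model` function, and the four-head layout are assumptions made for this example and are not the authors' Particle Transformer or code.

```python
import random

def auc(scores, labels):
    """Probability that a random signal jet outscores a random background jet."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def toy_model(x, head_scales):
    # Each hypothetical head reads one input feature; its output is multiplied
    # by a scaling weight, so a scale of 0.0 removes that head's contribution.
    return sum(s * xi for s, xi in zip(head_scales, x))

random.seed(0)
labels = [i % 2 for i in range(400)]
# Features 0 and 1 carry class signal; features 2 and 3 are pure noise.
jets = [[random.gauss(2.0 * y, 1.0), random.gauss(1.0 * y, 1.0),
         random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for y in labels]

n_heads = 4
full_scales = [1.0] * n_heads
baseline = auc([toy_model(x, full_scales) for x in jets], labels)

auc_drop = {}
for h in range(n_heads):
    scales = list(full_scales)
    scales[h] = 0.0  # zero-ablate head h
    ablated = auc([toy_model(x, scales) for x in jets], labels)
    auc_drop[h] = baseline - ablated

for h, drop in auc_drop.items():
    print(f"head {h}: AUC drop = {drop:+.3f}")
```

In this toy setting the heads that read informative features show a clear AUC drop when ablated, while ablating a noise-only head changes little. In a real transformer the downstream layers also react to the missing head, so the drops need not be additive.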

Key Contributions and Results

Identification of a Sparse Six-Head Circuit:
The authors identified a minimal circuit of six attention heads that recovers 97.3% of the full model's Area Under the Curve (AUC). This circuit significantly outperforms randomly sampled six-head subsets (96th percentile of the random baseline). The circuit exhibits a clear source-relay-readout structure:
- Primary Source ($L0H1$): A single head in the first layer that acts as the primary causal driver. It attends to soft and collinear emissions (negative correlations with $\ln k_T$, $\ln \Delta$, $\ln m^2$) and provides a class-agnostic contextualization.
- Secondary Source ($L0H2$): A parallel head in the first layer that attends to hard, high-invariant-mass pairs.
- Relay Heads ($L1H0$, $L1H1$, $L1H3$): A cluster of heads in the second layer that selectively attend to high-invariant-mass, high-$k_T$ particle pairs (characteristic of the hadronic $W$-boson decay). These heads are conditionally dependent on the upstream source heads.
- Readout Head ($L3H3$): A single head in the final particle layer that aggregates the relayed signals.
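The comparison against randomly sampled six-head subsets can be pictured with the sketch below. The head labels come from the summary above, but the `contribution` table is invented for illustration; a real analysis would re-evaluate the tagger's AUC with only the chosen heads active rather than summing made-up numbers.

```python
import random
from itertools import product

random.seed(0)
all_heads = list(product(range(4), range(4)))  # (layer, head) for 4 layers x 4 heads
circuit = [(0, 1), (0, 2), (1, 0), (1, 1), (1, 3), (3, 3)]

# Hypothetical per-head contributions (assumption for this toy example only).
contribution = {h: random.uniform(0.0, 0.02) for h in all_heads}
for h in circuit:
    contribution[h] += 0.1  # the toy is deliberately rigged in the circuit's favour

def subset_score(heads):
    # Stand-in for "performance recovered when only these heads are active".
    return sum(contribution[h] for h in heads)

# Score many random six-head subsets and rank the circuit among them.
random_scores = [subset_score(random.sample(all_heads, 6)) for _ in range(2000)]
circuit_score = subset_score(circuit)
percentile = 100.0 * sum(s < circuit_score for s in random_scores) / len(random_scores)
print(f"circuit percentile among random subsets: {percentile:.1f}")
```

Because the toy contributions are rigged, the percentile here says nothing about the paper's reported 96th percentile; the sketch only shows how such a ranking is formed.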
Causal Structure and Robustness:
Path patching revealed that the primary source head ($L0H1$) alone recovers 88.6% of the model's performance. The sign patterns of direct effects were robust across two on-manifold corruption strategies (particle replacement and jet permutation). The authors noted that off-manifold (Gaussian) corruption strategies are structurally incompatible with standard recovery-score formulations on this kinematically narrow dataset, as they drive the model into a fixed, strongly negative logit regime.
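A recovery score of this kind is commonly written as (patched - corrupted) / (clean - corrupted). The sketch below applies that formula to a simplified form of patching (transplanting one head's clean output into a corrupted run) on a two-head toy model. The `head_outputs` function and its weights are hypothetical, and the exact formulation used by the authors is not given in this summary.

```python
def head_outputs(x):
    # Hypothetical two-head toy: head 0 reads feature 0 strongly,
    # head 1 reads feature 1 weakly.
    return [3.0 * x[0], 0.5 * x[1]]

def logit_from_heads(outs):
    # In this toy the logit is just the sum of head outputs.
    return sum(outs)

def recovery_score(clean_x, corrupt_x, head):
    """Fraction of the clean-vs-corrupted logit gap restored by patching one head."""
    clean_outs = head_outputs(clean_x)
    corrupt_outs = head_outputs(corrupt_x)
    patched_outs = list(corrupt_outs)
    patched_outs[head] = clean_outs[head]  # transplant the clean activation
    clean_logit = logit_from_heads(clean_outs)
    corrupt_logit = logit_from_heads(corrupt_outs)
    patched_logit = logit_from_heads(patched_outs)
    return (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit)

clean_jet = [1.0, 1.0]
corrupt_jet = [0.0, 0.0]
for head in (0, 1):
    print(f"head {head}: recovery = {recovery_score(clean_jet, corrupt_jet, head):.3f}")
```

The denominator is the gap between the clean and corrupted logits, so the score becomes unstable when a corruption scheme pushes every input to nearly the same logit. That is one way to read the incompatibility with off-manifold corruption noted above.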
Resolution of the "Commitment" Point:
Standard Logit Lens analysis suggested the model commits to a classification decision only at the first class attention block ($Cls0$), showing a dramatic jump in AUC from 0.111 to 0.973. However, per-layer trained probes revealed that linearly accessible class information saturates at AUC $\approx 0.97$ as early as the first particle attention layer ($L1$). The authors conclude that the class attention block performs a basis rotation, reorienting the latent signal from the particle attention basis to the classification head's basis, rather than computing new information.
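The gap between a fixed readout and a trained probe can be reproduced in a two-dimensional toy, sketched below. The representation, the readout direction `w_final`, and the rotation `R` are all invented for illustration. The point is only that a fixed readout can score below chance on a representation that a freshly trained linear probe separates well.

```python
import math
import random

def auc(scores, labels):
    """Probability that a random signal example outscores a random background one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(X, y, lr=0.5, steps=200):
    """Fit a 2-feature logistic-regression probe by full-batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, t in zip(X, y):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - t
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[i] - lr * gw[i] / len(X) for i in range(2)]
        b -= lr * gb / len(X)
    return w, b

random.seed(1)
labels = [i % 2 for i in range(600)]
# Toy "early layer" representation: the class signal lies along axis 0.
early = [[random.gauss(2.0 * y, 1.0), random.gauss(0.0, 1.0)] for y in labels]

# Fixed "final head" direction, deliberately misaligned with axis 0.
w_final = [-0.6, 0.8]
# Rotation whose first column is w_final, so it maps axis 0 onto the readout direction.
R = [[-0.6, -0.8], [0.8, -0.6]]
late = [[R[0][0] * h[0] + R[0][1] * h[1], R[1][0] * h[0] + R[1][1] * h[1]] for h in early]

def readout(h):
    return w_final[0] * h[0] + w_final[1] * h[1]

lens_early_auc = auc([readout(h) for h in early], labels)  # logit-lens style
lens_late_auc = auc([readout(h) for h in late], labels)    # after the rotation

# Train the probe on the first half, evaluate on the held-out second half.
w, b = train_probe(early[:300], labels[:300])
probe_auc = auc([w[0] * h[0] + w[1] * h[1] + b for h in early[300:]], labels[300:])

print(f"fixed readout, early: {lens_early_auc:.3f}")
print(f"fixed readout, late:  {lens_late_auc:.3f}")
print(f"trained probe, early: {probe_auc:.3f}")
```

The rotation adds no class information, since it is invertible and label-independent. It only moves the existing signal into the direction the fixed readout uses, which mirrors the basis-rotation reading of the class attention block.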
Physical Content of Representations:
Linear probing of the residual stream against classical observables revealed two critical findings:
1. Energy Correlator Preference: The model preferentially encodes the Energy Correlator basis over the $N$-subjettiness basis, even after residualizing against jet mass.
2. Implicit Factorization: Despite being trained on a 3-prong top-tagging task, the model encodes 2-prong observables (specifically $D^{(\beta=1)}_2$, targeting the hadronic $W$ decay) more strongly than the formally correct 3-prong observables ($N^{(\beta=1)}_3$). The model implicitly factorizes the 3-prong problem into the more accessible sub-problem of identifying the heavy $W$-boson substructure.
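For readers unfamiliar with the 2-prong observable, the sketch below computes $D_2^{(\beta)} = e_3^{(\beta)} / (e_2^{(\beta)})^3$ from the standard energy correlation functions, using the scalar sum of constituent $p_T$ as the jet $p_T$ (a simplifying assumption). The two toy jets are hand-made configurations, not dataset events, and this is not the authors' probing code.

```python
import math
from itertools import combinations

def delta_r(a, b):
    """Angular distance between two particles given as (pt, rapidity, phi)."""
    dphi = (a[2] - b[2] + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return math.hypot(a[1] - b[1], dphi)

def ecf(particles, n, beta=1.0):
    """Normalized n-point energy correlation function e_n^(beta)."""
    pt_jet = sum(p[0] for p in particles)  # assumption: jet pT as scalar pT sum
    total = 0.0
    for group in combinations(particles, n):
        pt_prod = math.prod(p[0] for p in group)
        angle_prod = math.prod(delta_r(a, b) for a, b in combinations(group, 2))
        total += pt_prod * angle_prod ** beta
    return total / pt_jet ** n

def d2(particles, beta=1.0):
    """D2 ratio: small values indicate a two-prong-like energy pattern."""
    return ecf(particles, 3, beta) / ecf(particles, 2, beta) ** 3

# Toy jets as lists of (pt, rapidity, phi); values are illustrative only.
two_prong = [(100.0, 0.0, 0.0), (100.0, 0.6, 0.0), (1.0, 0.3, 0.3)]
three_prong = [(70.0, 0.0, 0.0), (70.0, 0.6, 0.0), (70.0, 0.3, 0.5)]

print(f"D2 (two hard prongs + soft particle): {d2(two_prong):.3f}")
print(f"D2 (three comparable prongs):         {d2(three_prong):.3f}")
```

A linear probe in the sense described above would regress a quantity like this on the residual stream. The toy jet with two hard prongs yields a much smaller $D_2$ than the toy jet with three comparable prongs.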

Significance and Claims
The paper claims that mechanistic interpretability methods developed for natural language models can be successfully transferred to jet physics classifiers. The study demonstrates that gradient descent, without explicit supervision or architectural constraints, can rediscover physically meaningful structures:

The network implicitly organizes its computation around the hadronic $W$-boson decay (a 2-prong substructure) rather than the full 3-prong topology.
The identified source-relay-readout circuit appears to be a generic algorithmic pattern for physics classification, not merely an artifact of language model training.
The work provides a "clean source-relay-readout interpretation" of a complex deep learning model, moving beyond attribution to a causal understanding of how the model computes its decision.

The authors remain modest regarding generalizability, noting that the specific six-head circuit was identified in a small model and that larger models may possess richer structures. They also highlight limitations regarding the linear accessibility of information (non-linear encodings may be under-reported) and the reliance on a single training seed for detailed causal analysis.
