How to Deep-Learn the Theory behind Quark-Gluon Tagging

This paper enhances the explainability of quark-gluon jet taggers by identifying key latent features, evaluating the limitations of Shapley values for correlated inputs, and employing symbolic regression to derive compact formulas that approximate the tagger's output.

Original authors: Sophia Vent, Ramon Winterhalder, Tilman Plehn

Published 2026-03-18

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective working a massive, chaotic party. Your job is to tell apart two types of guests: Quarks and Gluons.

In the world of particle physics, these are the building blocks of matter. When they zoom through a detector, they leave behind a trail of debris called a "jet."

  • Quark jets are like a tight-knit group of friends walking together; they are fewer in number and stay close to the center.
  • Gluon jets are like a rowdy crowd at a mosh pit; they are more numerous, spread out, and chaotic because they have a stronger "charge" that makes them radiate energy more aggressively.

For years, physicists have used complex computer programs (Machine Learning) to sort these guests. These programs are incredibly good at it, but they are "black boxes." They give you the answer ("That's a quark!"), but they don't tell you why. It's like a genius chef who makes a perfect dish but refuses to share the recipe.

This paper is about cracking the recipe. The authors want to open the black box, see what the computer is actually looking at, and translate its complex math into simple, human-readable formulas.

Here is how they did it, broken down into four simple steps:

1. The "Compression" Trick (Finding the Core Clues)

The computer starts with a massive list of data for every particle in the jet (its energy, angle, type, etc.). It's like having a 64-page dossier on every guest. The authors asked: "Do we really need all 64 pages?"

They used a technique called PCA (Principal Component Analysis), which is like a super-smart editor. It reads all 64 pages and says, "Actually, you only need three main points to understand the story."

  • Clue #1 (The Crowd Size): How many particles are there? (Gluons have more).
  • Clue #2 (The Spread): How wide is the mess? (Gluons are wider).
  • Clue #3 (The Mix): How are the energies shared? (Quarks have a "harder" core).

They found that the computer had naturally learned exactly these three things, even though no one told it to. It rediscovered the physics rules on its own!
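To make the "super-smart editor" analogy concrete, here is a toy sketch of that compression step. Everything in it is invented for illustration — the fake 64-dimensional feature space, the three hidden "clues", the noise level — but it shows how PCA can reveal that only a few directions carry the story:

```python
# Toy PCA sketch (not the paper's pipeline): 64 observed features
# that are secretly driven by only 3 hidden "clues".
import numpy as np

rng = np.random.default_rng(0)

# Pretend each jet has 64 latent features, but only 3 hidden clues
# (multiplicity, width, energy sharing) actually drive them.
n_jets, n_latent, n_clues = 500, 64, 3
clues = rng.normal(size=(n_jets, n_clues))      # hidden physics
mixing = rng.normal(size=(n_clues, n_latent))   # how clues show up in features
latents = clues @ mixing + 0.05 * rng.normal(size=(n_jets, n_latent))

# PCA: centre the data, then diagonalise via SVD.
centered = latents - latents.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component

# The first 3 components carry almost all the variance, mirroring
# the paper's finding that a handful of directions suffice.
print(f"variance in first 3 PCs: {explained[:3].sum():.3f}")
```

The point of the toy: even though the "dossier" has 64 pages, the variance spectrum collapses after three components, which is how the authors spotted that the tagger's latent space was secretly low-dimensional.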

2. The "Blindfolded" Problem (The SHAP Test)

Next, they tried to use a tool called SHAP to see which clues were most important. SHAP is like a game where you remove one clue at a time to see how much the computer's confidence drops.

However, they hit a snag. Because the clues are related (e.g., a wider jet usually has more particles), the tool got confused. It was like judging how much the salt matters by tasting the soup without it, forgetting that the chef always adjusts the salt and the broth together — you never truly see one ingredient change in isolation. The tool gave misleading results, blaming the wrong ingredients.

The Fix: They realized they needed to "de-correlate" the clues first. They created new, independent variables (like measuring the "shape" of the jet without counting the particles). Once they did this, the SHAP tool finally gave a clear, honest answer: "Yes, particle count is the most important clue, followed by the shape."
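Here is a toy two-feature version of that confusion, written from scratch for illustration (it is not the paper's code, and every number in it is invented). A fake "tagger" depends only on particle count, yet a conditional-style Shapley attribution still hands credit to the correlated width feature:

```python
# Toy Shapley sketch: tagger score f depends ONLY on multiplicity n,
# but width w is strongly correlated with n.
import numpy as np

rng = np.random.default_rng(1)
N = 20_000
n = rng.normal(size=N)                                    # multiplicity proxy
w = 0.9 * n + np.sqrt(1 - 0.9**2) * rng.normal(size=N)    # width, corr ~0.9

f = n.copy()   # the "tagger": width is genuinely irrelevant

# Exact Shapley value of w for two features, conditional flavour:
#   phi_w = 0.5*(E[f|w] - E[f]) + 0.5*(f(n,w) - E[f|n])
# The second term vanishes because f is fully determined by n.
# For (near-)Gaussian data, E[n|w] is just a linear regression on w.
rho = np.cov(n, w)[0, 1] / np.var(w)
phi_w_cond = 0.5 * (rho * w - n.mean())

# Interventional flavour: changing w while holding n fixed never
# moves f, so the attribution to w is identically zero.
phi_w_interv = np.zeros(N)

print(f"mean |phi_w|, conditional:     {np.abs(phi_w_cond).mean():.3f}")
print(f"mean |phi_w|, interventional:  {np.abs(phi_w_interv).mean():.3f}")
```

The conditional attribution gives width a sizeable share of the credit purely because it co-varies with multiplicity — the same kind of "blaming the wrong ingredient" the authors ran into, and the reason they de-correlated the clues before trusting SHAP.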

3. The "Recipe" Hunt (Symbolic Regression)

This is the coolest part. The authors wanted to turn the computer's complex decision-making into a simple math equation that a human could write on a napkin. They used a method called Symbolic Regression.

Think of this as a genetic algorithm that evolves math formulas. It starts with simple guesses like 1 / (number of particles) and mutates them, trying thousands of combinations until it finds a formula that closely reproduces the network's output.

The Result:
They found that the computer's complex "brain" could be replaced by a surprisingly compact formula.

  • For a single clue, the formula looked like: Probability = tanh(1 / number of particles).
  • For the full set of clues, the formula was a bit longer but still readable: It combined the "spread" of the jet, the "diversity" of the particles, and the "energy balance" into one neat equation.
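As a caricature of this recipe hunt, the sketch below (invented for illustration, not the paper's actual tool) scores a tiny pool of candidate formulas against a stand-in "network" and keeps the best one. Real symbolic-regression tools such as PySR evolve whole expression trees rather than checking a fixed list — and note the toy is circular by construction, since the true formula is planted in the pool:

```python
# Minimal caricature of symbolic regression: rank candidate formulas
# by how well they reproduce a stand-in "network output".
import math

def network(n):
    # Stand-in for the trained tagger's quark probability.
    # The real tagger is a neural network; this is an invented curve.
    return math.tanh(1.0 / n)

candidates = {
    "1/n":        lambda n: 1.0 / n,
    "tanh(1/n)":  lambda n: math.tanh(1.0 / n),
    "exp(-n)":    lambda n: math.exp(-n),
    "1/sqrt(n)":  lambda n: 1.0 / math.sqrt(n),
}

counts = range(1, 41)   # toy particle multiplicities

def mse(formula):
    # Fitness: mean squared error against the "network" over the counts.
    return sum((formula(n) - network(n)) ** 2 for n in counts) / len(counts)

best = min(candidates, key=lambda name: mse(candidates[name]))
print(best)   # → "tanh(1/n)": the planted true formula fits exactly
```

A real run replaces the fixed pool with mutation and crossover of expression trees, and the "network" with the actual tagger — but the core loop (propose a formula, score it against the black box, keep the winner) is the same.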

4. Why This Matters

Why go through all this trouble?

  • Trust: If a computer says "This is a quark," a physicist can now look at the formula and say, "Ah, I see. It's a quark because the particle count is low and the shape is narrow." This builds trust.
  • Speed: A neural network is heavy and slow to run on massive datasets. A simple math formula is lightning fast. In the future, these formulas could replace the heavy computers in real-time experiments.
  • Discovery: By understanding how the computer thinks, we might discover new physics patterns that we missed before.

The Big Picture

Imagine you have a super-intelligent alien who can identify a tiger from a lion instantly. You ask, "How do you do it?"

  • Old way: The alien says, "Trust me, I just know." (Black Box).
  • This paper's way: The alien says, "I look at the stripe pattern, the ear shape, and the tail length. If I combine these three numbers using this specific formula, I get the answer."

The authors successfully translated the "alien language" of deep learning into the "human language" of physics formulas, proving that even the most complex AI can be understood, explained, and simplified.
