Neural Scaling Laws for Boosted Jet Tagging

This paper investigates neural scaling laws for boosted jet classification using the JetClass dataset, demonstrating that increasing compute reliably drives performance toward asymptotic limits while revealing how data repetition, input features, and particle multiplicity influence scaling efficiency and effective dataset size.

Original authors: Matthias Vigl, Nicole Hartman, Michael Kagan, Lukas Heinrich

Published 2026-02-18

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a computer to distinguish between a "top quark jet" (a spray of particles from a heavy, rare particle) and a "QCD jet" (a common spray from ordinary particles). This is like trying to tell the difference between a rare, exotic fruit and a common apple just by looking at a pile of seeds.

For a long time, scientists in High Energy Physics (HEP) have been building better and better "fruit classifiers," but they haven't been using the same massive amounts of computing power that companies like OpenAI use to build giant language models (like the ones behind modern chatbots).

This paper asks a simple question: If we just keep adding more brainpower (computing power) and more data, will the computer get infinitely better at this task, or is there a ceiling?

Here is the breakdown of their findings using everyday analogies:

1. The "Recipe" for Success (Scaling Laws)

The authors discovered that improving these models follows a predictable recipe, similar to how baking a cake works.

  • The Ingredients: You need Model Size (how smart the brain is) and Data Size (how many examples it studies).
  • The Rule: If you double your computing power, you shouldn't just double the brain size or double the data. You need a specific balance. The paper found the "Golden Ratio" for mixing these ingredients to get the best results for the least amount of effort (a toy version of this balancing act is sketched in the code after this list).
  • The Result: As you add more computing power, the error rate drops smoothly, like a ball rolling down a hill. But eventually, the ball hits the bottom of the valley.
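
To make the recipe concrete, here is a minimal sketch in Python of the kind of parametric scaling law used in studies like this one. The functional form is the common "Chinchilla-style" ansatz L(N, D) = E + A/N^alpha + B/D^beta, the compute rule C ≈ 6·N·D is a standard transformer approximation, and every constant below is an invented placeholder, not a value fitted by the authors:

```python
import numpy as np

# Hypothetical Chinchilla-style loss surface. All constants are invented
# placeholders for illustration, NOT the paper's fitted values.
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = model parameters, D = training jets,
# E = the irreducible "floor" discussed in section 2 below.
E, A, B, alpha, beta = 0.05, 25.0, 60.0, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_split(C, n_grid=10_000):
    """For a compute budget C ~ 6*N*D (a standard transformer
    approximation), find the (N, D) pair that minimizes the loss."""
    N = np.logspace(5, 10, n_grid)   # candidate model sizes
    D = C / (6.0 * N)                # dataset size implied by the budget
    L = loss(N, D)
    i = int(np.argmin(L))
    return N[i], D[i], L[i]

for C in (1e17, 1e19, 1e21):
    N, D, L = optimal_split(C)
    print(f"C={C:.0e}: best N={N:.2e} params, D={D:.2e} jets, loss={L:.4f}")
```

The grid search reproduces the qualitative story above: as the budget C grows, the optimal model size and dataset size grow together, and the loss slides smoothly down toward the floor E.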

2. The "Ceiling" (The Irreducible Limit)

Here is the most important finding: the model can never be perfect.
Even if you give the computer infinite brainpower and infinite data, it will never reach 100% accuracy. There is a floor below which the error rate simply will not drop.

  • The Analogy: Imagine trying to hear a whisper in a noisy room. No matter how good your ears are (model size) or how many times you listen (data), if the room is too noisy, you will never hear the whisper perfectly.
  • The Twist: The "noise level" in this case is set by the input features. If you only give the computer basic info (like "how heavy is the fruit?"), the ceiling is low. But if you give it detailed info (like "what is the texture, color, and smell?"), the ceiling rises much higher. The paper shows that feeding the computer more detailed, "lower-level" particle data lets it reach a much better asymptotic performance (the formula after this list makes the ceiling precise).
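
One compact way to state this ceiling is a saturating power law in training compute. The form below is the standard one in scaling-law work and is meant as a sketch of the idea, not the paper's exact fit:

```latex
% Compute frontier as a saturating power law (illustrative form):
%   C     : total training compute
%   a, b  : positive fitted constants
%   L_inf : the irreducible floor, set by how informative the inputs are
%           (low-level particle features push it lower than high-level ones)
L(C) \;=\; L_{\infty} + a\,C^{-b}, \qquad \lim_{C\to\infty} L(C) = L_{\infty}
```

Better input features do not change the shape of the curve; they lower the floor L_inf that the curve is falling toward.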

3. The "Re-reading" Problem (Data Repetition)

In physics, creating new data is incredibly expensive, because every training example has to be produced by detailed physics simulations running on supercomputers. So, scientists often just make the computer read the same dataset over and over again (multiple "epochs").

  • The Analogy: It's like studying for a test by reading the same textbook chapter 10 times instead of reading 10 different chapters.
  • The Finding: Re-reading helps, but it's inefficient.
    • If you have a small textbook, reading it 10 times helps you memorize it well.
    • But eventually, you hit a point where reading it 11 times doesn't help at all; you just start memorizing the typos (overfitting).
    • The Cost: Re-reading the same data 10 times costs about 10 times the computing power of a single pass, but it teaches the model far less than 10 times as much fresh data would (the toy model after this list shows how the returns shrink).
    • The Lesson: It's usually better to generate new, unique data than to keep re-reading the old stuff, unless generating new data is impossible.
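
Here is a toy diminishing-returns model of re-reading. The geometric decay factor is a hypothetical choice made purely for illustration; the paper fits its own effective-data law, so take only the qualitative saturation from this sketch:

```python
import numpy as np

# Toy model: each extra pass over the same unique dataset U is worth a
# fraction `decay` of the previous pass, so the "effective" dataset size
# saturates at U / (1 - decay) no matter how many epochs you run.
# The decay value is invented for illustration, not fitted to the paper.
def effective_data(U, epochs, decay=0.65):
    passes = decay ** np.arange(epochs)   # 1, d, d**2, ...
    return U * passes.sum()               # geometric sum

U = 10_000_000  # unique simulated jets
for epochs in (1, 2, 5, 10, 20):
    eff = effective_data(U, epochs)
    print(f"{epochs:>2} epochs: {epochs * U:.1e} jets processed, "
          f"~{eff:.1e} effective, overhead x{epochs * U / eff:.1f}")
```

By 20 epochs the model has processed 2e8 jets but learned from the equivalent of well under 3e7 fresh ones; that widening gap is the compute overhead described in "The Cost" above.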

4. The "Overfitting" Threshold

The paper also figured out exactly how big the computer's brain can get before it starts "memorizing" instead of "learning."

  • The Analogy: If you have a tiny brain and a huge textbook, you will get confused (underfitting). If you have a giant brain and a tiny textbook, you will memorize every word but fail to understand the concept (overfitting).
  • The Discovery: There is a specific "sweet spot" where the brain size matches the data size. Beyond that point, making the brain bigger doesn't help unless you also get more data (a rough sketch of such a threshold rule follows below).
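
As a rough sketch, a threshold like this is often summarized as a power-law rule of thumb: memorization sets in around N ≈ k·D^gamma parameters for D training examples. The constants below are invented purely to show the shape of such a rule, not taken from the paper:

```python
# Hypothetical capacity threshold: beyond roughly max_useful_params(D)
# parameters, a bigger model memorizes its D training jets instead of
# learning from them. k and gamma are invented illustration values.
def max_useful_params(D, k=5.0, gamma=0.75):
    return k * D**gamma

for D in (1e6, 1e7, 1e8):
    print(f"D={D:.0e} jets -> overfitting starts near "
          f"N~{max_useful_params(D):.1e} params")
```

The practical reading is the one in the bullet above: once your model crosses that line, the next resource to buy is data, not parameters.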

5. Why This Matters for the Future

The authors used these rules to predict the future of particle physics.

  • They can now tell scientists: "If you want to improve your particle detector by 10%, you need to spend X amount of money on computing and Y amount on generating new data." (The sketch after this list shows the arithmetic behind such a forecast.)
  • They also realized that simulation quality might be the real bottleneck. Even with a perfect computer, if the "simulated universe" data isn't perfect, the computer can't learn the truth.
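
Budgeting of this kind amounts to inverting the compute frontier from section 2: solve L(C) = L_inf + a·C^(-b) for the compute C that reaches a target loss. A minimal sketch, again with placeholder constants rather than the paper's fits:

```python
# Invert L(C) = L_inf + a * C**(-b) to budget compute for a target loss.
# All three constants are illustrative placeholders.
L_inf, a, b = 0.05, 2.0, 0.15

def compute_needed(target_loss):
    gap = target_loss - L_inf
    if gap <= 0:
        raise ValueError("Target is at or below the irreducible floor.")
    return (a / gap) ** (1.0 / b)

current, goal = 0.10, 0.09   # a 10% lower loss, say
print(f"loss {current}: C ~ {compute_needed(current):.2e}")
print(f"loss {goal}:    C ~ {compute_needed(goal):.2e}")
```

Note how the cost explodes as the target approaches L_inf, which is exactly why the authors flag simulation quality (the height of the floor itself) as the eventual bottleneck.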

Summary

This paper is a user manual for the future of AI in physics. It tells us:

  1. Keep scaling up: Bigger models and more data work, but you need to balance them correctly.
  2. There is a limit: You will eventually hit a performance wall, but you can push that wall higher by giving the AI better, more detailed data.
  3. Don't just re-read: It's better to get new data than to study the same data over and over, because re-reading gets expensive very quickly.

In short, they've turned the "black art" of training AI models into a predictable science, allowing physicists to plan their resources like a chef planning a massive banquet.
