Neural Scaling Laws for Jet Generation

Original authors: Oz Amram, Darius A. Faroughy, Tjarko Gerdes, Anna Hallin, Gregor Kasieczka, Michael Krämer, Humberto Reyes-Gonzalez, David Shih

Published 2026-05-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Teaching a Robot to "Dream" Particle Collisions

Imagine you are trying to teach a robot to paint. In the world of Artificial Intelligence (AI), there is a famous rule called a "Scaling Law." It basically says: If you give the robot a bigger brain (more parameters), more paint samples (more data), or more time to paint (more computing power), it will get better at painting in a predictable, mathematical way.

This paper asks a simple question: Does this rule work for particle physics?

Specifically, the researchers wanted to see if they could train a robot to "dream up" (generate) realistic particle jets. In particle physics, when protons smash together, they spray out clouds of particles called jets. These are messy, chaotic, and follow the laws of quantum mechanics. The team trained a model called OmniJet-α to learn the patterns of these jets and then generate new, fake ones that look just like the real thing.

The Three Ingredients for Success

To test their theory, the researchers tweaked three main ingredients, just like a chef adjusting a recipe:

Model Size (The Brain): They made the AI's "brain" bigger and bigger, from a tiny "Pico" brain to a massive "XXL" brain.
Dataset Size (The Textbook): They fed the AI more and more examples of real jets, ranging from a few million to hundreds of millions.
Compute (The Time/Effort): They gave the AI different amounts of computing power to study the data.

What They Found: The "Easy" Part vs. The "Hard" Part

1. The Brain Gets Bigger (Model Size) → Success!

When they made the AI's brain bigger, it got significantly better at its job.

The Analogy: Imagine a student taking a test. As you give them a bigger brain (more knowledge), their test score goes up in a smooth, predictable curve.
The Result: The paper found a clear mathematical rule here. Bigger models = better predictions.
The Bonus: They checked whether a better test score actually meant better physics. They measured how well the "fake" jets matched real physics quantities (using something called the Sliced Wasserstein Distance). They found that as the test scores went up, the physics quality went up too. The math and the physics moved together.

2. The Textbook Gets Bigger (Dataset Size) → Not Much Change

When they fed the AI more data, the improvement was surprisingly small.

The Analogy: Imagine a student who has already read the entire encyclopedia. If you give them another encyclopedia, they don't learn much more because they've already mastered the basics.
The Result: The AI seemed to hit a "ceiling" very quickly. Even with a small amount of data, it learned almost everything it could about the general shape of the jets. Adding more data didn't help much because the AI had already learned the "easy" stuff.

3. More Time/Effort (Compute) → Flat Lines

When they gave the AI more computing power to train, the results didn't improve much either.

The Analogy: Imagine a student who finishes a test in 10 minutes and gets an A. If you give them 10 hours to take the same test, they won't get an A+; they just get bored.
The Result: The AI learned so fast that even small models reached their maximum potential very quickly. Giving them more time to study didn't make them smarter.

The Secret Sauce: The "Learnable Window"

Why did the AI stop learning so fast? The authors introduced a clever concept called the "Learnable Window."

The Concept: Think of the total information in the data as a big room. Some of the room is filled with clear, learnable patterns (the "window"). The rest of the room is filled with pure chaos and randomness (noise).
The Discovery: In language models (like the ones that write this text), the "window" is huge. There is so much structure in language that a bigger brain can keep finding new patterns for a long time.
The Twist: In particle jets, the "window" is tiny. Because particle physics is governed by quantum mechanics, it is inherently stochastic (random). The AI quickly learned all the predictable patterns, and the rest of the data was just random noise that no amount of brainpower could predict.
The Metaphor: It's like trying to predict the exact path of a single raindrop in a storm. You can learn the general pattern of the storm (the wind, the clouds), but the specific path of one drop is random. The AI learned the storm quickly, but it couldn't learn the randomness of the drop, no matter how big its brain got.

The Conclusion

This paper is the first to systematically study neural scaling laws for generating particle jets from real collider data, and it finds that they behave differently than they do for language.

Good News: Bigger models do work, and they get better at physics.
The Catch: The AI hits a wall very quickly because the data is naturally random. You can't just throw infinite money and data at the problem to get infinite improvements; the "randomness" of the universe sets a hard limit on how well the AI can predict.

In short: The AI is a brilliant student, but the subject matter (quantum physics) is so chaotic that even the smartest student can only learn so much before they start guessing.

Technical Summary: Neural Scaling Laws for Jet Generation

Problem Statement
Neural scaling laws, which describe power-law relationships between model performance and dataset size, compute, and model parameters, have become central to modern artificial intelligence, particularly in large language models (LLMs). However, their applicability to high-energy physics (HEP) remains an open question. Collider data differs qualitatively from natural language and vision data: it is highly stochastic, because QCD radiation is a quantum process, yet constrained by physical dynamics. Furthermore, while scaling laws have been observed in supervised jet classification tasks, their behavior in generative modeling, specifically for particle jets, is less understood. This work investigates whether empirical scaling laws hold for the task of generating particle jets using foundation models, and whether improvements in the training objective (next token prediction) translate to improvements in physically meaningful observables.

Methodology
The study utilizes OmniJet-α, an autoregressive GPT-style transformer trained on tokenized jet constituents via next token prediction (NTP). The model converts jet constituents (kinematic features like transverse momentum $p_T$ and relative angles) into integer tokens using a Vector Quantized Variational Autoencoder (VQ-VAE) with a codebook size of 32,768.
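
The tokenization step can be illustrated with a minimal sketch of the nearest-codebook lookup at the heart of vector quantization. This is not the OmniJet-α implementation: the learned VQ-VAE encoder is omitted, the codebook is random rather than trained, and the function name `quantize`, the three-feature layout, and the 40-constituent jet are all assumptions made for illustration. Only the codebook size of 32,768 comes from the text.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook vector."""
    # Squared Euclidean distances, shape (n_constituents, codebook_size),
    # expanded as |f|^2 - 2 f.c + |c|^2 to avoid building a 3-D array.
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32768, 3))  # toy random codebook, NOT a trained VQ-VAE
jet = rng.normal(size=(40, 3))          # 40 hypothetical constituents, 3 features each
tokens = quantize(jet, codebook)        # one integer token per constituent
```

Each jet becomes a sequence of integers in the range 0 to 32,767, which is the kind of input a GPT-style next-token model consumes.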

The research is conducted on the Aspen Open Jets (AOJ) dataset, derived from CMS Open Data, containing approximately 180 million reconstructed jets from proton-proton collisions. This represents the first investigation of neural scaling laws on experimentally recorded collider data rather than Monte Carlo simulations.

The study is divided into three phases to analyze scaling with respect to:

Model Size ( $N$ ): Varying the number of non-embedding parameters from 25k to 85 million while keeping dataset size and compute budget fixed.
Dataset Size ( $D$ ): Varying the number of unique training tokens from $6.4 \times 10^6$ to $8.1 \times 10^9$ with a fixed model architecture.
Compute ( $C$ ): An isoFLOP analysis varying model size and training steps for fixed compute budgets to identify compute-optimal scaling.
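
The isoFLOP idea can be sketched as follows, under two assumptions that are not taken from the paper: the common language-model rule of thumb $C \approx 6ND$ for training FLOPs, and a parabola in $\log_{10} N$ as the fit function. The helper names, the budget, the model sizes, and the losses are all made up for illustration.

```python
import numpy as np

def tokens_for_budget(compute_flops, n_params):
    """Training tokens affordable at a fixed budget, assuming C ~ 6 * N * D."""
    return compute_flops / (6.0 * n_params)

def isoflop_optimum(n_params, losses):
    """Fit a parabola to loss vs. log10(N); return the model size at its minimum."""
    x = np.log10(n_params)
    a, b, _ = np.polyfit(x, losses, deg=2)
    if a <= 0:
        raise ValueError("fit has no minimum; the isoFLOP curve is too flat or noisy")
    return 10.0 ** (-b / (2.0 * a))

# Made-up example: one compute budget, five model sizes, synthetic losses.
budget = 1e15
sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
tokens = tokens_for_budget(budget, sizes)            # larger models see fewer tokens
losses = 7.3 + 0.02 * (np.log10(sizes) - 6.0) ** 2   # synthetic, minimum placed at N = 1e6
n_opt = isoflop_optimum(sizes, losses)
```

Repeating this for several budgets and fitting the optimal size against compute on log-log axes would give a compute-optimal exponent. When the curvature `a` is close to zero, the minimum is poorly constrained, so small noise in the losses moves the estimated optimum a long way.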

Two primary metrics are evaluated:

NTP Validation Loss: The standard cross-entropy loss for the next token prediction task.
Sliced Wasserstein Distance (SWD): A statistical metric computed on five high-level jet observables ( $p_T$ , mass $m$ , $\tau_{21}$ , $\tau_{32}$ , and constituent count $n$ ) that were not directly available to the model during training. This measures the quality of the generated jets in physics space.
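
A minimal sketch of a sliced Wasserstein distance is given below. The paper's exact settings (number of projections, normalization of the observables, order of the distance) are not stated here, so the choices in this code are assumptions: 128 random directions, the Wasserstein-1 distance, and equally sized samples. The Gaussian arrays stand in for the five observables and are purely synthetic.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections.

    x, y: arrays of shape (n_samples, n_features) with the same n_samples.
    """
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized samples")
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project, then sort each 1-D projection. For equal sample sizes the
    # 1-D Wasserstein-1 distance is the mean absolute gap between sorted values.
    px = np.sort(x @ directions.T, axis=0)
    py = np.sort(y @ directions.T, axis=0)
    return np.abs(px - py).mean()

rng = np.random.default_rng(1)
real = rng.normal(size=(5000, 5))           # stand-in for observables of real jets
good = rng.normal(size=(5000, 5))           # same distribution, independent sample
bad = rng.normal(loc=0.5, size=(5000, 5))   # shifted distribution
swd_good = sliced_wasserstein(real, good)
swd_bad = sliced_wasserstein(real, bad)
```

Here `swd_good` is small but not zero, because two finite samples from the same distribution never match exactly. That is the kind of finite-sample floor a generative model can approach but not beat. In practice the observables would usually be standardized first so that no single one dominates the projections.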

The authors introduce the concept of a "learnable window" ( $W$ ), defined as the gap between the loss of a uniform predictor ( $\log V$ ) and the irreducible entropy floor of the dataset ( $H(p)$ , estimated by the asymptotic loss $L_\infty$ ), that is, $W = \log V - L_\infty$ . This quantity, measured in nats, indicates how much of the total loss range is learnable structure rather than intrinsic stochasticity.
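
As a quick arithmetic check, the window can be recomputed from two numbers quoted in this summary: the codebook size of 32,768 and the effective perplexity of 1330 reported under Key Results. The variable names are illustrative, and the `fraction` line is an extra convenience ratio rather than a quantity defined in the paper.

```python
import math

vocab_size = 32_768        # VQ-VAE codebook size V from the methodology
perplexity_floor = 1330.0  # effective perplexity e^{L_inf} from the key results

uniform_loss = math.log(vocab_size)      # loss of a uniform predictor, log V, in nats
loss_floor = math.log(perplexity_floor)  # asymptotic loss L_inf, in nats
window = uniform_loss - loss_floor       # learnable window W, about 3.2 nats
fraction = window / uniform_loss         # share of log V that is learnable, about 0.31
```

The result agrees with the quoted window of roughly 3.2 nats: a uniform guess costs about 10.4 nats per token, and about 7.2 of those cannot be removed by any model.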

Key Results

Model Size Scaling: The study confirms a clear power-law scaling behavior for the NTP validation loss as a function of model size ( $L(N) - L_\infty \propto N^{-\beta_N}$ ), with a scaling exponent $\beta_N$ of approximately 0.43. Crucially, the SWD metric exhibits a monotonic correlation with the NTP loss, indicating that improvements in the training objective directly translate to better modeling of physical observables. The SWD values approach the intrinsic statistical floor associated with finite-sample comparisons of real data.
Dataset and Compute Scaling: Scaling with dataset size and compute yields substantially weaker signals. While the data is compatible with power-law interpretations, the dynamic range is small, and statistical uncertainties are large. The models appear to saturate rapidly; even the smallest models capture a vast majority of the learnable structure.
The Learnable Window: A striking finding is the small size of the learnable window for jet generation compared to language modeling. For OmniJet-α, the learnable window $W$ is approximately 3.2 nats, compared to ~8.7 nats in comparable language model studies. Consequently, the effective perplexity ( $e^{L_\infty}$ ) is 1330, significantly higher than the ~5.4 observed in language models. This suggests that the dominant structures in the jet distribution are learned with relatively modest resources, and the remaining loss is dominated by intrinsic stochasticity rather than reducible error.
IsoFLOP Curves: The isoFLOP curves (loss vs. model size for fixed compute) are unusually flat, lacking the distinct "U-shape" with a clear left flank seen in language models. This makes the extraction of a compute-optimal model size highly uncertain, though a parabolic fit suggests an optimal scaling exponent $a \approx 0.92$ for model size vs. compute.
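
One way to extract an exponent and a floor from loss-versus-size measurements is sketched below. It is not the authors' fitting procedure, which is not described here. The sketch assumes the form $L(N) = A\,N^{-\beta} + L_\infty$, where the amplitude $A$ is a symbol introduced only for this example, and it scans candidate floors because $\log(L - L_\infty)$ is linear in $\log N$ once the floor is fixed. The data are synthetic, generated with $\beta = 0.43$ and a floor near $\ln 1330 \approx 7.19$ so that the fit has a known answer.

```python
import numpy as np

def fit_power_law_with_floor(n_params, losses, n_grid=2000):
    """Fit L(N) = A * N**(-beta) + L_inf by scanning candidate floors L_inf.

    For each candidate floor, log(L - L_inf) is fitted with a straight line in
    log(N); the floor with the smallest squared residual in log space wins.
    """
    log_n = np.log(n_params)
    lowest = losses.min() - 2.0 * (losses.max() - losses.min())
    best = None
    # endpoint=False keeps every candidate floor strictly below the smallest loss.
    for floor in np.linspace(lowest, losses.min(), n_grid, endpoint=False):
        log_excess = np.log(losses - floor)
        slope, intercept = np.polyfit(log_n, log_excess, deg=1)
        resid = np.sum((slope * log_n + intercept - log_excess) ** 2)
        if best is None or resid < best[0]:
            best = (resid, np.exp(intercept), -slope, floor)
    _, amplitude, beta, floor = best
    return amplitude, beta, floor

# Synthetic, noise-free losses over the model-size range quoted in the text.
sizes = np.geomspace(25e3, 85e6, 8)
losses = 30.0 * sizes ** (-0.43) + 7.19   # amplitude 30 is made up
amplitude, beta, floor = fit_power_law_with_floor(sizes, losses)
```

With real, noisy measurements the floor and the exponent are strongly correlated, so a narrow dynamic range in the loss, as reported for dataset and compute scaling, leaves both poorly determined.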

Significance and Claims
The paper claims to be the first to systematically explore neural scaling laws for jet generative models on real collider data. Its primary contributions are:

Validation of Scaling Laws: It demonstrates that power-law scaling laws for model size do exist in jet generation and that the NTP loss is a reliable proxy for physical performance (SWD).
Rapid Saturation: It identifies that autoregressive jet generation saturates much faster than language modeling, likely due to the stochastic nature of QCD radiation and the dominance of "featureless" QCD jets in the dataset.
Learnable Window Concept: By introducing the learnable window, the authors provide a framework to explain why scaling gains are weak in this domain: the "learnable" portion of the data distribution is small relative to the total entropy.
Domain Specificity: The results suggest that scaling behaviors in HEP are sensitive to the task structure. While supervised jet classification shows continued scaling over large ranges, generative modeling of generic QCD jets approaches saturation early. This implies that pre-training strategies successful in language may require domain-specific adaptations for particle physics, particularly regarding codebook resolution and the ordering of constituents.

The authors conclude that while scaling laws are present, the diminishing returns and rapid saturation observed in this study highlight the unique challenges of unsupervised pre-training on particle physics data, where the underlying physics imposes a high degree of irreducible stochasticity.