Towards Engineering Scaling Laws with Pretraining Data… — Plain-Language Explanation

Original authors: Jan-Lucas Uslu, Kevin Greif, Daniel Whiteson, Benjamin Nachman

Published 2026-06-19

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a student to recognize different types of vehicles in a busy city. You have two main ways to help them learn: you can either give them a bigger brain (a larger model) or you can give them more practice problems (more data).

For a long time, scientists studying Artificial Intelligence (AI) have believed there is a "golden rule" for this. They thought that if you have a fixed amount of time and money (compute budget), the best way to get the smartest student is to split your resources roughly 50/50 between building a bigger brain and giving them more practice problems.

However, this new paper suggests that in the world of particle physics, we can engineer a better rule by changing what the student learns first.

The Setup: The Physics Classroom

The researchers are working with "jets." In particle physics, when tiny particles smash together, they spray out streams of other particles called jets. It's like a firework exploding, but instead of sparks, you get streams of subatomic particles.

The goal is to teach an AI to look at these streams and say, "Ah, this one came from a specific type of explosion!"

The Experiment: Changing the Textbook

The researchers tested two different "textbooks" (pretraining datasets) to see how they changed the learning rules:

The Boring Textbook (QCD only): This book only contained examples of "standard" particle explosions. It was like a driving school that only taught you how to drive a standard sedan.
The Diverse Textbook (BSM enhanced): This book included the standard examples plus complex, rare, and exotic explosions involving hypothetical particles that have not been observed in nature (simulated "Beyond the Standard Model" physics). It was like a driving school that taught you to drive sedans, but also race cars, trucks, and even flying vehicles.

The Discovery: Rewriting the Rules

When they trained the AI using the Boring Textbook, the old 50/50 rule held true. To get better results, you had to balance making the brain bigger and giving it more practice.

But when they used the Diverse Textbook, the rules changed completely. The researchers found that more practice problems became far more valuable than a bigger brain.

The Analogy: Imagine the AI trained on the diverse textbook is like a student who has already seen every type of vehicle imaginable. When you give them a new test, they don't need a bigger brain to understand the new car; they just need to see more examples of it to master it. Their "brain" doesn't need to grow as fast because their "experience" is so rich.

The Result: The New "Data-First" Strategy

The paper found that by using the diverse, exotic data for the initial training:

The "bigger brain" strategy became less important.
The "more data" strategy became the winner.

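In fact, the researchers found that as the computing budget grows, about 78% of the increase (measured on a logarithmic scale) should go toward more data and only about 22% toward a bigger model. For example, with ten times more computing power, the best plan is roughly six times more data and a model that is less than twice as large. This is a huge shift from the old 50/50 split.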

Why This Matters for Physics

The paper highlights a unique advantage of physics: We can make our own data.

In fields like medicine or language, getting new data is hard, expensive, or impossible (you can't just "simulate" a new human patient). But in particle physics, scientists use powerful computers to simulate particle collisions. They can generate practically unlimited amounts of high-quality, diverse data at relatively low cost (once the simulation is running).

The Takeaway:
If you are building a super-smart AI for physics, don't just try to build the biggest possible brain. Instead, spend your time and money engineering a better, more diverse curriculum for the AI to learn from first. Once the AI has seen a wide variety of "exotic" examples, it will learn faster and better from the specific task you give it, and you will get better results by feeding it more data rather than making the model larger.

In short: A well-chosen, diverse diet of training data is more powerful than a bigger brain.

Technical Summary: Towards Engineering Scaling Laws with Pretraining Data Composition

Problem Statement
Neural scaling laws describe how model performance improves as a power law in compute, model size, and dataset size. While well established for large language models (LLMs), these relationships are only beginning to be characterized in particle physics. A key distinction in fundamental physics is the ability to generate high-fidelity synthetic data via simulators at a relatively low cost compared to the computational expense of training larger models. This creates a unique opportunity to engineer the pretraining dataset itself to influence scaling behavior. The central question addressed is whether the composition of pretraining data, specifically its diversity and alignment with downstream tasks, can be engineered to shift the compute-optimal scaling regime from favoring larger models to favoring larger datasets.

Methodology
The study focuses on the task of classifying hadronic jets produced in high-energy particle collisions. The authors use a generic transformer architecture that processes jet data as a point cloud, varying model sizes from approximately 3,000 to 10.5 million parameters (spanning more than three orders of magnitude) while keeping depth and attention head dimensions fixed.
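As a rough illustration of how width alone can sweep the parameter count when depth and head dimension stay fixed, the sketch below uses the common approximation of about 12 × width² weights per transformer block (four attention projections plus a feed-forward layer with 4× expansion). The depth, head dimension, expansion factor, and widths are hypothetical values chosen for illustration, not the paper's configuration, and embeddings, biases, and normalization weights are ignored.

```python
# Hypothetical settings for illustration only; not taken from the paper.
DEPTH = 4          # number of transformer blocks (held fixed)
HEAD_DIM = 8       # dimension per attention head (held fixed)
MLP_EXPANSION = 4  # feed-forward hidden size = MLP_EXPANSION * width


def num_heads(width: int) -> int:
    """With a fixed head dimension, the head count grows with the width."""
    if width % HEAD_DIM != 0:
        raise ValueError("width must be a multiple of HEAD_DIM")
    return width // HEAD_DIM


def approx_block_params(width: int) -> int:
    """Weight matrices in one block, ignoring biases and layer norms."""
    attention = 4 * width * width                     # Q, K, V and output projections
    feed_forward = 2 * MLP_EXPANSION * width * width  # up- and down-projection
    return attention + feed_forward


def approx_model_params(width: int) -> int:
    """Approximate parameter count of the transformer stack only."""
    return DEPTH * approx_block_params(width)


for width in (8, 32, 128, 512):
    print(width, num_heads(width), approx_model_params(width))
```

Under these assumptions the count grows quadratically with width, so a modest range of widths is enough to cover several orders of magnitude in model size.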

The experimental design involves a two-stage training protocol:

Pretraining: Models are pretrained on subsets of the JetClass-II dataset, which contains 188 classes of simulated jets. The authors define four distinct pretraining subsets to manipulate diversity and alignment:
- QCD: Only jets initiated by light quarks or gluons (17 classes).
- QCD + res2p: QCD jets plus jets from two-body decays of Beyond the Standard Model (BSM) resonances.
- QCD + res34p: QCD jets plus jets from three- or four-body decays of BSM resonances.
- QCD + res2p + res34p: The full dataset including all BSM resonance decays.
- Note: BSM subsets introduce greater diversity (more process classes, broader phase-space coverage) and better alignment with the downstream task (multi-prong topologies) compared to QCD-only data.
Fine-tuning: Pretrained models are fine-tuned on the original JetClass dataset for a 10-class jet classification task (identifying light quarks/gluons, top quarks, W/Z bosons, and Higgs particles). This task requires identifying prong multiplicity and mass scales, which are well-represented in the BSM-augmented pretraining data but poorly represented in QCD-only data.
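The four pretraining subsets above are unions of three process groups, which can be sketched as a simple filter. The group names follow the subset labels in the text; the record layout (`id`, `group`) and the toy jets are hypothetical stand-ins, not the actual JetClass-II format.

```python
# Group names follow the subset labels in the text; the record layout is hypothetical.
PRETRAINING_SUBSETS = {
    "QCD": {"QCD"},
    "QCD + res2p": {"QCD", "res2p"},
    "QCD + res34p": {"QCD", "res34p"},
    "QCD + res2p + res34p": {"QCD", "res2p", "res34p"},
}


def select_subset(jets, subset_name):
    """Keep only jets whose process group belongs to the chosen subset."""
    groups = PRETRAINING_SUBSETS[subset_name]
    return [jet for jet in jets if jet["group"] in groups]


# Toy records standing in for simulated jets (not real JetClass-II entries).
toy_jets = [
    {"id": 0, "group": "QCD"},
    {"id": 1, "group": "res2p"},
    {"id": 2, "group": "res34p"},
    {"id": 3, "group": "QCD"},
]

for name in PRETRAINING_SUBSETS:
    print(name, [jet["id"] for jet in select_subset(toy_jets, name)])
```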

Scaling exponents are extracted by fitting power laws to the compute-optimal model size ($N^*$) and dataset size ($D^*$) as a function of total compute ($C$). The study compares these exponents across "scratch" training (no pretraining) and the various pretraining configurations.
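A minimal sketch of such a fit, assuming the power-law forms $N^* \propto C^a$ and $D^* \propto C^b$ and ordinary least squares in log-log space. The data here are synthetic and noise-free, generated with a known exponent so the fit can be checked. The relation $C \approx 6ND$ used to tie dataset size to compute is a common convention from the LLM literature and an assumption here, not necessarily the paper's definition of compute.

```python
import numpy as np


def fit_power_law(compute, optimum):
    """Fit optimum ~ prefactor * compute**exponent via a straight line in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(optimum), deg=1)
    return slope, np.exp(intercept)


# Synthetic illustration only; these are not measurements from the paper.
compute = np.logspace(12, 16, 9)       # hypothetical compute budgets
n_opt = 0.5 * compute ** 0.22          # pretend compute-optimal model sizes
d_opt = compute / (6.0 * n_opt)        # assumes C = 6 * N * D

a, _ = fit_power_law(compute, n_opt)
b, _ = fit_power_law(compute, d_opt)
print(f"a = {a:.2f}, b = {b:.2f}, a + b = {a + b:.2f}")
```

Whenever compute is proportional to the product of model size and dataset size, the two fitted exponents sum to one, which matches the pattern in the reported pairs.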

Key Results
The study demonstrates that pretraining data composition significantly alters the compute-optimal scaling exponents:

Scratch Training: Training from scratch yields exponents of $a \approx 0.52$ (model size) and $b \approx 0.48$ (dataset size), indicating a roughly balanced allocation of compute resources between model size and data, consistent with findings in LLMs.
QCD-Only Pretraining: Pretraining solely on QCD jets results in a marginal shift ($a \approx 0.53$, $b \approx 0.47$), suggesting that pretraining alone without specific alignment or diversity does not fundamentally change the scaling regime.
BSM-Augmented Pretraining: Including BSM resonance decays in the pretraining corpus causes a dramatic shift. With the full BSM-augmented dataset, the exponents shift to $a \approx 0.22$ and $b \approx 0.78$.
- This indicates a regime where the compute-optimal strategy heavily favors increasing the dataset size over increasing the model size.
- The shift represents a factor-of-2.3 reduction in the scaling exponent for model size compared to the scratch baseline.
- Fine-tuning loss curves confirm that BSM-enhanced pretraining consistently lowers loss across all model sizes, with benefits increasing for larger models.
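To make the exponents concrete, the sketch below computes how much the compute-optimal model size and dataset size would grow for a tenfold increase in compute, assuming the power laws $N^* \propto C^a$ and $D^* \propto C^b$ hold across that range. The exponent pairs are the rounded values quoted above; the tenfold multiplier is an arbitrary choice for illustration.

```python
def growth_factors(compute_multiplier, a, b):
    """Growth of optimal model size and dataset size when compute is scaled up."""
    return compute_multiplier ** a, compute_multiplier ** b


# Rounded exponents (a, b) as quoted in the results above.
REGIMES = {
    "scratch": (0.52, 0.48),
    "QCD-only pretraining": (0.53, 0.47),
    "BSM-augmented pretraining": (0.22, 0.78),
}

for label, (a, b) in REGIMES.items():
    n_growth, d_growth = growth_factors(10.0, a, b)
    print(f"{label}: model x{n_growth:.2f}, data x{d_growth:.2f}")
```

In the balanced regimes both factors come out close to three, while in the BSM-augmented regime the model grows by less than a factor of two and the dataset by roughly a factor of six.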

Key Contributions

Engineering Scaling Laws: The paper provides the first systematic study showing that pretraining data composition can be engineered to shift scaling exponents in fundamental physics. It demonstrates that diversity and downstream alignment in the pretraining corpus can move the optimal scaling regime toward data-favoring strategies.
Quantitative Shift: The work quantifies the shift from a balanced scaling regime ($a \approx b \approx 0.5$) to a strongly data-favoring regime ($a \approx 0.22$, $b \approx 0.78$) by incorporating BSM physics into pretraining.
Implications for Foundation Models: The results suggest that foundation models pretrained on diverse and aligned synthetic data can achieve optimal performance with smaller parameter counts, allowing saved compute budgets to be redirected toward generating additional training data.

Significance and Claims
The authors claim that this work identifies a new design space for scientific machine learning: the physics inputs to foundation model training. Unlike natural language or image domains where data curation is limited by availability, fundamental physics can leverage cheap, high-fidelity simulators to construct pretraining corpora that actively shape scaling laws.

The authors caution that, while pretraining on well-composed corpora opens a regime where downstream compute is best spent on more data rather than larger models, further work is required to verify whether these results generalize across different fine-tuning tasks, larger model scales, and different dataset sizes. The study does not claim to have solved all scaling challenges; it highlights pretraining composition engineering as an underexplored lever for maximizing the discovery potential of scientific foundation models.
