Imagine you are trying to teach students at very different levels, from a kindergartener to a PhD candidate, the same complex subject.
In the world of Artificial Intelligence, these "students" are Large Language Models (LLMs). Usually, when researchers train these models, they hit a wall: scale is unpredictable. A strategy that works perfectly for a small model might fail miserably for a giant one. It's like trying to use a recipe for a single cookie to bake a cake for a thousand people; the math just doesn't add up, and you often end up with a burnt mess or raw dough.
This paper, "Scaling with Collapse," introduces a revolutionary way to train these AI models that makes the process predictable, efficient, and even allows us to spot problems before they ruin the whole batch.
Here is the breakdown using simple analogies:
1. The Problem: The "Llama-2" Mess
The authors look at famous models like Llama-2. They noticed that when you train a 7-billion-parameter model and a 70-billion-parameter model, their "learning curves" (graphs showing how well they are learning over time) look completely different.
- The Analogy: Imagine two runners. One is a sprinter, the other a marathoner. If you plot their progress, their paths are chaotic and don't match. You can't tell if the marathoner is doing well just by looking at the sprinter's pace. This makes training huge models a game of "guess and check," which is incredibly expensive and slow.
2. The Discovery: "The Great Collapse"
The researchers found a secret sauce. If you tune the training settings just right, all the different-sized models suddenly start running on the exact same track.
They call this "Collapse."
- The Analogy: Imagine you have a map of a mountain. If you zoom in on a small hill or zoom out to a massive mountain range, the shape of the slope looks different. But if you normalize the map (adjust the scale), you find that every mountain has the exact same slope profile.
- In the paper, they show that if you fix three specific "knobs" (how much data the model sees per parameter, how fast the optimizer "forgets" old mistakes, and the learning rate schedule), the learning curves of a 300-million-parameter model and a 3.9-billion-parameter model collapse onto a single, universal line.
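The collapse idea can be pictured with a toy Python sketch. None of this is the paper's data or code: the power-law curves, the exponent, and the scale factors below are invented purely to show how two raw loss curves of different lengths and levels can land on one universal curve after normalization.

```python
import numpy as np

def loss_curve(num_steps, scale):
    # Hypothetical power-law loss decay; `scale` shifts the raw curve level.
    t = np.arange(1, num_steps + 1)
    return scale * (t / num_steps) ** -0.25

small = loss_curve(1_000, scale=3.0)   # "small model": short run, high loss
large = loss_curve(10_000, scale=1.5)  # "large model": long run, lower loss

# The raw curves differ in both length and level...
assert small.shape != large.shape

# ...but after normalizing (x-axis: fraction of training completed,
# y-axis: loss relative to final loss), both land on one universal curve.
frac_small = np.linspace(1 / 1_000, 1.0, 1_000)
frac_large = np.linspace(1 / 10_000, 1.0, 10_000)
norm_small = small / small[-1]
norm_large = large / large[-1]

# Compare the two normalized curves at shared x positions.
interp = np.interp(frac_small, frac_large, norm_large)
print(np.max(np.abs(norm_small - interp)))  # near zero: curves have collapsed
```

The same comparison on the raw curves would fail badly; the collapse only appears in the normalized coordinates.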
3. The Three Magic Knobs
To get this "Collapse" to happen, you have to tune three things perfectly:
- TPP (Tokens Per Parameter): How much "reading material" each part of the brain gets to study.
- Tau (the Memory Knob): This controls how long the model remembers its past mistakes. Think of it like a student's short-term memory. If the memory is too short, they forget what they just learned. If it's too long, they get stuck on old errors. The paper found a "Goldilocks" setting for this based on how much data is being used.
- The Learning Rate Schedule: How fast the model learns at the start versus the end.
When these three are set correctly, the models of all sizes behave identically.
4. Why This Matters: Two Superpowers
Once you have this "Collapse," you get two superpowers:
A. The "Canary in the Coal Mine" (Early Warning System)
Usually, when a training run goes wrong (due to a computer glitch or bad data), the loss curve (a running measure of the model's error) might look fine for a long time before suddenly spiking. By the time you see the spike, you've wasted millions of dollars in computing power.
- The Analogy: Imagine you are driving a fleet of cars. If one car starts to drift slightly off the universal road, you know immediately that something is wrong with that specific car's engine, even if the speedometer hasn't changed yet.
- The Result: The authors used this to spot a hidden computer error in their 1.8-billion-parameter model weeks before it would have been visible to the naked eye. They fixed it and saved the training run.
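The early-warning check can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' monitoring code: the reference curve, the tolerance, and the simulated "silent bug" are all made up.

```python
import numpy as np

def first_drift_step(observed, reference, tolerance=0.02):
    """Return the first step where |observed - reference| exceeds the
    tolerance band, or None if the run stays on track."""
    deviation = np.abs(observed - reference)
    drifted = np.nonzero(deviation > tolerance)[0]
    return int(drifted[0]) if drifted.size else None

steps = np.arange(1, 501)
reference = (steps / 500.0) ** -0.25           # the universal collapsed curve
healthy = reference + 0.005                    # normal run: tiny, stable offset
buggy = reference + 0.0002 * steps             # silent bug: slow, growing drift

print(first_drift_step(healthy, reference))    # None: stays on track
print(first_drift_step(buggy, reference))      # an early step index: caught long
                                               # before the drift becomes a spike
```

The point of the analogy is exactly this: without the shared reference curve, the buggy run's loss still looks plausible on its own, so the drift would go unnoticed until it blew up.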
B. The "Crystal Ball" (Early Stopping)
Training a giant AI model takes months and costs a fortune. Usually, you have to wait until the very end to see if your settings (like batch size or learning rate) were good.
- The Analogy: Imagine you are baking 100 different batches of cookies to find the perfect recipe. Normally, you have to wait until they are all fully baked and cooled to taste them. That takes forever.
- The Result: Because all the curves collapse onto a predictable path, the authors can look at a cookie after just 10% to 30% of the baking time. By comparing that partial cookie to their "universal recipe," they can predict how the final cookie will turn out. They can stop the bad batches early and save massive amounts of time and money.
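The early-stopping idea can be sketched as follows. This is a toy model, not the paper's procedure: if every well-tuned run traces a scaled copy of the same universal curve (all shapes and scale factors below are invented), then a run's loss at an early checkpoint already determines its final ranking.

```python
import numpy as np

steps = np.arange(1, 1001)
universal = (steps / 1000.0) ** -0.2    # shared collapsed curve shape

# Three hypothetical candidate configurations: under collapse, each run is
# a scaled copy of the universal curve, so early loss predicts final loss.
runs = {name: c * universal for name, c in
        [("config_a", 2.1), ("config_b", 1.8), ("config_c", 2.5)]}

checkpoint = 200  # peek at the runs 20% of the way through training

# Rank the runs by loss at the early checkpoint...
early_ranking = sorted(runs, key=lambda n: runs[n][checkpoint - 1])
# ...and by their true final loss.
final_ranking = sorted(runs, key=lambda n: runs[n][-1])

print(early_ranking == final_ranking)  # True: the early peek picks the winner
```

In this toy setup the two rankings agree exactly, so the two worse configurations could be stopped at the 20% mark, which is the source of the compute savings the section describes.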
5. The Result: "Celerity"
Using these insights, the team built a new family of models called Celerity.
- They trained these models to be highly efficient (using less computing power to get the same smarts).
- They proved that by sticking to the "Collapse" rules, they could build models that compete with the biggest, most expensive models out there, but with a fraction of the waste.
Summary
This paper is like finding the universal grammar of learning for AI.
- Before: Training AI was like trying to teach a class where every student needed a completely different, unpredictable lesson plan.
- Now: We found that if we set the right "memory" and "reading speed," every student, from the smallest to the largest, learns in the exact same rhythm.
- The Benefit: We can now predict the future of a training run, spot errors instantly, and stop wasting money on bad experiments. It turns AI training from a chaotic gamble into a precise, predictable science.