What Scales in Cross-Entropy Scaling Law?

This paper challenges the validity of the traditional cross-entropy scaling law at large scales. It introduces a decomposition showing that only the "error-entropy" component follows a robust power law while the other components remain scale-invariant, which explains the law's apparent breakdown and establishes error-entropy as a more reliable metric for guiding large language model development.

Junxi Yan, Zixi Wei, Qingyao Ai, Yiqun Liu, Jingtao Zhan

Published 2026-03-03

The Big Picture: The "Magic Formula" That Stopped Working

Imagine you are building a giant library of knowledge (a Large Language Model, or LLM). For years, scientists believed in a "Magic Formula" called the Cross-Entropy Scaling Law.

The Old Belief:
The formula said: "If you double the size of your brain (the model) and the amount of books you read (the data), your mistakes will drop by a predictable, steady amount." It was like a recipe that guaranteed better results the more ingredients you added.

The Problem:
Recently, scientists noticed something weird. When they built massive models (the super-giant ones), the "Magic Formula" stopped working. The models got better, but much slower than the recipe predicted. It was like adding more flour to a cake, but the cake stopped rising. Everyone was confused: Why did the rule break?

The Investigation: Taking the Cake Apart

The authors of this paper decided to investigate. They hypothesized that the "Cross-Entropy" score (the measure of mistakes) wasn't a single, solid thing. Instead, they thought it was a mixture of three different ingredients hiding inside.

To find out what was really happening, they invented a new way to look at how a model makes a mistake. Instead of just looking at the probability (a number between 0 and 1), they looked at the Rank (the position on a list).

  • Analogy: Imagine a teacher grading a test.
    • Old Way: The teacher looks at the score: "You got a 92%."
    • New Way (Rank-Based Error): The teacher looks at the ranking: "You got the 3rd best answer out of 100 options."
    • The authors argue that the ranking is a more honest measure of intelligence than the specific number score.
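To make the rank idea concrete, here is a minimal sketch (not the paper's code) of how one might turn a model's probability vector into the rank of the correct answer:

```python
def rank_of_correct(probs, correct_idx):
    """1-based rank of the correct candidate when all candidates
    are sorted from most to least probable."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order.index(correct_idx) + 1

# Toy example: 5 candidate answers, the correct one is index 4.
# Its probability (0.20) is only the 3rd highest, so its rank is 3.
probs = [0.40, 0.05, 0.25, 0.10, 0.20]
print(rank_of_correct(probs, correct_idx=4))  # → 3
```

The "old way" would report 0.20; the rank-based view reports "3rd best out of 5", which is the quantity the authors argue scales cleanly.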

The Three Ingredients (The Decomposition)

They mathematically split the "Mistake Score" into three distinct parts:

  1. Error-Entropy (The "Knowing What's Right" Part):

    • What it is: This measures how well the model knows the correct answer is at the top of the list.
    • The Metaphor: This is the student actually studying and learning the material. As the student gets smarter, they put the right answer at #1 more often.
    • The Finding: This part DOES follow the Magic Formula. It gets better and better as the model grows.
  2. Self-Alignment (The "Confidence Matching" Part):

    • What it is: This measures how well the model's internal confidence matches its actual ranking.
    • The Metaphor: This is the student calibrating how sure they sound. A well-aligned student says "I'm 100% sure!" only when they are actually right, and "I'm maybe 10% sure" when they are probably wrong.
    • The Finding: This part DOES NOT scale. It stays random and messy, regardless of how big the model gets.
  3. Confidence (The "Arrogance" Part):

    • What it is: This measures how high the probability scores are.
    • The Metaphor: This is the student shouting "I AM RIGHT!" very loudly. A model can be very loud (high confidence) even if it's not actually learning the material faster.
    • The Finding: This part DOES NOT scale. It just fluctuates.
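The paper's exact definitions are mathematical, but the flavor of the three ingredients can be sketched in code. The functions below are illustrative stand-ins, not the paper's formulas: error-entropy is estimated as the entropy of "where does the right answer land in the ranking?", self-alignment as the gap between that rank distribution and the model's own stated confidence per rank, and confidence as the average mass on the top-ranked choice.

```python
import math
from collections import Counter

# Illustrative stand-ins for the three ingredients -- NOT the paper's
# exact formulas. Each example is (probs, correct_idx): the model's
# probability vector over candidates, plus which candidate was correct.

def rank_of_correct(probs, correct_idx):
    """0-based rank of the correct candidate by predicted probability."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order.index(correct_idx)

def error_entropy(examples):
    """Entropy of the rank distribution of the correct answer.
    Zero means the correct answer always lands at the same rank."""
    counts = Counter(rank_of_correct(p, c) for p, c in examples)
    n = len(examples)
    return -sum((k / n) * math.log(k / n) for k in counts.values())

def confidence(examples):
    """Average probability mass the model puts on its top-ranked choice."""
    return sum(max(p) for p, _ in examples) / len(examples)

def self_alignment(examples):
    """KL divergence between how often each rank is actually correct
    and the model's averaged, sorted probabilities: zero when stated
    confidence per rank matches reality."""
    n = len(examples)
    q = Counter(rank_of_correct(p, c) for p, c in examples)
    dim = len(examples[0][0])
    mean_sorted = [sum(sorted(p, reverse=True)[r] for p, _ in examples) / n
                   for r in range(dim)]
    return sum((q[r] / n) * math.log((q[r] / n) / mean_sorted[r]) for r in q)
```

On a toy set where the model always ranks the correct answer first, `error_entropy` is exactly zero, while `self_alignment` stays positive unless the top probability is exactly 1 — a small demonstration of how a model can "know what's right" and still be miscalibrated.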

The "Aha!" Moment: Why the Formula Broke

Here is the secret the paper reveals:

  • In Small Models: The "Error-Entropy" (the actual learning) was the only thing that mattered. It made up 90% of the total score. So, when the model grew, the whole score improved perfectly. The Magic Formula worked!
  • In Giant Models: As the models got huge, the "Error-Entropy" (learning) stopped being the only thing. The "Confidence" and "Self-Alignment" parts started taking up a bigger slice of the pie.
    • Since those other two parts don't follow the Magic Formula, they began to dominate and slow the improvement of the total score.
    • The Result: The total score stopped improving as fast as expected, not because the model stopped learning, but because the "noise" (confidence and alignment) became too loud.

The Solution: A New Compass

The authors propose we stop looking at the whole "Cross-Entropy" score and start focusing only on Error-Entropy.

  • Old Compass: "Look at the total mistake score." (Confusing and broken for big models).
  • New Compass: "Look only at the Error-Entropy." (This still follows the perfect, predictable path).
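If error-entropy really follows a clean power law E(N) ≈ a · N^(−b) in model size N, its path can be extrapolated with an ordinary log-log linear fit. A minimal sketch, using made-up measurements (the constants 2.0 and 0.3 are purely illustrative, not values from the paper):

```python
import math

def fit_power_law(sizes, values):
    """Fit values ≈ a * sizes**(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic "error-entropy" points generated with a = 2.0, b = 0.3.
sizes = [1e6, 1e7, 1e8, 1e9]
values = [2.0 * n ** -0.3 for n in sizes]
a, b = fit_power_law(sizes, values)  # recovers a ≈ 2.0, b ≈ 0.3
```

With `a` and `b` in hand, the error-entropy of a not-yet-trained larger model can be read off the fitted line, which is the "new compass" idea: extrapolate the part that scales, ignore the parts that don't.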

Why This Matters

  1. Better Training: If we know that "Confidence" is just noise, we can train models to ignore it and focus purely on getting the ranking right. This could make training more efficient.
  2. Understanding AI: It tells us that the "intelligence" of a model is about ranking the right answer, not just assigning high probability numbers.
  3. Future Predictions: We can now predict how big models will behave much more accurately by ignoring the parts that don't scale.

In Summary:
The paper says the "Magic Formula" didn't actually break; we just stopped looking at the right part of the equation. The "learning" part (Error-Entropy) is still scaling perfectly, but it's getting drowned out by the "confidence" part. If we focus on the learning, the path forward for AI becomes clear again.
