What Scales in Cross-Entropy Scaling Law?

This paper challenges the validity of the traditional cross-entropy scaling law at large scales. It introduces a decomposition showing that only the "error-entropy" component follows a robust power law while the other components remain scale-invariant, which explains the law's apparent breakdown and establishes error-entropy as a more reliable metric for guiding large language model development.

Junxi Yan, Zixi Wei, Qingyao Ai, Yiqun Liu, Jingtao Zhan

Published 2026-03-03

The Big Picture: The "Magic Formula" That Stopped Working

Imagine you are building a giant library of knowledge (a Large Language Model, or LLM). For years, scientists believed in a "Magic Formula" called the Cross-Entropy Scaling Law.

The Old Belief:
The formula said: "If you double the size of your brain (the model) and the amount of books you read (the data), your mistakes will drop by a predictable, steady amount." It was like a recipe that guaranteed better results the more ingredients you added.

The Problem:
Recently, scientists noticed something weird. When they built massive models (the super-giant ones), the "Magic Formula" stopped working. The models got better, but much slower than the recipe predicted. It was like adding more flour to a cake, but the cake stopped rising. Everyone was confused: Why did the rule break?

The Investigation: Taking the Cake Apart

The authors of this paper decided to investigate. They hypothesized that the "Cross-Entropy" score (the measure of mistakes) wasn't a single, solid thing. Instead, they thought it was a mixture of three different ingredients hiding inside.

To find out what was really happening, they invented a new way to look at how a model makes a mistake. Instead of just looking at the probability (a number between 0 and 1), they looked at the Rank (the position on a list).

  • Analogy: Imagine a teacher grading a test.
    • Old Way: The teacher looks at the score: "You got a 92%."
    • New Way (Rank-Based Error): The teacher looks at the ranking: "You got the 3rd best answer out of 100 options."
    • The authors argue that the ranking is a more honest measure of intelligence than the specific number score.
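To make the rank idea concrete, here is a minimal sketch (not the paper's code) of how one might turn a model's probability vector into the rank of the correct answer:

```python
def rank_of_correct(probs, correct_idx):
    """1-based rank of the correct candidate when all candidates
    are sorted from most to least probable."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order.index(correct_idx) + 1

# Toy example: 5 candidate answers, the correct one is index 4.
# Its probability (0.20) is only the 3rd highest, so its rank is 3.
probs = [0.40, 0.05, 0.25, 0.10, 0.20]
print(rank_of_correct(probs, correct_idx=4))  # → 3
```

The "old way" would report 0.20; the rank-based view reports "3rd best out of 5", which is the quantity the authors argue scales cleanly.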

The Three Ingredients (The Decomposition)

They mathematically split the "Mistake Score" into three distinct parts:

  1. Error-Entropy (The "Knowing What's Right" Part):

    • What it is: This measures how well the model knows the correct answer is at the top of the list.
    • The Metaphor: This is the student actually studying and learning the material. As the student gets smarter, they put the right answer at #1 more often.
    • The Finding: This part DOES follow the Magic Formula. It gets better and better as the model grows.
  2. Self-Alignment (The "Confidence Matching" Part):

    • What it is: This measures how well the model's internal confidence matches its actual ranking.
    • The Metaphor: This is the student calibrating how sure they sound. A well-aligned student says "I'm 100% sure!" only when they are actually right, and "I'm maybe 10% sure" when they are probably wrong.
    • The Finding: This part DOES NOT scale. It stays random and messy, regardless of how big the model gets.
  3. Confidence (The "Arrogance" Part):

    • What it is: This measures how high the probability scores are.
    • The Metaphor: This is the student shouting "I AM RIGHT!" very loudly. A model can be very loud (high confidence) even if it's not actually learning the material faster.
    • The Finding: This part DOES NOT scale. It just fluctuates.
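The paper's exact definitions are mathematical, but the flavor of the three ingredients can be sketched in code. The functions below are illustrative stand-ins, not the paper's formulas: error-entropy is estimated as the entropy of "where does the right answer land in the ranking?", self-alignment as the gap between that rank distribution and the model's own stated confidence per rank, and confidence as the average mass on the top-ranked choice.

```python
import math
from collections import Counter

# Illustrative stand-ins for the three ingredients -- NOT the paper's
# exact formulas. Each example is (probs, correct_idx): the model's
# probability vector over candidates, plus which candidate was correct.

def rank_of_correct(probs, correct_idx):
    """0-based rank of the correct candidate by predicted probability."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order.index(correct_idx)

def error_entropy(examples):
    """Entropy of the rank distribution of the correct answer.
    Zero means the correct answer always lands at the same rank."""
    counts = Counter(rank_of_correct(p, c) for p, c in examples)
    n = len(examples)
    return -sum((k / n) * math.log(k / n) for k in counts.values())

def confidence(examples):
    """Average probability mass the model puts on its top-ranked choice."""
    return sum(max(p) for p, _ in examples) / len(examples)

def self_alignment(examples):
    """KL divergence between how often each rank is actually correct
    and the model's averaged, sorted probabilities: zero when stated
    confidence per rank matches reality."""
    n = len(examples)
    q = Counter(rank_of_correct(p, c) for p, c in examples)
    dim = len(examples[0][0])
    mean_sorted = [sum(sorted(p, reverse=True)[r] for p, _ in examples) / n
                   for r in range(dim)]
    return sum((q[r] / n) * math.log((q[r] / n) / mean_sorted[r]) for r in q)
```

On a toy set where the model always ranks the correct answer first, `error_entropy` is exactly zero, while `self_alignment` stays positive unless the top probability is exactly 1 — a small demonstration of how a model can "know what's right" and still be miscalibrated.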

The "Aha!" Moment: Why the Formula Broke

Here is the secret the paper reveals:

  • In Small Models: The "Error-Entropy" (the actual learning) was the only thing that mattered. It made up 90% of the total score. So, when the model grew, the whole score improved perfectly. The Magic Formula worked!
  • In Giant Models: As the models got huge, the "Error-Entropy" (learning) stopped being the only thing. The "Confidence" and "Self-Alignment" parts started taking up a bigger slice of the pie.
    • Since those other two parts don't follow the Magic Formula, they began to dominate and slow the improvement of the total score.
    • The Result: The total score stopped improving as fast as expected, not because the model stopped learning, but because the "noise" (confidence and alignment) became too loud.

The Solution: A New Compass

The authors propose we stop looking at the whole "Cross-Entropy" score and start focusing only on Error-Entropy.

  • Old Compass: "Look at the total mistake score." (Confusing and broken for big models).
  • New Compass: "Look only at the Error-Entropy." (This still follows the perfect, predictable path).
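If error-entropy really follows a clean power law E(N) ≈ a · N^(−b) in model size N, its path can be extrapolated with an ordinary log-log linear fit. A minimal sketch, using made-up measurements (the constants 2.0 and 0.3 are purely illustrative, not values from the paper):

```python
import math

def fit_power_law(sizes, values):
    """Fit values ≈ a * sizes**(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic "error-entropy" points generated with a = 2.0, b = 0.3.
sizes = [1e6, 1e7, 1e8, 1e9]
values = [2.0 * n ** -0.3 for n in sizes]
a, b = fit_power_law(sizes, values)  # recovers a ≈ 2.0, b ≈ 0.3
```

With `a` and `b` in hand, the error-entropy of a not-yet-trained larger model can be read off the fitted line, which is the "new compass" idea: extrapolate the part that scales, ignore the parts that don't.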

Why This Matters

  1. Better Training: If we know that "Confidence" is just noise, we can train models to ignore it and focus purely on getting the ranking right. This could make training more efficient.
  2. Understanding AI: It tells us that the "intelligence" of a model is about ranking the right answer, not just assigning high probability numbers.
  3. Future Predictions: We can now predict how big models will behave much more accurately by ignoring the parts that don't scale.

In Summary:
The paper says the "Magic Formula" didn't actually break; we just stopped looking at the right part of the equation. The "learning" part (Error-Entropy) is still scaling perfectly, but it's getting drowned out by the "confidence" part. If we focus on the learning, the path forward for AI becomes clear again.
