Original authors: Konstantin Nikolaou, Jonas Scheunemann, Sven Krippendorf, Samuel Tovey, Christian Holm

Published 2026-06-01

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Why Bigger Models Learn Better

Imagine you are trying to learn a new language.

Small models are like students who only learn the most obvious, common words (like "hello," "cat," "run"). Once they know these, they stop improving because they can't understand the complex grammar or rare idioms.
Large models are like students who not only know the common words but also keep digging deeper to learn obscure vocabulary, complex sentence structures, and subtle nuances.

This paper asks: Why do larger models keep learning while smaller ones stop?

The authors found that larger models have a special ability they call "Spectral Reach." It's like having a longer ladder down into a deep well. Small models can only descend the first few rungs (the easy, obvious patterns), while large models can climb all the way down to the bottom rungs (the tiny, hidden, difficult patterns) and keep improving.

The Core Concept: The "Spectral Tail"

To understand this, imagine the learning process as a giant library of books, where each book represents a different pattern in the data.

The Bestsellers (The Head): These are the popular, easy-to-learn patterns. They are loud, clear, and easy to hear. Every model, big or small, learns these first.
The Obscure Archives (The Tail): These are the quiet, faint, and difficult patterns. They are buried deep in the library.

The Problem: As a model trains, it finishes reading the "Bestsellers" first. Once it's done, it needs to move to the "Archives" to keep getting better.

Small models hit a wall. They run out of "brainpower" to read the faint books in the archives. They get stuck.
Large models have a "super-ear." They can hear the faint whispers in the archives. They keep reading, learning the subtle details that others miss. This ability to reach deep into the "spectral tail" is Spectral Reach.

The New Tool: The "Spectral Position" Meter

The authors invented a new tool called Spectral Position (or $\chi_{pos}$). Think of this as a GPS tracker for the model's learning journey.

High GPS Value (Close to 1): The model is currently reading the "Bestsellers." It's learning the big, easy patterns.
Low GPS Value (Close to 0): The model has moved deep into the "Archives." It is now learning the tiny, difficult patterns.

What they found:

Over Time: As training goes on, the GPS value drops. The model naturally moves from easy patterns to hard ones.
The Size Difference: Bigger models reach much lower GPS values than smaller models. They go deeper into the archives. This explains why they end up with lower errors (better performance): they learned more of the hidden details.

The Secret Ingredient: Feature Learning

You might ask, "Why can big models hear the faint whispers?"

The paper tested this by freezing the "brain" of a model (preventing it from changing its internal features) and only letting the final layer learn.

Frozen Models: These models stopped learning early. They couldn't reach the deep archives.
Active Models: These models kept changing their internal "features" (how they see the world), and they kept moving deeper into the archives.

The Analogy: Imagine trying to listen to a faint radio station.

A frozen model is like a radio with a broken antenna. No matter how much you turn the volume up, you can't hear the faint station.
A learning model is like a radio that builds a better antenna while you are listening. As it learns, it reshapes its internal structure to amplify those faint signals. This "antenna building" (feature learning) allows the model to sustain its progress even when the signals get very weak.

The "LNP" Decomposition: Breaking Down the Math

The authors created a formula to measure this without building the full kernel, a calculation that is infeasible for large models. They broke the learning process into three parts, like a recipe:

Loss Scale ($\chi_{loss}$): How "loud" the mistake is right now. (The larger the prediction errors, the higher this is.)
Network Scale ($\chi_{net}$): How sensitive the model's outputs are to changes in its parameters. (Big models can build stronger "antennas" here.)
Spectral Position ($\chi_{pos}$): The GPS value. Where in the library is the model reading?

The Magic: They found that as the model gets deeper into the "Archives" (Spectral Position drops), the "Network Scale" (the antenna strength) actually increases in big models. This extra strength compensates for the faintness of the signals, allowing the model to keep learning. Small models don't get this boost, so they give up.

Summary of Findings

Learning is a journey: Models start with easy patterns and slowly move to hard, fine-grained details.
Size matters: Bigger models can go further into the "hard details" (the spectral tail) than smaller ones.
Adaptability is key: This ability isn't just about having more memory; it's about the model actively reshaping itself (feature learning) to amplify weak signals.
The Metric: The new "Spectral Position" tool allows scientists to watch this journey during training, even for massive models, with modest extra compute and without constructing the full kernel.

In short, bigger models win because they don't stop learning when the easy stuff is done; they have the "reach" to keep digging for the hidden gems that smaller models can't find.

Technical Summary: Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail

Problem Statement

Neural scaling laws describe predictable power-law relationships between model size, dataset size, compute, and performance, serving as a cornerstone for developing modern foundation models. However, the mechanisms underpinning these laws remain poorly understood. Existing theoretical explanations often rely on idealized assumptions (e.g., random feature models with frozen representations) or require kernel computations that are infeasible at the scales where scaling laws are observed. Consequently, there is a lack of scalable analysis tools to reveal the underlying spectral dynamics of large-scale training, leaving open the question of how scaling laws emerge in practical deep learning scenarios.

Methodology

To address the measurement bottleneck, the authors introduce the Loss-Network-Position (LNP) decomposition. This framework factors the instantaneous (linearized) loss change into three interpretable components:

Network Scale ($\chi_{net}$): The squared Frobenius norm of the Jacobian of network outputs with respect to parameters ($\|\nabla_\theta f\|_F^2$), equivalent to the trace of the empirical Neural Tangent Kernel (eNTK). It captures the network's sensitivity to parameter updates.
Loss Scale ($\chi_{loss}$): The squared Euclidean norm of the loss gradient with respect to network outputs ($\|\nabla_f L\|_2^2$), reflecting the magnitude of prediction errors.
Spectral Position ($\chi_{pos}$): A scale-free quantity in the range $[0, 1]$ that indicates which eigenvalues of the eNTK currently drive loss reduction. It is defined as the weighted average of normalized eigenvalues, where weights are determined by the projection of the loss gradient onto the eNTK eigenmodes.
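
One way to write these three quantities down explicitly is sketched here. It is reconstructed from the verbal definitions above rather than copied from the paper, and the symbols $J$, $K$, $g$, $\lambda_i$, and $v_i$ are notation assumed for illustration.

```latex
% Assumed notation: J = \nabla_\theta f (Jacobian), K = J J^\top (eNTK),
% g = \nabla_f L, and (\lambda_i, v_i) the orthonormal eigenpairs of K.
\begin{aligned}
\chi_{net}  &= \|J\|_F^2 = \operatorname{tr}(K) = \sum_i \lambda_i, \\
\chi_{loss} &= \|g\|_2^2 = \sum_i (v_i^\top g)^2, \\
\chi_{pos}  &= \sum_i \frac{\lambda_i}{\sum_j \lambda_j} \cdot \frac{(v_i^\top g)^2}{\|g\|_2^2} \in [0, 1].
\end{aligned}
```

Under this reading, $\chi_{pos}$ is largest when the loss gradient lies along the top eigenvalue directions, and it shrinks as the gradient concentrates on small-eigenvalue modes.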

Key Innovation: Computing $\chi_{pos}$ directly from its definition would require constructing the full eNTK, which is expensive. The LNP decomposition instead yields it indirectly via the ratio $\chi_{pos} = \delta L / (\chi_{net} \cdot \chi_{loss})$, where $\delta L$ is the linearized loss change. This enables measurement alongside training with minimal computational overhead (less than 2×) using per-sample gradient magnitudes, avoiding explicit kernel construction.
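
The ratio can be checked on a toy problem. The sketch below is a minimal illustration, not the authors' implementation. It assumes $\delta L$ means the magnitude of the linearized loss change per unit learning rate, which by the chain rule equals the squared norm of the parameter gradient $\|J^\top g\|^2$. The function names and the tiny Jacobian are hypothetical. The kernel-based version is included only to confirm that the two routes agree.

```python
# Toy check: spectral position from gradient norms vs. from the explicit kernel.
# Assumption (not taken from the paper's code): delta_L is the magnitude of the
# linearized loss change per unit learning rate, g^T K g = ||J^T g||^2.

def chi_pos_from_gradients(J, g):
    """J: Jacobian as a list of rows (outputs x params); g: dL/df per output."""
    n_out, n_par = len(J), len(J[0])
    chi_net = sum(J[i][p] ** 2 for i in range(n_out) for p in range(n_par))
    chi_loss = sum(gi ** 2 for gi in g)
    # Chain rule: the parameter gradient is J^T g; no kernel is formed.
    grad_theta = [sum(J[i][p] * g[i] for i in range(n_out)) for p in range(n_par)]
    delta_L = sum(x ** 2 for x in grad_theta)
    return delta_L / (chi_net * chi_loss)


def chi_pos_from_kernel(J, g):
    """Reference route: build K = J J^T explicitly (infeasible at scale)."""
    n_out, n_par = len(J), len(J[0])
    K = [[sum(J[i][p] * J[j][p] for p in range(n_par)) for j in range(n_out)]
         for i in range(n_out)]
    trace = sum(K[i][i] for i in range(n_out))
    quad = sum(g[i] * K[i][j] * g[j] for i in range(n_out) for j in range(n_out))
    return quad / (trace * sum(gi ** 2 for gi in g))


# Hypothetical 2-output, 3-parameter example.
J_example = [[2.0, 0.0, 1.0],
             [0.0, 1.0, 0.5]]
g_example = [1.0, -1.0]
ratio_value = chi_pos_from_gradients(J_example, g_example)
kernel_value = chi_pos_from_kernel(J_example, g_example)
```

For the diagonal Jacobian with entries 2 and 1, the kernel eigenvalues are 4 and 1, so a gradient along the first output gives 4/5 = 0.8 and a gradient along the second gives 1/5 = 0.2, the two normalized eigenvalues.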

The authors validate this framework on controlled Random Feature Models (RFMs) with power-law data spectra, where theoretical predictions match empirical measurements. They then apply the diagnostic to scaling experiments involving Llama 2 language models on SimpleStories and CIFAR-5M, as well as Vision Transformers on CIFAR-5M.

Key Contributions and Results

1. Spectral Position Decreases During Training

The authors observe that as training progresses, the spectral position $\chi_{pos}$ decreases by orders of magnitude. This indicates a systematic shift in learning dynamics: the model initially learns from dominant, high-eigenvalue modes (coarse patterns) and progressively shifts focus toward the spectral tail (fine-grained details) as the dominant modes converge and cease to contribute to the loss gradient.
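
The shift from dominant modes to the tail can be reproduced in a toy setting. The sketch below is a hypothetical illustration, not the paper's experiment. It assumes a fixed kernel with a power-law spectrum, a squared loss, and plain gradient descent, so that in the kernel eigenbasis each residual component shrinks by a factor $(1 - \eta \lambda_i)$ per step. The exponent, learning rate, and mode count are arbitrary choices.

```python
# Toy gradient descent in the kernel eigenbasis (hypothetical setup).
# With squared loss, the loss gradient w.r.t. outputs equals the residual.

def spectral_position(eigvals, residual):
    """Weighted average of normalized eigenvalues, weights = residual^2."""
    total = sum(eigvals)
    weight = sum(r * r for r in residual)
    return sum(l * r * r for l, r in zip(eigvals, residual)) / (total * weight)


def simulate(num_modes=50, alpha=1.5, lr=0.5, steps=200):
    # Power-law spectrum: lambda_k = k^(-alpha), k = 1..num_modes.
    eigvals = [(k + 1) ** -alpha for k in range(num_modes)]
    residual = [1.0] * num_modes  # equal initial error in every mode
    history = []
    for _ in range(steps):
        history.append(spectral_position(eigvals, residual))
        # Large-eigenvalue modes converge fastest and drop out of the gradient.
        residual = [r * (1.0 - lr * l) for l, r in zip(eigvals, residual)]
    return history


history = simulate()
print("initial chi_pos:", history[0])
print("final chi_pos:  ", history[-1])
```

In this fixed-kernel toy the spectral position can only fall, because each step down-weights the modes with the largest eigenvalues the most. It says nothing about feature learning, where the kernel itself changes during training.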

2. Definition of "Spectral Reach"

The paper introduces Spectral Reach as a model's capacity to learn from progressively smaller eigenvalue modes of the eNTK spectrum.

Observation: Larger models achieve lower final values of $\chi_{pos}$ than smaller models.
Interpretation: Smaller models "flatten out," reaching a capacity limit where they can no longer access finer-grained spectral modes. Larger models sustain the downward trajectory, accessing weak spectral signals inaccessible to smaller models. This suggests that larger models achieve lower losses because they can continue refining fine-grained details that smaller models cannot resolve.

3. The Role of Feature Learning

Through linear probing experiments (comparing pre-trained backbones against random, frozen backbones), the authors identify feature learning as a key enabler of spectral reach.

Mechanism: In models with frozen representations (random backbones), $\chi_{net}$ remains constant, and spectral position plateaus. In contrast, feature-learning models exhibit an adaptive increase in $\chi_{net}$ (gradient magnitudes) as training advances.
Compensation: This increase in $\chi_{net}$ acts as a counterweight to the decreasing $\chi_{pos}$. While $\chi_{pos}$ drops (indicating learning from weaker signals), the growing $\chi_{net}$ amplifies gradient magnitudes, sustaining learning progress where frozen representations would stall. This demonstrates that learned representations reshape the eNTK spectrum to support continued descent into the spectral tail.
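
The compensation argument can be read off the decomposition by taking logarithms. The lines below are a sketch that assumes $\delta L$ denotes the magnitude of the linearized loss change. The numbers in the comment are illustrative only and are not results from the paper.

```latex
% Assumption: \delta L is the magnitude of the linearized loss change.
\delta L = \chi_{loss} \, \chi_{net} \, \chi_{pos}
\quad \Longrightarrow \quad
\log \delta L = \log \chi_{loss} + \log \chi_{net} + \log \chi_{pos}.
% Illustrative arithmetic: if \chi_{pos} falls by a factor of 100 while
% \chi_{net} grows by a factor of 10, the product \chi_{net} \chi_{pos}
% falls by only a factor of 10 at fixed \chi_{loss}.
```

With a frozen backbone the $\log \chi_{net}$ term stays constant, so any drop in $\log \chi_{pos}$ passes through to the loss change in full.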

4. Validation Across Architectures and Parameterizations

The findings generalize across language models (Llama 2) and vision models (Vision Transformers). Crucially, the authors replicate experiments under maximal-update parameterization (muP), which holds feature-learning intensity constant across different widths. The persistence of the spectral reach ordering under muP confirms that the phenomenon is driven by model capacity rather than width-dependent feature-learning intensity.

Significance and Claims

The paper claims to provide a scalable diagnostic tool that bridges the gap between theoretical spectral explanations of scaling laws and practical deep learning. By demonstrating that larger models achieve lower losses by sustaining learning on weak spectral signals via feature learning, the work offers a mechanistic explanation for neural scaling.

The authors position their findings as a reframing of the optimization question: rather than simply asking "how do we reduce loss?", the focus shifts to "how do we enhance spectral reach?" This perspective suggests concrete avenues for intervention, such as:

Accelerating spectral descent: Through optimizer design (e.g., targeted learning rates, gradient scaling).
Reshaping the spectrum: Through architectural choices or initialization schemes (e.g., muP, He, Xavier) to make subordinate modes more accessible.

The paper concludes modestly, noting that while the LNP decomposition captures first-order effects and exact instantaneous properties, the non-linear correction terms remain unanalyzed. Furthermore, while the results connect spectral position to scale and performance, the causal mechanisms regarding how feature learning specifically restructures the eNTK spectrum require further controlled interventions to be definitively established. The work serves as a foundation for future mode-level analysis of semantic structure and paradigm transitions in training.
