Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement

This paper proposes a computationally efficient, performance-independent metric grounded in low-rank bias to measure dynamical richness in neural networks, enabling the analysis of training factors and their relationship to representation learning without relying on predictive accuracy.

Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, Niclas Goring, Ouns El Harzli, Abdurrahman Hadi Erturk, Soufiane Hayou, Ard A. Louis

Published 2026-03-03

Imagine you are trying to teach a robot to recognize cats and dogs. You have two main ways to look at how the robot learns:

  1. The "Scorecard" View: Did it get the right answer? (High accuracy = Good learning).
  2. The "Brain Structure" View: How did it rewire its internal connections to get there? (Did it simplify its thinking, or did it just memorize every single detail?).

For a long time, scientists assumed that if the robot got a high score, it must have developed a "rich" and efficient brain structure. This paper argues that assumption is wrong. You can get a high score by memorizing (lazy learning), or you can get a low score by overthinking (rich learning).

The authors introduce a new tool called DLR (Dynamical Low-Rank measure) to measure how the robot thinks, completely ignoring whether it got the answer right or wrong.

Here is a breakdown of their ideas using simple analogies:

1. The Problem: The "Rich" vs. "Lazy" Trap

In machine learning, there are two modes of learning:

  • Lazy Training: The robot barely changes its internal brain. It just tweaks the final "decision button" (the last layer) to fit the data. It's like a student who doesn't study the textbook but just memorizes the answer key for the specific test questions.
  • Rich Training: The robot fundamentally reorganizes its internal features. It learns the concept of a cat or dog. It's like a student who actually reads the book, understands the biology, and can identify a cat even if it's wearing a hat.

The Catch: Usually, we think "Rich = Good." But this paper shows that sometimes, being "Rich" (overthinking) makes you perform worse on a specific test, while being "Lazy" (memorizing the right features) makes you perform better.
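The lazy-vs-rich split above can be sketched with a toy two-layer network: "lazy" training freezes the hidden layer and only fits the final readout, while "rich" training updates everything, so the internal features actually move. This tiny numpy setup is purely illustrative — it is not the paper's experimental setup, and the architecture and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                     # toy inputs
y = np.sign(X[:, 0] + X[:, 1])[:, None]          # simple target

def train(lazy: bool, steps=200, lr=0.05):
    """Tiny 2-layer net. 'lazy' freezes the hidden layer (only the
    readout learns); 'rich' updates everything. Illustrative only."""
    W1 = rng.normal(size=(8, 16)) / np.sqrt(8)   # hidden layer ("the brain")
    W2 = rng.normal(size=(16, 1)) / np.sqrt(16)  # readout ("the decision button")
    H0 = np.tanh(X @ W1)                         # features at initialization
    for _ in range(steps):
        H = np.tanh(X @ W1)
        err = H @ W2 - y                         # MSE gradient pieces
        if not lazy:
            gH = err @ W2.T * (1 - H**2)         # backprop through tanh
            W1 -= lr * (X.T @ gH / len(X))
        W2 -= lr * (H.T @ err / len(X))
    # How far the internal features moved from their starting point:
    return np.linalg.norm(np.tanh(X @ W1) - H0)

lazy_shift = train(lazy=True)    # features never move: shift is exactly 0
rich_shift = train(lazy=False)   # features reorganize: shift is positive
```

The feature-shift norm at the end is the classic "distance from initialization" signal: zero for lazy training, positive for rich training.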

2. The Solution: A New "Brain Scan" (DLR)

Previously, to measure if a robot was learning "richly," scientists had to look at how much the robot's brain changed compared to its starting point, or how complex its math was. These methods were slow, expensive, and often entangled with the robot's final score.

The authors created DLR, a new metric that acts like a structural MRI scan of the robot's brain.

  • How it works: It looks at the "features" (the internal signals) the robot uses right before making a decision.
  • The Analogy: Imagine a chef making a soup.
    • Rich Dynamics (Low DLR): The chef uses only 3 essential ingredients (e.g., salt, pepper, tomato) to create a complex flavor. The recipe is simple, efficient, and focused.
    • Lazy Dynamics (High DLR): The chef dumps in 50 different ingredients, hoping the flavor works out. The recipe is messy and unfocused.
  • The Magic: DLR measures how many ingredients are actually doing the work. It doesn't care if the soup tastes good (accuracy); it only cares if the recipe is efficient (richness).
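The paper's exact DLR formula isn't reproduced here, but the "how many ingredients are actually doing the work" idea can be sketched with a standard effective-rank computation over the features right before the decision. The entropy-based `effective_rank` helper below is a common low-rank measure in the same spirit, not the authors' definition.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank via the entropy of the normalized singular values.

    `features` is an (n_samples, n_features) matrix of activations from
    the layer just before the decision. exp(entropy) roughly counts how
    many directions ("ingredients") carry real weight.
    """
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()                      # normalize singular values
    p = p[p > 0]                         # drop exact zeros for the log
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
# "3 essential ingredients": a rank-1 feature matrix (very focused)
focused = np.outer(rng.normal(size=100), rng.normal(size=50))
# "50 ingredients dumped in": a full-rank random feature matrix
scattered = rng.normal(size=(100, 50))
```

Here `effective_rank(focused)` comes out near 1, while `effective_rank(scattered)` is far larger — the metric counts recipe complexity, and never looks at whether the soup (the prediction) was any good.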

3. Key Discoveries (The "Aha!" Moments)

A. Richness ≠ Success
The authors ran an experiment where they gave the robot a "trick" test.

  • Scenario: They trained the robot on pictures where the correct label ("cat" or "dog") was secretly encoded in the first 10 pixels of each image, alongside the real image content.
  • Result:
    • The Rich robot (which tried to understand the whole picture) got confused by the hidden labels and failed the test.
    • The Lazy robot (which just looked at the specific pixels where the labels were) got a perfect score.
  • Lesson: Being "smart" (rich dynamics) doesn't always mean you will win the game. Sometimes, a simple, focused approach wins.

B. The "Grokking" Mystery
"Grokking" is a phenomenon where a robot suddenly goes from failing a math problem to solving it perfectly after a long time of training.

  • Using DLR, the authors showed that this sudden jump happens exactly when the robot switches from "Lazy" (memorizing) to "Rich" (understanding the pattern).
  • This shows that DLR can detect when a robot is truly learning, even before the test scores improve.

C. The Secret Sauce: Batch Normalization
They tested a common tool called "Batch Normalization" (a technique to stabilize training).

  • Without it: The robot was "Lazy" and performed poorly.
  • With it: The robot became "Rich" and performed much better.
  • Why it matters: This helps explain why this tool works. It forces the robot to reorganize its brain into a more efficient, rich structure.

4. The Visualization: Seeing the Invisible

To make this easier to understand, the authors created a visual tool. Imagine a graph showing the "importance" of every single neuron in the robot's brain.

  • In a Rich Robot: The graph looks like a steep mountain. Only the top 10 neurons are huge; the rest are tiny. The robot is focused.
  • In a Lazy Robot: The graph looks like a gentle hill. Hundreds of neurons are all slightly active. The robot is scattered and unfocused.
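The steep-mountain vs. gentle-hill picture can be sketched by asking how much "importance" the top few directions carry in each profile. The two spectra below are hypothetical stand-ins (a heavy-tailed Pareto profile for "rich," a near-flat uniform profile for "lazy"), not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "neuron importance" profiles, sorted largest first:
rich = np.sort(rng.pareto(2.0, 200))[::-1]         # steep mountain: a few dominate
lazy = np.sort(rng.uniform(0.5, 1.0, 200))[::-1]   # gentle hill: all comparable

def top_k_share(spectrum, k=10):
    """Fraction of total 'importance' carried by the k largest values."""
    s = np.sort(spectrum)[::-1]
    return s[:k].sum() / s.sum()
```

In the "rich" profile the top 10 directions carry a large chunk of the total importance; in the "lazy" profile, 10 out of 200 near-equal values carry only about 5% — which is exactly the focused-vs-scattered contrast the visualization shows.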

Summary

This paper gives us a new way to look at AI. Instead of just asking, "Did it get the answer right?" we can now ask, "How efficiently did it think?"

  • Old Way: Check the test score.
  • New Way (DLR): Check the "recipe" the AI used.

This tool helps researchers understand why some AI models learn fast, why some get stuck, and how to build robots that don't just memorize, but actually understand. It separates the "richness" of the learning process from the "score" of the result, showing us that sometimes, the most efficient path to a solution isn't the one that looks the smartest on paper.
