The Big Problem: How "Complicated" is Your Model?

Imagine you are a chef trying to judge how complex a recipe is.

The Old Way: You might just count the number of ingredients (parameters). But a recipe with 50 spices might actually be a simple dish if all the spices taste the same. Conversely, a recipe with only 3 ingredients could be incredibly complex if the chef has to juggle them in a very specific, delicate way.
The Current Mess: In machine learning, scientists have tried to measure "complexity" using things like the number of parameters, the "Vapnik-Chervonenkis dimension" (a very hard math concept), or "effective degrees of freedom." The problem is that these methods are either too rough (like just counting ingredients) or so hard to calculate that they are useless in practice.

The authors of this paper, Oskar Allerbo and Thomas B. Schön, want to fix this. They propose a new, easy-to-calculate, and mathematically solid way to measure complexity called Gradient Alignment Complexity (GAC).

The New Idea: The "Dance Floor" Analogy

To understand GAC, imagine the model is a dancer, and the "gradients" are the directions the dancer is facing when they move.

The Setup: The model looks at different inputs (different songs on the dance floor). For every song, the model has a specific "direction" it wants to move in to learn the data.
Simple Model (Low Complexity): If the model is very simple, it reacts to every song in the exact same way. It faces the same direction no matter what music plays. All its "dance moves" are perfectly aligned. It has very little freedom.
- Analogy: A robot that only knows one dance move. No matter the song, it does the same thing. It's simple, but not very flexible.
Complex Model (High Complexity): If the model is very complex, it reacts differently to every song. For one song, it faces North; for another, it faces South; for a third, it spins wildly. Its "dance moves" are all over the place and point in totally different directions.
- Analogy: A jazz improviser who changes their style completely for every note. They have total freedom to move anywhere.

The GAC Measure: The authors simply measure how much these "dance moves" (gradients) align with each other.

If they all point the same way (high alignment) $\rightarrow$ Low Complexity.
If they point in random, independent directions (low alignment) $\rightarrow$ High Complexity.

Why This is a Big Deal

The paper claims this new measure is special for three main reasons:

It Works for Everyone: Whether you are using a simple polynomial equation, a decision tree, a random forest, or a neural network, this measure works. It doesn't care what "flavor" of model you are using.
It Measures the "Machine," Not Just the "Output": Sometimes a complex machine (like a super-computer) is used to do a very simple task (like adding 2+2). Old measures might say the machine is simple because the result is simple. The GAC looks at the machine itself. It says, "Hey, even though you're doing a simple task right now, you have the potential to do very complex things because your internal parts are so flexible."
It Generalizes Old Rules: The authors prove that, for specific models, their new measure moves in step with the old, familiar complexity knobs:
- For Polynomials, it acts like the "degree" (how high the power goes).
- For Decision Trees, it acts like the "number of splits" (how many branches).
- For Random Forests, it acts like the "number of trees."
- For K-Nearest Neighbors, it tracks the "number of neighbors," but in reverse: more neighbors means a smoother, simpler model.

Solving the "Double Descent" Mystery

There is a famous phenomenon in AI called Double Descent. Classically, as you make a model more complex, its test error first falls and then rises as the model starts to overfit. Double descent is the surprise that follows: make the model more complex still, and the error falls a second time.

Scientists have been arguing about why this happens. Some say it's because the model is getting too big; others say it's an illusion caused by how we measure complexity.

The authors used their new GAC measure to re-test these experiments:

For "Static" Models: (Models where the structure doesn't change during training, like Random Forests or Random Fourier Features). The GAC confirmed that Double Descent is real. As you add more trees or features, the complexity goes up, and the "second descent" (getting better again) happens exactly when the complexity hits a certain point.
For "Dynamic" Models: (Models like Neural Networks where the features change as they learn). The authors found that Double Descent often disappears when measured with GAC. Why? Because as these models get bigger, they actually become less complex in terms of how they align their gradients. They learn to adapt so well that they stop using their full "complexity potential."

The Takeaway

The authors have built a new "ruler" for measuring machine learning models.

Old Rulers: Were either too blunt (counting parts) or too hard to use (requiring impossible math).
The New GAC Ruler: Looks at how the model's internal "muscles" (gradients) move together. If they move in lockstep, the model is simple. If they move independently, the model is complex.

This tool helps scientists understand why models behave the way they do, particularly the confusing "Double Descent" curve, by providing a clear, consistent definition of what "complexity" actually means across different types of AI.

Technical Summary: A Rigorous, Tractable Measure of Model Complexity

Problem Statement

Accurate assessment of model complexity is fundamental to machine learning tasks such as interpretation, generalization, and model selection. However, existing measures suffer from significant limitations:

Heuristic Approaches: Simple metrics like parameter counts or magnitudes provide crude estimates that fail to capture the true capacity of a model.
Model-Specific Hyperparameters: Measures like polynomial degree or kernel length scale do not generalize across different model classes.
Computational Intractability: Rigorous theoretical measures, such as the Vapnik-Chervonenkis dimension (VCD) and Rademacher complexity (RMC), are often impossible to compute in practice.
Function vs. Model Complexity: There is a critical, often overlooked distinction between the complexity of a specific learned function (e.g., Effective Number of Parameters, ENP) and the complexity of the model class itself. A complex model can generate a simple function (e.g., by setting parameters to zero), yet standard metrics often conflate the two.

Furthermore, the lack of a universally accepted, computable complexity measure complicates the interpretation of the "double descent" phenomenon, where generalization error, after rising near the interpolation threshold, decreases again as model complexity increases beyond it.

Methodology

The authors propose the Gradient Alignment Complexity (GAC), a model-agnostic measure based on the alignment of model gradients across different inputs.

Definition

For a parametric model $\hat{f}(x, \hat{\theta})$ with parameters $\hat{\theta} \in \mathbb{R}^p$, let $\phi(x, \hat{\theta}) = \nabla_{\hat{\theta}} \hat{f}(x, \hat{\theta})$ denote the gradient with respect to the parameters at input $x$. The GAC, denoted $K(\hat{f})$, is defined as:

$K(\hat{f}) := 1 - \mathbb{E}_{x,x'} \left[ \left( \frac{\phi(x, \hat{\theta})^\top \phi(x', \hat{\theta})}{\|\phi(x, \hat{\theta})\| \cdot \|\phi(x', \hat{\theta})\|} \right)^2 \right]$

This formulation uses the squared cosine similarity between the gradients at two distinct inputs $x$ and $x'$.

Interpretation: The term inside the expectation represents the squared cosine of the angle between gradients. If gradients are highly aligned (parallel), the model has less freedom to fit diverse data patterns, indicating lower complexity. If gradients are orthogonal (independent), the model is highly flexible.
Generalization: For multivariate outputs (e.g., classification), the dot product is replaced by the Frobenius inner product of the Jacobians.
Empirical Calculation: For a dataset $\{x_i\}_{i=1}^n$, the expectation is replaced by a sample average over pairs $i \neq j$.
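The empirical calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`squared_cosine`, `empirical_gac`, `empirical_gac_multi`) are hypothetical, the gradients are assumed to be precomputed and passed in as lists of floats, and the multi-output variant assumes that the normalization uses Frobenius norms to match the Frobenius inner product.

```python
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def squared_cosine(u: Vector, v: Vector) -> float:
    """Squared cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    if norm_u_sq == 0.0 or norm_v_sq == 0.0:
        raise ValueError("cosine is undefined for a zero gradient")
    return dot * dot / (norm_u_sq * norm_v_sq)


def empirical_gac(grads: Sequence[Vector]) -> float:
    """One minus the mean squared cosine over all pairs i != j."""
    if len(grads) < 2:
        raise ValueError("need at least two inputs to form a pair")
    # The squared cosine is symmetric, so averaging over unordered pairs
    # gives the same value as averaging over ordered pairs with i != j.
    pairs = list(combinations(grads, 2))
    return 1.0 - sum(squared_cosine(u, v) for u, v in pairs) / len(pairs)


def empirical_gac_multi(jacobians: Sequence[Sequence[Vector]]) -> float:
    """Multi-output case: one (outputs x parameters) Jacobian per input."""
    # Flattening each Jacobian turns the Frobenius inner product into an
    # ordinary dot product (assumed normalization: Frobenius norms).
    flat: List[List[float]] = [[e for row in jac for e in row] for jac in jacobians]
    return empirical_gac(flat)


# Gradients that all lie on one line (sign does not matter after squaring).
aligned = [[1.0, 0.0], [2.0, 0.0], [-3.0, 0.0]]
# Gradients that are mutually orthogonal.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(empirical_gac(aligned))     # 0.0
print(empirical_gac(orthogonal))  # 1.0
```

The two toy inputs mark the extremes of the measure: perfectly aligned gradients give a GAC of 0, and mutually orthogonal gradients give a GAC of 1.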

Theoretical Connections

The authors relate GAC to two kernel-based quantities:

Normalized Linear Entropy: The GAC equals the normalized linear entropy of the normalized Neural Tangent Kernel (NTK) matrix.
NTK Similarity: It measures the similarity introduced by the model's kernel; higher similarity implies a simpler model.
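The link to linear entropy can be checked with a short calculation. The sketch below is not taken from the paper: it assumes the standard definition of normalized linear entropy for a trace-one $n \times n$ matrix $\rho$, namely $\frac{n}{n-1}\left(1 - \operatorname{tr}(\rho^2)\right)$, and it introduces the shorthand $\phi_i = \phi(x_i, \hat{\theta})$, $\tilde{K}$ for the normalized NTK matrix, and $\rho = \tilde{K}/n$ purely for illustration.

```latex
\begin{align*}
\tilde{K}_{ij} &= \frac{\phi_i^\top \phi_j}{\|\phi_i\| \cdot \|\phi_j\|},
\qquad \tilde{K}_{ii} = 1,
\qquad \rho = \frac{1}{n}\tilde{K},
\qquad \operatorname{tr}(\rho) = 1, \\
\operatorname{tr}(\rho^2)
  &= \frac{1}{n^2} \sum_{i,j} \tilde{K}_{ij}^2
   = \frac{1}{n} + \frac{1}{n^2} \sum_{i \neq j} \tilde{K}_{ij}^2, \\
\frac{n}{n-1}\left(1 - \operatorname{tr}(\rho^2)\right)
  &= \frac{n}{n-1}\left(\frac{n-1}{n} - \frac{1}{n^2} \sum_{i \neq j} \tilde{K}_{ij}^2\right)
   = 1 - \frac{1}{n(n-1)} \sum_{i \neq j} \tilde{K}_{ij}^2 .
\end{align*}
```

Under these assumptions, the last expression is exactly the empirical GAC: one minus the average squared cosine over the $n(n-1)$ ordered pairs with $i \neq j$.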

Crucially, for constant-feature models (where $\hat{f}(x, \hat{\theta}) = \hat{\theta}^\top \phi(x)$ and $\phi(x)$ does not depend on $\hat{\theta}$), the GAC depends only on the feature expansion $\phi(x)$, not the learned parameters. Thus, it measures model complexity rather than function complexity. For non-constant feature models (e.g., deep neural networks), the GAC can be aggregated over training steps weighted by loss reduction.
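The difference between constant and non-constant features can be illustrated with a toy comparison. This is a hedged sketch, not an experiment from the paper: `polynomial_features` and `tanh_unit_gradient` are hypothetical helpers, and a single unit $a \tanh(wx + b)$ stands in for a neural network. For the polynomial model the gradient is just $\phi(x)$, so no parameter value enters the calculation; for the tanh unit the gradient depends on $(a, w, b)$, so the GAC changes with the parameters.

```python
import math
from itertools import combinations
from typing import List, Sequence


def squared_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    if norm_u_sq == 0.0 or norm_v_sq == 0.0:
        raise ValueError("cosine is undefined for a zero gradient")
    return dot * dot / (norm_u_sq * norm_v_sq)


def empirical_gac(grads: Sequence[Sequence[float]]) -> float:
    """One minus the mean squared cosine over all pairs of distinct inputs."""
    pairs = list(combinations(grads, 2))
    return 1.0 - sum(squared_cosine(u, v) for u, v in pairs) / len(pairs)


def polynomial_features(x: float, degree: int = 2) -> List[float]:
    """Constant features: the gradient of theta^T phi(x) is phi(x) itself."""
    return [x ** k for k in range(degree + 1)]


def tanh_unit_gradient(x: float, a: float, w: float, b: float) -> List[float]:
    """Gradient of f(x) = a * tanh(w * x + b) with respect to (a, w, b)."""
    h = math.tanh(w * x + b)
    dh = 1.0 - h * h  # derivative of tanh at w * x + b
    return [h, a * dh * x, a * dh]


xs = [-1.0, -0.5, 0.5, 1.0]

# Constant features: theta never appears, so the GAC cannot depend on it.
gac_constant = empirical_gac([polynomial_features(x) for x in xs])

# Non-constant features: same model class, two different parameter settings.
gac_small_w = empirical_gac([tanh_unit_gradient(x, 1.0, 1.0, 0.0) for x in xs])
gac_large_w = empirical_gac([tanh_unit_gradient(x, 1.0, 5.0, 0.0) for x in xs])

print(gac_constant, gac_small_w, gac_large_w)
```

With the large weight the tanh unit saturates, so the gradients at all four inputs lie almost entirely along the $a$-direction and the GAC drops close to zero, while the smaller weight leaves the gradients more spread out. Because the value moves during training, some aggregation over training steps is needed for such models.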

Key Contributions and Results

1. Generalization of Existing Complexity Metrics

The paper proves that GAC naturally generalizes standard complexity hyperparameters for various model classes:

Polynomial Regression: GAC increases strictly with the polynomial degree $p$.
Matérn Kernels (Gaussian/Laplace): GAC decreases strictly with the kernel length scale $l$.
k-Nearest Neighbors (kNN): GAC decreases strictly with the number of neighbors $\kappa$.
Decision Trees: GAC increases strictly with the number of splits (or leaves).
Random Forests: The complexity of an ensemble is shown to be the sum of the single tree complexity and a term dependent on the number of trees and their correlation.
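The decision-tree case makes a convenient worked example. The derivation below is a sketch under stated assumptions rather than the paper's proof: it treats a tree with a fixed partition into leaves $R_1, \dots, R_L$ as a constant-feature model whose parameters are the leaf values, takes $x$ and $x'$ to be independent draws, and introduces $p_\ell = \mathbb{P}(x \in R_\ell)$ as illustrative notation.

```latex
\begin{align*}
\hat{f}(x, \hat{\theta}) &= \sum_{\ell=1}^{L} \hat{\theta}_\ell \, \mathbf{1}[x \in R_\ell]
\quad\Longrightarrow\quad
\phi(x) = e_{\ell(x)} \ \text{(one-hot indicator of the leaf containing } x\text{)}, \\
\left( \frac{\phi(x)^\top \phi(x')}{\|\phi(x)\| \cdot \|\phi(x')\|} \right)^2
  &= \mathbf{1}[\ell(x) = \ell(x')], \\
K(\hat{f}) &= 1 - \mathbb{P}\big(\ell(x) = \ell(x')\big) = 1 - \sum_{\ell=1}^{L} p_\ell^2 .
\end{align*}
% Splitting one leaf of mass p into two leaves of mass p_1, p_2 > 0 with p_1 + p_2 = p:
\begin{align*}
K_{\text{after}} - K_{\text{before}} = p^2 - p_1^2 - p_2^2 = 2 p_1 p_2 > 0 .
\end{align*}
```

Under these assumptions, every split that sends some probability mass to each child strictly increases the GAC, which is consistent with the monotonicity in the number of splits stated above. For example, four equally likely leaves give $K = 1 - 4 \cdot (1/4)^2 = 0.75$, and splitting one of them evenly raises this to $1 - 3/16 - 2/64 = 0.78125$.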

2. Behavior with Respect to Data and Hyperparameters

Dimensionality and Variance: GAC increases with input dimensionality $d$ and input variance $\sigma^2$.
Sample Size Independence: For parametric models with constant features, GAC is independent of the sample size $n$. This contrasts with ENP and its generalizations (GENP-V, GENP-RX), which often exhibit non-monotonic behavior or depend heavily on $n$.
Robustness: Unlike ENP, which can be influenced by regularization strength (e.g., a highly regularized complex model may appear simple under ENP), GAC correctly identifies the underlying model complexity regardless of the specific learned function or regularization.

3. Insights into Double Descent

The authors revisit the double descent phenomenon using GAC as the complexity metric:

Constant-Feature Models: For Random Fourier Features and Random Forests, double descent persists when complexity is measured by GAC.
Non-Constant Feature Models: For Neural Networks and Gradient Boosting, the double descent phenomenon often disappears or becomes less distinct when measured by GAC. The authors argue that in these cases, the "complexity" (feature alignment) may actually decrease as model capacity increases, because larger models can adapt more easily to the data without requiring a more complex feature space. This suggests that previous observations of double descent in these models might be artifacts of initialization schemes or the conflation of function complexity with model complexity.

Significance and Claims

The paper claims that GAC provides a mathematically rigorous yet easy-to-compute alternative to existing complexity measures. Its primary significance lies in:

Model-Agnosticism: It is well-defined for any parametric model and kernel-based non-parametric models.
Distinction of Complexity: It successfully separates model complexity from function complexity, particularly for constant-feature models.
Interpretability: It offers a unified framework to compare complexity across disparate model classes (e.g., comparing a decision tree to a kernel regression).
Clarifying Double Descent: By providing a consistent complexity metric, it helps distinguish between genuine double descent behaviors and artifacts arising from how complexity is defined (e.g., via generalization error proxies like GENP-V).

The authors acknowledge limitations, noting that GAC can be computationally expensive for deep neural networks where the NTK is costly to compute, and that the aggregation method for training dynamics (Equation 2) could be refined. However, they posit that GAC substantially improves our understanding of model complexity.
