Quantitative convergence of trained single layer neural networks to Gaussian processes

This paper establishes explicit upper bounds on the quadratic Wasserstein distance between trained single-layer neural networks and their Gaussian process limits, demonstrating that the approximation error decays polynomially with network width while accounting for the influence of architectural parameters and training dynamics.

Eloy Mosig, Andrea Agazzi, Dario Trevisan

Published 2026-03-06

Here is an explanation of the paper "Quantitative convergence of trained single layer neural networks to Gaussian processes," translated into simple, everyday language with creative analogies.

The Big Picture: From a Chaotic Crowd to a Smooth Wave

Imagine you are trying to predict the weather. You have a super-complex computer model (a Neural Network) with millions of tiny switches (parameters) that you tweak to get the best forecast.

Now, imagine you have a second, much simpler model (a Gaussian Process) that is like a perfectly smooth, mathematical wave. It's elegant, predictable, and easy to calculate, but it doesn't "learn" in the same way the complex model does.

For a long time, mathematicians knew that if you made your complex computer model huge (giving it infinite width), it would eventually start acting exactly like the simple, smooth wave. This is like saying, "If you have enough people in a crowd, their collective behavior becomes a predictable flow."

The Problem: In the real world, we don't have infinite computers. We have finite ones. The big question was: How big does the computer need to be before it acts like the smooth wave? And how close is the approximation while it is actually learning?

This paper answers that question with a ruler. It doesn't just say "it gets close"; it says, "If you double the size of your network, the error drops by this specific amount."


The Core Analogy: The Orchestra vs. The Conductor

To understand the math, let's use an orchestra analogy.

  1. The Neural Network (The Musicians): Imagine a massive orchestra with n_1 musicians (the width of the network). Each musician is a bit chaotic. They start with random notes (random initialization). As they play (training), they adjust their notes to match a specific song (the data).
  2. The Gaussian Process (The Conductor's Score): This is the "ideal" version of the song. It's a perfect, mathematical score that describes exactly how the music should sound if the orchestra were infinite.
  3. The Training (The Rehearsal): As the musicians practice (gradient descent), they try to get closer to the perfect song.

What this paper found:
The authors proved that even while the orchestra is rehearsing (training), if there are enough musicians, the sound they produce is mathematically indistinguishable from the Conductor's Score, within a very specific margin of error.

They calculated exactly how much "noise" or "chaos" remains in the orchestra's sound compared to the perfect score. They found that as you add more musicians (increase the width), the noise shrinks rapidly, following a specific rule: the error is roughly proportional to log(width)/width.
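To get a feel for that rate, here is a tiny sketch that evaluates a bound of the form C · log(n)/n. The constant C is a placeholder: the paper's actual constant depends on the data, the activation, and the training time, so C = 1 here is purely illustrative.

```python
import math

def error_bound(width: int, constant: float = 1.0) -> float:
    """Illustrative bound of the form C * log(width) / width.
    The constant is a stand-in, not the paper's actual value."""
    return constant * math.log(width) / width

# Doubling the width roughly halves the bound (up to the slow log factor).
for n in (1_000, 2_000, 4_000, 8_000):
    print(n, error_bound(n))
```

Note the log factor: doubling the width does not quite halve the error, which is why the rate is "nearly" 1/width rather than exactly 1/width.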

Key Concepts Explained Simply

1. The "Infinite Width" Limit

Think of this as the "Magic Number." If you had an infinite number of neurons, the neural network would stop being a complex, messy machine and become a simple, smooth mathematical function (a Gaussian Process). This makes it easy to analyze, like studying a calm lake instead of a stormy ocean.
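This limit can be watched numerically. The sketch below, which is not the paper's exact setup, samples the output of many freshly initialized single-layer networks at one fixed input; the assumed model is f(x) = (1/√width) · Σ_j a_j · tanh(w_j · x) with standard Gaussian weights, so for Gaussian w_j the preactivation w_j · x is itself Gaussian with variance |x|² and can be sampled directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_outputs(x, width, n_nets=5_000):
    """Outputs f(x) = (1/sqrt(width)) * sum_j a_j * tanh(w_j . x) of many
    freshly initialized networks. Since w_j . x ~ N(0, |x|^2) for Gaussian
    w_j, we sample the preactivations directly instead of the weights."""
    pre = np.tanh(np.linalg.norm(x) * rng.standard_normal((n_nets, width)))
    a = rng.standard_normal((n_nets, width))
    return (a * pre).sum(axis=1) / np.sqrt(width)

x = np.array([1.0, -0.5])
outs = init_outputs(x, width=1_000)
print(outs.mean(), outs.std())  # mean near 0; histogram looks Gaussian
```

By the central limit theorem, the distribution of these outputs approaches a centered Gaussian as the width grows, which is exactly the "calm lake" the limit describes.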

2. The "NTK" (Neural Tangent Kernel)

This is the rulebook that tells the infinite network how to learn. It's like a map that shows the orchestra exactly how to move from the starting note to the final song. The paper shows that for wide networks, the real training path follows this map very closely.
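One can compute this "map" empirically. For the same illustrative single-layer model as above (an assumption, not the paper's exact architecture), the empirical NTK at two inputs is the inner product of the parameter gradients, and it fluctuates less and less as the width grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_ntk(x, y, width):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j . x):
    the inner product of the parameter gradients at x and at y."""
    d = x.shape[0]
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    sx, sy = np.tanh(W @ x), np.tanh(W @ y)
    dx, dy = 1 - sx**2, 1 - sy**2            # derivative of tanh
    # d f / d a_j contributions  +  d f / d w_j contributions
    return (sx @ sy + (a**2 * dx * dy).sum() * (x @ y)) / width

x = np.array([1.0, 0.0])
y = np.array([0.5, 0.5])
small = empirical_ntk(x, y, width=100)
large = empirical_ntk(x, y, width=100_000)
print(small, large)  # the wide-network value fluctuates far less across draws
```

At small width each random draw gives a noticeably different kernel; at large width the draws concentrate around a single deterministic value, which is the "rulebook" of the infinite network.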

3. The "Wasserstein Distance" (The Ruler)

This is the fancy math term for "how different are these two things?"

  • Imagine you have a pile of sand shaped like a mountain (the Neural Network's predictions).
  • You have another pile of sand shaped like a perfect cone (the Gaussian Process).
  • The Wasserstein distance is the amount of work (energy) it takes to move the sand from the mountain shape to the cone shape.
  • The Paper's Result: They calculated exactly how much "work" is needed. They found that for a wide network, this work is very small and gets smaller as the network gets wider.
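The sand-moving picture above has a particularly simple form in one dimension: the optimal way to move the sand is to match sorted samples. A minimal sketch, with made-up "network" and "GP" samples standing in for the real objects:

```python
import numpy as np

rng = np.random.default_rng(2)

def w2_1d(samples_a, samples_b):
    """Quadratic Wasserstein distance between two 1-D empirical
    distributions with equally many samples: sorting both and matching
    them in order is the optimal coupling in one dimension."""
    a, b = np.sort(samples_a), np.sort(samples_b)
    return np.sqrt(np.mean((a - b) ** 2))

# Toy stand-ins: the "GP" is exactly Gaussian; the "networks" deviate
# from it by noise whose size mimics a wide vs. a narrow network.
gp = rng.standard_normal(50_000)
net_wide   = gp + 0.01 * rng.standard_normal(50_000)  # small deviation
net_narrow = gp + 0.30 * rng.standard_normal(50_000)  # larger deviation
print(w2_1d(net_wide, gp), w2_1d(net_narrow, gp))
```

The "wide" pile needs far less work to reshape into the Gaussian cone than the "narrow" one, which is the qualitative content of the paper's bound.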

4. Training Time (The "t" factor)

One of the paper's biggest contributions is looking at the network while it is learning, not just at the start.

  • Analogy: Imagine watching a clay sculpture being made. At the start, it's a rough blob. As the artist works, it becomes smoother.
  • The authors showed that even as the artist (the training algorithm) works for a long time, the sculpture stays very close to the "ideal" mathematical shape, provided the artist doesn't work for too long (specifically, for no longer than a training time that grows polynomially with the network's width).
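A related effect, sometimes called "lazy training", can be seen in a toy experiment: train the same illustrative single-layer model by gradient descent on a small regression task and measure how far the weights move relative to their starting point. Everything here (the task, the tanh activation, the step size) is an assumption chosen for illustration, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(3)

def relative_drift(width, steps=200, lr=0.1):
    """Train f(x) = (1/sqrt(m)) * a . tanh(W x) by full-batch gradient
    descent on a tiny regression task and return ||W - W0|| / ||W0||."""
    X = rng.standard_normal((20, 2))
    y_true = np.sin(X[:, 0])
    W0 = rng.standard_normal((width, 2))
    a = rng.standard_normal(width)
    W = W0.copy()
    for _ in range(steps):
        h = np.tanh(X @ W.T)                  # (20, width) hidden activations
        pred = h @ a / np.sqrt(width)
        err = pred - y_true
        # gradient of 0.5 * mean squared error with respect to W
        gW = ((err[:, None] * (1 - h**2) * a).T @ X) / (len(X) * np.sqrt(width))
        W -= lr * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

print(relative_drift(50), relative_drift(5_000))  # wider => relatively smaller drift
```

The wider network's weights barely move in relative terms, which is why its training trajectory stays close to the linear, NTK-governed one for a long stretch of the rehearsal.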

Why Does This Matter?

1. Trusting the "Black Box"
Neural networks are often called "black boxes" because we don't know exactly how they think. This paper gives us a way to peek inside. If a network is wide enough, we can trust that its behavior is predictable and follows the rules of the simpler Gaussian Process. This helps us understand why deep learning works.

2. Safety and Uncertainty
In fields like self-driving cars or medical diagnosis, we need to know: "How sure are you?"
Because the Gaussian Process is mathematically well-understood, we can use it to estimate the "uncertainty" of the real neural network. If the network is wide enough, the paper bounds exactly how far those uncertainty estimates can be from the truth.

3. Designing Better AI
The paper tells engineers exactly how wide a network needs to be to get a certain level of accuracy. It's like a recipe: "If you want the error to be less than 1%, you need at least X neurons." This prevents wasting money on networks that are too small (and inaccurate) or too big (and expensive).
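In spirit, that recipe amounts to inverting the log(width)/width bound: given a target error, find the smallest width that satisfies it. The constant in front of the bound is unknown here (it depends on the data, the activation, and the training time), so the value 1.0 below is purely illustrative.

```python
import math

def min_width(eps: float, constant: float = 1.0) -> int:
    """Smallest power-of-two width n with constant * log(n) / n < eps.
    The constant is a placeholder; the paper's actual constant depends
    on the problem, so the returned number is only an order of magnitude."""
    n = 2
    while constant * math.log(n) / n >= eps:
        n *= 2
    return n

print(min_width(0.01))  # -> 1024
```

With the illustrative constant, a 1% error target lands around a thousand neurons; a real application would first have to estimate the constant for its own data.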

The "Catch" (Limitations)

The authors are honest about the limits:

  • The "Bad Event": There is a tiny, tiny chance that the network gets stuck in a weird, chaotic state (like an orchestra member playing the wrong instrument loudly). The math accounts for this, but it means the guarantee is "almost always true," not "100% true."
  • Time Limits: If you train the network for an incredibly long time, it might eventually drift away from the simple mathematical rules and start learning complex, non-linear features that the simple model can't describe. The paper quantifies exactly how long you can train before this happens.

Summary

This paper is a bridge between theory and reality.

  • Theory says: "Infinite networks are perfect and predictable."
  • Reality says: "We have finite networks."
  • This Paper says: "Here is the exact formula for how close your finite network is to the perfect one, how the error shrinks as you add more neurons, and how long you can train it before the math breaks down."

It turns a vague promise ("big networks are good") into a precise engineering guideline ("big networks are good, and here is the math to prove it").