Quantitative convergence of trained single layer neural networks to Gaussian processes

This paper establishes explicit upper bounds on the quadratic Wasserstein distance between trained single-layer neural networks and their Gaussian process limits, demonstrating that the approximation error decays polynomially with network width while accounting for the influence of architectural parameters and training dynamics.

Eloy Mosig, Andrea Agazzi, Dario Trevisan

Published 2026-03-06

Here is an explanation of the paper "Quantitative convergence of trained single layer neural networks to Gaussian processes," translated into simple, everyday language with creative analogies.

The Big Picture: From a Chaotic Crowd to a Smooth Wave

Imagine you are trying to predict the weather. You have a super-complex computer model (a Neural Network) with millions of tiny switches (parameters) that you tweak to get the best forecast.

Now, imagine you have a second, much simpler model (a Gaussian Process) that is like a perfectly smooth, mathematical wave. It's elegant, predictable, and easy to calculate, but it doesn't "learn" in the same way the complex model does.

For a long time, mathematicians knew that if you made your complex computer model huge (giving it infinite width), it would eventually start acting exactly like the simple, smooth wave. This is like saying, "If you have enough people in a crowd, their collective behavior becomes a predictable flow."

The Problem: In the real world, we don't have infinite computers. We have finite ones. The big question was: How big does the computer need to be before it acts like the smooth wave? And how close is the approximation while it is actually learning?

This paper answers that question with a ruler. It doesn't just say "it gets close"; it says, "If you double the size of your network, the error drops by this specific amount."


The Core Analogy: The Orchestra vs. The Conductor

To understand the math, let's use an orchestra analogy.

  1. The Neural Network (The Musicians): Imagine a massive orchestra with n_1 musicians (the width of the network). Each musician is a bit chaotic. They start with random notes (random initialization). As they play (training), they adjust their notes to match a specific song (the data).
  2. The Gaussian Process (The Conductor's Score): This is the "ideal" version of the song. It's a perfect, mathematical score that describes exactly how the music should sound if the orchestra were infinite.
  3. The Training (The Rehearsal): As the musicians practice (gradient descent), they try to get closer to the perfect song.

What this paper found:
The authors proved that even while the orchestra is rehearsing (training), if there are enough musicians, the sound they produce is mathematically indistinguishable from the Conductor's Score, within a very specific margin of error.

They calculated exactly how much "noise" or "chaos" remains in the orchestra's sound compared to the perfect score. They found that as you add more musicians (increase the width), the noise shrinks rapidly, following a specific rule: the error is roughly proportional to log(width)/width.
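To get a feel for that rate, here is a tiny sketch that evaluates a bound of the form C · log(n)/n. The constant C is a placeholder: the paper's actual constant depends on the data, the activation, and the training time, so C = 1 here is purely illustrative.

```python
import math

def error_bound(width: int, constant: float = 1.0) -> float:
    """Illustrative bound of the form C * log(width) / width.
    The constant is a stand-in, not the paper's actual value."""
    return constant * math.log(width) / width

# Doubling the width roughly halves the bound (up to the slow log factor).
for n in (1_000, 2_000, 4_000, 8_000):
    print(n, error_bound(n))
```

Note the log factor: doubling the width does not quite halve the error, which is why the rate is "nearly" 1/width rather than exactly 1/width.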

Key Concepts Explained Simply

1. The "Infinite Width" Limit

Think of this as the "Magic Number." If you had an infinite number of neurons, the neural network would stop being a complex, messy machine and become a simple, smooth mathematical function (a Gaussian Process). This makes it easy to analyze, like studying a calm lake instead of a stormy ocean.
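This limit can be watched numerically. The sketch below, which is not the paper's exact setup, samples the output of many freshly initialized single-layer networks at one fixed input; the assumed model is f(x) = (1/√width) · Σ_j a_j · tanh(w_j · x) with standard Gaussian weights, so for Gaussian w_j the preactivation w_j · x is itself Gaussian with variance |x|² and can be sampled directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_outputs(x, width, n_nets=5_000):
    """Outputs f(x) = (1/sqrt(width)) * sum_j a_j * tanh(w_j . x) of many
    freshly initialized networks. Since w_j . x ~ N(0, |x|^2) for Gaussian
    w_j, we sample the preactivations directly instead of the weights."""
    pre = np.tanh(np.linalg.norm(x) * rng.standard_normal((n_nets, width)))
    a = rng.standard_normal((n_nets, width))
    return (a * pre).sum(axis=1) / np.sqrt(width)

x = np.array([1.0, -0.5])
outs = init_outputs(x, width=1_000)
print(outs.mean(), outs.std())  # mean near 0; histogram looks Gaussian
```

By the central limit theorem, the distribution of these outputs approaches a centered Gaussian as the width grows, which is exactly the "calm lake" the limit describes.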

2. The "NTK" (Neural Tangent Kernel)

This is the rulebook that tells the infinite network how to learn. It's like a map that shows the orchestra exactly how to move from the starting note to the final song. The paper shows that for wide networks, the real training path follows this map very closely.
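One can compute this "map" empirically. For the same illustrative single-layer model as above (an assumption, not the paper's exact architecture), the empirical NTK at two inputs is the inner product of the parameter gradients, and it fluctuates less and less as the width grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_ntk(x, y, width):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j . x):
    the inner product of the parameter gradients at x and at y."""
    d = x.shape[0]
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    sx, sy = np.tanh(W @ x), np.tanh(W @ y)
    dx, dy = 1 - sx**2, 1 - sy**2            # derivative of tanh
    # d f / d a_j contributions  +  d f / d w_j contributions
    return (sx @ sy + (a**2 * dx * dy).sum() * (x @ y)) / width

x = np.array([1.0, 0.0])
y = np.array([0.5, 0.5])
small = empirical_ntk(x, y, width=100)
large = empirical_ntk(x, y, width=100_000)
print(small, large)  # the wide-network value fluctuates far less across draws
```

At small width each random draw gives a noticeably different kernel; at large width the draws concentrate around a single deterministic value, which is the "rulebook" of the infinite network.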

3. The "Wasserstein Distance" (The Ruler)

This is the fancy math term for "how different are these two things?"

  • Imagine you have a pile of sand shaped like a mountain (the Neural Network's predictions).
  • You have another pile of sand shaped like a perfect cone (the Gaussian Process).
  • The Wasserstein distance is the amount of work (energy) it takes to move the sand from the mountain shape to the cone shape.
  • The Paper's Result: They calculated exactly how much "work" is needed. They found that for a wide network, this work is very small and gets smaller as the network gets wider.
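The sand-moving picture above has a particularly simple form in one dimension: the optimal way to move the sand is to match sorted samples. A minimal sketch, with made-up "network" and "GP" samples standing in for the real objects:

```python
import numpy as np

rng = np.random.default_rng(2)

def w2_1d(samples_a, samples_b):
    """Quadratic Wasserstein distance between two 1-D empirical
    distributions with equally many samples: sorting both and matching
    them in order is the optimal coupling in one dimension."""
    a, b = np.sort(samples_a), np.sort(samples_b)
    return np.sqrt(np.mean((a - b) ** 2))

# Toy stand-ins: the "GP" is exactly Gaussian; the "networks" deviate
# from it by noise whose size mimics a wide vs. a narrow network.
gp = rng.standard_normal(50_000)
net_wide   = gp + 0.01 * rng.standard_normal(50_000)  # small deviation
net_narrow = gp + 0.30 * rng.standard_normal(50_000)  # larger deviation
print(w2_1d(net_wide, gp), w2_1d(net_narrow, gp))
```

The "wide" pile needs far less work to reshape into the Gaussian cone than the "narrow" one, which is the qualitative content of the paper's bound.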

4. Training Time (The "t" factor)

One of the paper's biggest contributions is looking at the network while it is learning, not just at the start.

  • Analogy: Imagine watching a clay sculpture being made. At the start, it's a rough blob. As the artist works, it becomes smoother.
  • The authors showed that even as the artist (the training algorithm) works for a long time, the sculpture stays very close to the "ideal" mathematical shape, provided the artist doesn't work for too long (specifically, for no longer than a training time that grows polynomially with the network's width).
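A related effect, sometimes called "lazy training", can be seen in a toy experiment: train the same illustrative single-layer model by gradient descent on a small regression task and measure how far the weights move relative to their starting point. Everything here (the task, the tanh activation, the step size) is an assumption chosen for illustration, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(3)

def relative_drift(width, steps=200, lr=0.1):
    """Train f(x) = (1/sqrt(m)) * a . tanh(W x) by full-batch gradient
    descent on a tiny regression task and return ||W - W0|| / ||W0||."""
    X = rng.standard_normal((20, 2))
    y_true = np.sin(X[:, 0])
    W0 = rng.standard_normal((width, 2))
    a = rng.standard_normal(width)
    W = W0.copy()
    for _ in range(steps):
        h = np.tanh(X @ W.T)                  # (20, width) hidden activations
        pred = h @ a / np.sqrt(width)
        err = pred - y_true
        # gradient of 0.5 * mean squared error with respect to W
        gW = ((err[:, None] * (1 - h**2) * a).T @ X) / (len(X) * np.sqrt(width))
        W -= lr * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

print(relative_drift(50), relative_drift(5_000))  # wider => relatively smaller drift
```

The wider network's weights barely move in relative terms, which is why its training trajectory stays close to the linear, NTK-governed one for a long stretch of the rehearsal.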

Why Does This Matter?

1. Trusting the "Black Box"
Neural networks are often called "black boxes" because we don't know exactly how they think. This paper gives us a way to peek inside. If a network is wide enough, we can trust that its behavior is predictable and follows the rules of the simpler Gaussian Process. This helps us understand why deep learning works.

2. Safety and Uncertainty
In fields like self-driving cars or medical diagnosis, we need to know: "How sure are you?"
Because the Gaussian Process is mathematically well-understood, we can use it to estimate the "uncertainty" of the real neural network. If the network is wide enough, the paper bounds exactly how far those uncertainty estimates can be from the truth.

3. Designing Better AI
The paper tells engineers exactly how wide a network needs to be to get a certain level of accuracy. It's like a recipe: "If you want the error to be less than 1%, you need at least X neurons." This prevents wasting money on networks that are too small (and inaccurate) or too big (and expensive).
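In spirit, that recipe amounts to inverting the log(width)/width bound: given a target error, find the smallest width that satisfies it. The constant in front of the bound is unknown here (it depends on the data, the activation, and the training time), so the value 1.0 below is purely illustrative.

```python
import math

def min_width(eps: float, constant: float = 1.0) -> int:
    """Smallest power-of-two width n with constant * log(n) / n < eps.
    The constant is a placeholder; the paper's actual constant depends
    on the problem, so the returned number is only an order of magnitude."""
    n = 2
    while constant * math.log(n) / n >= eps:
        n *= 2
    return n

print(min_width(0.01))  # -> 1024
```

With the illustrative constant, a 1% error target lands around a thousand neurons; a real application would first have to estimate the constant for its own data.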

The "Catch" (Limitations)

The authors are honest about the limits:

  • The "Bad Event": There is a tiny, tiny chance that the network gets stuck in a weird, chaotic state (like an orchestra member playing the wrong instrument loudly). The math accounts for this, but it means the guarantee is "almost always true," not "100% true."
  • Time Limits: If you train the network for an incredibly long time, it might eventually drift away from the simple mathematical rules and start learning complex, non-linear features that the simple model can't describe. The paper quantifies exactly how long you can train before this happens.

Summary

This paper is a bridge between theory and reality.

  • Theory says: "Infinite networks are perfect and predictable."
  • Reality says: "We have finite networks."
  • This Paper says: "Here is the exact formula for how close your finite network is to the perfect one, how the error shrinks as you add more neurons, and how long you can train it before the math breaks down."

It turns a vague promise ("big networks are good") into a precise engineering guideline ("big networks are good, and here is the math to prove it").