Imagine you are trying to build a skyscraper. In the world of Artificial Intelligence, these skyscrapers are called Deep Residual Networks (ResNets). They are the engines behind many modern AI systems, from recognizing cats in photos to writing code.
Usually, to make these skyscrapers work well, engineers have to carefully tune two main things:
- Depth: How many floors the building has.
- Width: How wide each floor is (how many workers are on each floor).
For a long time, mathematicians and engineers believed that to understand how these buildings behave when they get really tall (infinite depth), you also had to make them really wide (infinite width). They thought you needed both to happen at the same time to see the "true" behavior of the AI.
The Big Discovery:
This paper, written by Lénaïc Chizat, flips that idea on its head. The author proves that you don't need the building to be infinitely wide to understand what happens when it gets infinitely tall.
Even if your building is only one worker wide per floor (extremely narrow), as long as you keep adding floors, the behavior of the whole building settles into a predictable, smooth pattern. It's as if the "hidden width" of the building is actually infinite, regardless of how narrow it physically is.
The Core Metaphor: The "Crowd" vs. The "Average"
To understand the math, let's use a Crowd Metaphor.
Imagine a long hallway (the depth of the network) filled with people (the neurons/units).
- The Old View: To predict how the crowd moves, you needed to assume there were millions of people on every step of the hallway (infinite width).
- The New View: The author shows that even if there is only one person on each step, as the hallway gets longer and longer, that single person's movement starts to look exactly like the average movement of a massive crowd.
Why? Because of Randomness.
When the building is first constructed, the workers are placed randomly. As the signal travels up the hallway (forward pass) and the instructions come back down (backward pass), the randomness of the initial setup acts like a "stochastic approximation." It's like rolling a die many times; even if you roll it once per floor, the average result over a thousand floors becomes very predictable.
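The die-rolling intuition is just the law of large numbers, and it's easy to check numerically. The sketch below (my own toy illustration, not from the paper) rolls one die per "floor" and shows the average over many floors settling near 3.5:

```python
import random

random.seed(0)

# One die roll per "floor": each floor sees only a single random value,
# yet the average over many floors is highly predictable.
floors = 10_000
rolls = [random.randint(1, 6) for _ in range(floors)]
average = sum(rolls) / floors

print(f"Average over {floors} floors: {average:.3f}")  # close to 3.5
```

This is the same mechanism the paper exploits: each layer contributes one random sample, and depth plays the role of the number of rolls.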
The Two "Modes" of Operation
The paper identifies two different ways these AI buildings can behave, depending on how you scale the "residual" (the connection between floors). Think of this as the stiffness of the building's elevators.
1. The "Maximal Local Update" (MLU) Regime: The Flexible Gym
- The Vibe: This is the "sweet spot" for learning.
- The Analogy: Imagine a gymnast on a balance beam. Every time they take a step (a training update), they adjust their balance significantly. They are learning features—they are actively changing how they see the world.
- The Math: In this regime, the AI is genuinely non-linear. It's learning complex patterns, not just memorizing. The paper proves that if you scale the connections correctly (specifically related to the square root of the embedding dimension), the AI learns efficiently, and the error drops predictably as you add more floors.
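To see why the square-root scaling matters, here is a toy width-1 ResNet at random initialization. It is a hypothetical sketch, not the paper's exact parameterization (with width 1 the embedding dimension is trivial, so I scale the residual branch by the square root of the depth instead, which plays the analogous stabilizing role in the depth limit). With that scaling, thousands of random residual nudges average into a stable output instead of exploding:

```python
import math
import random

random.seed(0)

def forward(depth, scale):
    """Toy width-1 ResNet: x_{l+1} = x_l + scale * a_l * tanh(w_l * x_l).

    a_l and w_l are Gaussian weights drawn at initialization.
    Hypothetical toy model for illustration only.
    """
    x = 1.0
    for _ in range(depth):
        a, w = random.gauss(0, 1), random.gauss(0, 1)
        x += scale * a * math.tanh(w * x)
    return x

for depth in (100, 1_000, 10_000):
    # Critical-style scaling: each residual branch shrinks like 1/sqrt(depth),
    # so the output stays of order 1 no matter how many floors you add.
    print(depth, round(forward(depth, depth ** -0.5), 3))
```

If you drop the `depth ** -0.5` factor and use `scale=1.0`, the same loop produces wildly varying outputs as depth grows — the "unstable" side of the phase diagram discussed below.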
2. The "Lazy ODE" Regime: The Stiff Robot
- The Vibe: This is the "lazy" mode.
- The Analogy: Imagine a stiff, rigid robot on a conveyor belt. When you push it, it barely moves. It doesn't really learn new features; it just slightly tweaks its initial settings. It's like a linear approximation.
- The Math: If you scale the connections too aggressively (making the "elevator" too stiff), the AI stops learning features and just acts like a simple linear model. It's stable, but it's not very smart.
The "Phase Diagram": A Map for Engineers
The paper draws a Phase Diagram (Figure 4 in the text). Think of this as a weather map for AI architects.
- Green Zone (Sub-critical): The building is too "loose." It behaves like it has no width at all.
- Blue Zone (Critical/MLU): This is the Goldilocks Zone. The scaling is just right. The building is narrow, but it learns effectively, just like a massive, wide building would.
- Red Zone (Lazy/Explosion): The building is too "stiff" or unstable. It either stops learning or falls apart.
The author's key insight is that the Critical Zone (Blue) is the only place where you get "Maximal Local Updates"—meaning the AI actually learns new things rather than just shuffling its initial random weights.
Why Does This Matter?
- Efficiency: You don't need to build massive, wide models to get the benefits of deep learning. You can build narrow, deep models and they will behave just as well, provided you tune the "scaling factors" correctly.
- Predictability: The paper gives a precise formula for how much error you will have. It's like saying, "If you add 100 more floors, your prediction error will drop by exactly this much." This helps engineers know exactly how big their model needs to be without wasting money on trial and error.
- Simplicity: It unifies two different theories (Neural ODEs and Mean-Field theory) into one simple picture: Depth creates width.
The Takeaway
This paper tells us that in the world of Deep Learning, depth is the new width.
If you build a very deep ResNet with the right scaling, it behaves as if it were infinitely wide, even if it's physically narrow. It's a bit like a single strand of DNA containing the blueprint for a whole human; the information is there, and with the right "training" (gradient descent), the structure unfolds perfectly.
The author has essentially handed architects a new rulebook: Don't worry about making your AI wider; just make it deeper, and tune the elevators correctly.