How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

This paper demonstrates that for high-dimensional random data, gradient descent on shallow ReLU networks exhibits an implicit bias that approximates the minimum L2-norm solution with high probability, bridging the gap between worst-case non-existence and exact-orthogonality results through a novel primal-dual analysis.

Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar

Published 2026-03-06

Imagine you are trying to solve a massive puzzle, but you have way more puzzle pieces than you need. In fact, you have so many pieces that there are millions of different ways to complete the picture perfectly. This is what happens in modern AI when we use "overparameterized" neural networks: there are countless perfect solutions to the training problem.

So, how does the computer decide which solution to pick? It uses an algorithm called Gradient Descent, which is like a hiker trying to find the lowest point in a foggy valley. The hiker doesn't see the whole map; they just take small steps downhill. The "implicit bias" studied in this paper is the question of which specific valley the hiker ends up in, even though no rule ever told them which one to aim for.

Here is the story of what this paper discovered, explained through simple analogies.

1. The Setup: The "ReLU" Hiker

The paper focuses on a specific type of neural network using an activation function called ReLU (Rectified Linear Unit).

  • The Analogy: Imagine ReLU is a strict gatekeeper. If a signal is positive, the gate opens and lets it through. If the signal is negative, the gate slams shut, and the signal becomes zero.
  • The Problem: Because the gate slams shut, the landscape the hiker is walking on isn't smooth; it's full of cliffs and sudden stops. This makes it very hard to predict where the hiker will end up.
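The gatekeeper and its cliff-edge behavior are easy to see in code. A minimal numpy sketch (the function names `relu` and `relu_grad` are just illustrative labels):

```python
import numpy as np

def relu(z):
    # The gatekeeper: positive signals pass through, negative ones become zero
    return np.maximum(z, 0.0)

def relu_grad(z):
    # The "gate": 1 where the signal passes, 0 where the gate is slammed shut.
    # This jump from 1 to 0 is the "cliff" that makes the landscape non-smooth.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu(z)       # → [0., 0., 0., 0.5, 2.]
relu_grad(z)  # → [0., 0., 0., 1., 1.]
```

Note that the gradient drops from 1 to 0 discontinuously at zero: an example sitting on the wrong side of the gate contributes nothing at all to the next step.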

2. The Two Extremes (What We Knew Before)

Before this paper, researchers knew two extreme scenarios:

  • The Worst Case: If the data is messy and weird, the hiker could end up anywhere. There is no predictable pattern.
  • The Perfect Case: If the data is perfectly organized (like a grid where every piece is perfectly orthogonal to the others), the hiker always finds the "Minimum Norm" solution.
    • The Analogy: The "Minimum Norm" solution is like finding the shortest, most efficient path to the destination. It's the "laziest" solution that still works perfectly.
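For a linear system with more unknowns than equations, this "laziest" interpolator can be computed in closed form with the pseudoinverse. A minimal numpy sketch (a generic illustration, not the paper's setting), showing that any other perfect solution must have a larger norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                      # n equations, d unknowns: underdetermined
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum L2-norm solution among all w satisfying X @ w = y
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)    # it still fits the data exactly

# Build another interpolator by adding a null-space component...
null_dir = rng.standard_normal(d)
null_dir -= X.T @ np.linalg.pinv(X.T) @ null_dir   # project out the row space
w_other = w_min + null_dir
assert np.allclose(X @ w_other, y)                 # ...it also fits exactly,
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)  # but it is "less lazy"
```

Among the millions of ways to complete the puzzle, `w_min` is the unique shortest one; the paper asks how close gradient descent on a ReLU network lands to this benchmark.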

3. The New Discovery: The "High-Dimensional" Middle Ground

The authors asked: What happens in the real world? Real data isn't perfectly organized, but it's also not total chaos. It's "high-dimensional" (lots of features, like having thousands of puzzle pieces).

The Big Reveal:
When the data is high-dimensional (which is true for most modern AI tasks), the hiker almost finds the "Minimum Norm" (the shortest path), but not exactly.

  • The Gap: There is a tiny, predictable gap between where the hiker stops and the perfect shortest path.
  • The Size of the Gap: This gap depends on the number of training examples (n) versus the data dimension (d, the number of features). The higher the dimension d relative to n, the smaller the gap. The hiker gets closer and closer to the "perfect" lazy solution as the data becomes more high-dimensional.

4. How They Figured It Out: The "Primal-Dual" Detective Work

To solve this, the authors didn't just watch the hiker (the weights). They invented a new way of looking at the problem using Primal-Dual Analysis.

  • The Primal Variable (The Prediction): Think of this as the hiker's current guess at the answer.
  • The Dual Variable (The "Guilt"): Think of this as a scorecard tracking how much the hiker "owes" to each data point.
    • If the hiker predicts a positive number for a positive label, the gate opens, and the "guilt" score updates.
    • If the hiker predicts a negative number for a positive label, the gate slams shut. The "guilt" score freezes. It stops changing.

The Magic Trick:
The authors realized that in high dimensions, the "guilt" scores for the wrong examples freeze very quickly.

  • Positive Labels: The gate stays open. The hiker keeps adjusting to fit these perfectly.
  • Negative Labels: The gate slams shut immediately. The hiker stops paying attention to these specific data points entirely.

Because the hiker stops paying attention to the "negative" examples (they become inactive), the problem simplifies. The hiker effectively solves a simpler linear regression problem on just the "positive" examples. This is why the final result looks so much like the "Minimum Norm" solution, even though the rules of the game (ReLU) are non-linear.
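This freezing mechanism can be watched in a toy experiment (illustrative only: a single ReLU unit rather than the paper's network, with a small initialization aligned with the labels so the gates start "sorted" the way near-orthogonal high-dimensional data sorts them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 2000                     # few examples, many dimensions
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # rows are near-orthogonal in high d
y = np.array([1.0] * 5 + [-1.0] * 5)            # 5 positive, 5 negative labels

w = 0.3 * X.T @ y                   # gates start open for positives, shut for negatives
lr = 0.5
for _ in range(500):
    pre = X @ w
    pred = np.maximum(pre, 0.0)            # ReLU output
    gate = (pre > 0).astype(float)         # a closed gate sends zero gradient
    w -= lr * (X.T @ ((pred - y) * gate))  # only active examples steer the hiker

pred = np.maximum(X @ w, 0.0)
# Positive labels: gate stays open, fit essentially exactly (pred ≈ 1)
# Negative labels: gate stays shut, prediction frozen at exactly 0
```

In the run above, the negative-label examples never contribute a single gradient step; the dynamics reduce to linear regression on the positive examples alone, which is the heart of the paper's argument.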

5. The "Example Selection" Metaphor

In a standard linear model, the hiker tries to fit every data point equally.
In this ReLU model, the hiker acts like a curator.

  • If a data point has a positive label, the hiker says, "I will fit you perfectly."
  • If a data point has a negative label, the hiker says, "I'm ignoring you for now," and effectively deletes it from the puzzle.

The paper proves that in high dimensions, this "curation" process happens so reliably that the final solution is mathematically very close to the optimal "shortest path" solution, just with a tiny, calculable error margin.
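The curation story can be checked directly in a toy sketch (again illustrative, not the paper's construction): keep only the positive-label rows, solve the plain minimum-norm linear regression on just those, and pass the result through the ReLU gate. On high-dimensional random data this curated predictor already fits the positives exactly while clamping the deleted negatives near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 2000
X = rng.standard_normal((n, d)) / np.sqrt(d)    # rows have norm ≈ 1
y = np.array([1.0] * 5 + [-1.0] * 5)

pos = y > 0
# "Curation": minimum-norm linear fit on the positive examples only
w = np.linalg.pinv(X[pos]) @ y[pos]

pred = np.maximum(X @ w, 0.0)   # pass the curated linear predictor through the gate
# Positives are interpolated exactly (pred = 1); negatives land near 0,
# because their near-orthogonal rows barely overlap with the curated solution.
```

This is the "simpler linear regression on just the positive examples" from Section 4, made literal: the nonlinear network's implicit bias ends up close to this curated minimum-norm solution.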

Summary for the Everyday Reader

This paper explains why AI models, which seem chaotic and complex, actually behave in a very predictable way when they have lots of data.

  1. The Algorithm: Gradient Descent is a hiker looking for the bottom of a valley.
  2. The Twist: The valley has "ReLU" gates that block certain paths.
  3. The Result: In high-dimensional data (lots of features), the hiker naturally sorts the data into "active" (pay attention) and "inactive" (ignore) groups.
  4. The Conclusion: Because the hiker ignores the "inactive" group so efficiently, they end up finding a solution that is almost identical to the most efficient, "lazy" solution possible (the Minimum Norm), with only a tiny, predictable difference.

This gives us confidence that even without explicit rules telling the AI to be "simple," the math of high-dimensional data forces it to find simple, efficient solutions automatically.