How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

This paper demonstrates that for high-dimensional random data, gradient descent on shallow ReLU networks exhibits an implicit bias that approximates the minimum L2-norm solution with high probability, bridging the gap between worst-case non-existence and exact-orthogonality results through a novel primal-dual analysis.

Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar

Published 2026-03-06

Imagine you are trying to solve a massive puzzle, but you have way more puzzle pieces than you need. In fact, you have so many pieces that there are millions of different ways to complete the picture perfectly. This is what happens in modern AI when we use "overparameterized" neural networks: there are countless perfect solutions to the training problem.

So, how does the computer decide which solution to pick? It uses an algorithm called Gradient Descent, which is like a hiker trying to find the lowest point in a foggy valley. The hiker doesn't see the whole map; they just take small steps downhill. The "implicit bias" studied in this paper is the question of which specific valley the hiker ends up in, even though no rule ever told them which one to aim for.

Here is the story of what this paper discovered, explained through simple analogies.

1. The Setup: The "ReLU" Hiker

The paper focuses on a specific type of neural network using an activation function called ReLU (Rectified Linear Unit).

  • The Analogy: Imagine ReLU is a strict gatekeeper. If a signal is positive, the gate opens and lets it through. If the signal is negative, the gate slams shut, and the signal becomes zero.
  • The Problem: Because the gate slams shut, the landscape the hiker is walking on isn't smooth; it's full of cliffs and sudden stops. This makes it very hard to predict where the hiker will end up.
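The gatekeeper and its cliff-edge behavior are easy to see in code. A minimal numpy sketch (the function names `relu` and `relu_grad` are just illustrative labels):

```python
import numpy as np

def relu(z):
    # The gatekeeper: positive signals pass through, negative ones become zero
    return np.maximum(z, 0.0)

def relu_grad(z):
    # The "gate": 1 where the signal passes, 0 where the gate is slammed shut.
    # This jump from 1 to 0 is the "cliff" that makes the landscape non-smooth.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu(z)       # → [0., 0., 0., 0.5, 2.]
relu_grad(z)  # → [0., 0., 0., 1., 1.]
```

Note that the gradient drops from 1 to 0 discontinuously at zero: an example sitting on the wrong side of the gate contributes nothing at all to the next step.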

2. The Two Extremes (What We Knew Before)

Before this paper, researchers knew two extreme scenarios:

  • The Worst Case: If the data is messy and weird, the hiker could end up anywhere. There is no predictable pattern.
  • The Perfect Case: If the data is perfectly organized (like a grid where every piece is perfectly orthogonal to the others), the hiker always finds the "Minimum Norm" solution.
    • The Analogy: The "Minimum Norm" solution is like finding the shortest, most efficient path to the destination. It's the "laziest" solution that still works perfectly.
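For a linear system with more unknowns than equations, this "laziest" interpolator can be computed in closed form with the pseudoinverse. A minimal numpy sketch (a generic illustration, not the paper's setting), showing that any other perfect solution must have a larger norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                      # n equations, d unknowns: underdetermined
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum L2-norm solution among all w satisfying X @ w = y
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)    # it still fits the data exactly

# Build another interpolator by adding a null-space component...
null_dir = rng.standard_normal(d)
null_dir -= X.T @ np.linalg.pinv(X.T) @ null_dir   # project out the row space
w_other = w_min + null_dir
assert np.allclose(X @ w_other, y)                 # ...it also fits exactly,
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)  # but it is "less lazy"
```

Among the millions of ways to complete the puzzle, `w_min` is the unique shortest one; the paper asks how close gradient descent on a ReLU network lands to this benchmark.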

3. The New Discovery: The "High-Dimensional" Middle Ground

The authors asked: What happens in the real world? Real data isn't perfectly organized, but it's also not total chaos. It's "high-dimensional" (lots of features, like having thousands of puzzle pieces).

The Big Reveal:
When the data is high-dimensional (which is true for most modern AI tasks), the hiker almost finds the "Minimum Norm" (the shortest path), but not exactly.

  • The Gap: There is a tiny, predictable gap between where the hiker stops and the perfect shortest path.
  • The Size of the Gap: This gap depends on the number of training examples (n) versus the data dimension (d, the number of features). The higher the dimension d relative to n, the smaller the gap. The hiker gets closer and closer to the "perfect" lazy solution as the data becomes more high-dimensional.

4. How They Figured It Out: The "Primal-Dual" Detective Work

To solve this, the authors didn't just watch the hiker (the weights). They invented a new way of looking at the problem using Primal-Dual Analysis.

  • The Primal Variable (The Prediction): Think of this as the hiker's current guess at the answer.
  • The Dual Variable (The "Guilt"): Think of this as a scorecard tracking how much the hiker "owes" to each data point.
    • If the hiker predicts a positive number for a positive label, the gate opens, and the "guilt" score updates.
    • If the hiker predicts a negative number for a positive label, the gate slams shut. The "guilt" score freezes. It stops changing.

The Magic Trick:
The authors realized that in high dimensions, the "guilt" scores for the wrong examples freeze very quickly.

  • Positive Labels: The gate stays open. The hiker keeps adjusting to fit these perfectly.
  • Negative Labels: The gate slams shut immediately. The hiker stops paying attention to these specific data points entirely.

Because the hiker stops paying attention to the "negative" examples (they become inactive), the problem simplifies. The hiker effectively solves a simpler linear regression problem on just the "positive" examples. This is why the final result looks so much like the "Minimum Norm" solution, even though the rules of the game (ReLU) are non-linear.
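This freezing mechanism can be watched in a toy experiment (illustrative only: a single ReLU unit rather than the paper's network, with a small initialization aligned with the labels so the gates start "sorted" the way near-orthogonal high-dimensional data sorts them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 2000                     # few examples, many dimensions
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # rows are near-orthogonal in high d
y = np.array([1.0] * 5 + [-1.0] * 5)            # 5 positive, 5 negative labels

w = 0.3 * X.T @ y                   # gates start open for positives, shut for negatives
lr = 0.5
for _ in range(500):
    pre = X @ w
    pred = np.maximum(pre, 0.0)            # ReLU output
    gate = (pre > 0).astype(float)         # a closed gate sends zero gradient
    w -= lr * (X.T @ ((pred - y) * gate))  # only active examples steer the hiker

pred = np.maximum(X @ w, 0.0)
# Positive labels: gate stays open, fit essentially exactly (pred ≈ 1)
# Negative labels: gate stays shut, prediction frozen at exactly 0
```

In the run above, the negative-label examples never contribute a single gradient step; the dynamics reduce to linear regression on the positive examples alone, which is the heart of the paper's argument.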

5. The "Example Selection" Metaphor

In a standard linear model, the hiker tries to fit every data point equally.
In this ReLU model, the hiker acts like a curator.

  • If a data point has a positive label, the hiker says, "I will fit you perfectly."
  • If a data point has a negative label, the hiker says, "I'm ignoring you for now," and effectively deletes it from the puzzle.

The paper proves that in high dimensions, this "curation" process happens so reliably that the final solution is mathematically very close to the optimal "shortest path" solution, just with a tiny, calculable error margin.
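The curation story can be checked directly in a toy sketch (again illustrative, not the paper's construction): keep only the positive-label rows, solve the plain minimum-norm linear regression on just those, and pass the result through the ReLU gate. On high-dimensional random data this curated predictor already fits the positives exactly while clamping the deleted negatives near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 2000
X = rng.standard_normal((n, d)) / np.sqrt(d)    # rows have norm ≈ 1
y = np.array([1.0] * 5 + [-1.0] * 5)

pos = y > 0
# "Curation": minimum-norm linear fit on the positive examples only
w = np.linalg.pinv(X[pos]) @ y[pos]

pred = np.maximum(X @ w, 0.0)   # pass the curated linear predictor through the gate
# Positives are interpolated exactly (pred = 1); negatives land near 0,
# because their near-orthogonal rows barely overlap with the curated solution.
```

This is the "simpler linear regression on just the positive examples" from Section 4, made literal: the nonlinear network's implicit bias ends up close to this curated minimum-norm solution.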

Summary for the Everyday Reader

This paper explains why AI models, which seem chaotic and complex, actually behave in a very predictable way when they have lots of data.

  1. The Algorithm: Gradient Descent is a hiker looking for the bottom of a valley.
  2. The Twist: The valley has "ReLU" gates that block certain paths.
  3. The Result: In high-dimensional data (lots of features), the hiker naturally sorts the data into "active" (pay attention) and "inactive" (ignore) groups.
  4. The Conclusion: Because the hiker ignores the "inactive" group so efficiently, they end up finding a solution that is almost identical to the most efficient, "lazy" solution possible (the Minimum Norm), with only a tiny, predictable difference.

This gives us confidence that even without explicit rules telling the AI to be "simple," the math of high-dimensional data forces it to find simple, efficient solutions automatically.