Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

This paper analyzes one-pass SGD dynamics in overparameterized quadratic networks, revealing that while overparameterization only modestly accelerates escape from poor generalization plateaus, the algorithm's implicit bias—driven by unconstrained weight norms and conserved quantities—selects the zero-loss solution closest to the random initialization within a continuous manifold of solutions.

Original authors: Dario Bocchi, Theotime Regimbeau, Carlo Lucibello, Luca Saglietti, Chiara Cammarota

Published 2026-04-06

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: A Game of "Copycat"

Imagine you are trying to teach a robot (the Student) to mimic a master chef (the Teacher). The master chef has a secret recipe (the Teacher's weights) that turns ingredients (input data) into a perfect dish (the output). Your robot has its own set of knobs and dials (the Student's weights) that it can turn to try and recreate that dish.

The goal is simple: Turn the robot's knobs until the dish it makes tastes exactly like the master chef's.

This paper studies what happens when:

  1. The robot is overparameterized: It has more knobs than the master chef does.
  2. The robot learns one-pass: It gets to taste the ingredients and adjust its knobs only once per sample, never seeing the same ingredient twice.
  3. The "flavor" math is quadratic: The taste depends on the square of the knob settings, which creates a very specific, bumpy landscape of possibilities.
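The setup above is often formalized as a quadratic teacher-student network: the teacher outputs y = ||T x||² and the student outputs f(x) = ||W x||², trained on the squared error. The scalings and dimensions below are illustrative assumptions, not taken from the paper. A minimal sketch, with a finite-difference check that the per-sample SGD gradient formula is correct:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, p = 6, 2, 4          # input dim, teacher width, student width (p > r: overparameterized)
T = rng.normal(size=(r, d)) / np.sqrt(d)   # teacher weights (the "secret recipe")
W = rng.normal(size=(p, d)) / np.sqrt(d)   # student weights (the "knobs")

def teacher(x):            # quadratic teacher: y = ||T x||^2
    return np.sum((T @ x) ** 2)

def student(W, x):         # quadratic student: f = ||W x||^2
    return np.sum((W @ x) ** 2)

def loss(W, x):            # per-sample squared error
    return 0.5 * (student(W, x) - teacher(x)) ** 2

def grad(W, x):            # analytic gradient: (f - y) * 2 W x x^T
    delta = student(W, x) - teacher(x)
    return 2.0 * delta * np.outer(W @ x, x)

# sanity check: analytic gradient vs. central finite differences
x = rng.normal(size=d)
eps = 1e-6
num = np.zeros_like(W)
for k in range(p):
    for j in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, j] += eps
        Wm[k, j] -= eps
        num[k, j] = (loss(Wp, x) - loss(Wm, x)) / (2 * eps)

print(np.max(np.abs(num - grad(W, x))))   # tiny: the two gradients agree
```

One-pass SGD then repeats `W -= eta * grad(W, x)` with a fresh sample `x` at every step.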

1. The "Flat Plateau" Problem

When the robot starts learning, it usually begins with its knobs set to zero or random values. In this specific type of math (quadratic), the robot hits a Plateau.

  • The Analogy: Imagine the robot is standing on a giant, perfectly flat, foggy meadow. No matter which way it takes a step, the ground feels exactly the same. There is no "downhill" slope to guide it toward the solution.
  • The Result: The robot wanders around aimlessly for a long time, unable to figure out how to get better. This is called the "uninformative plateau."

Does having more knobs help?
You might think, "If the robot has more knobs (overparameterization), it should find the way out faster."

  • The Finding: Surprisingly, not much. Having more knobs doesn't change the shape of the foggy meadow; it just means more legs are wandering at once. The robot still gets stuck for roughly the same amount of time. Extra knobs only modestly speed up the escape attempt; the time it takes to finally break free is set mainly by how complex the Master Chef's recipe is, not by how many knobs the robot has.
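The plateau-then-escape behavior is easy to see in a toy simulation (all scalings, step sizes, and durations below are illustrative choices, not the paper's): with a tiny random initialization, the estimated loss barely moves for hundreds of one-pass SGD steps, then drops sharply once the weights grow enough to feel the hidden slope.

```python
import numpy as np

rng = np.random.default_rng(1)

d, r, p = 4, 2, 4
eta = 0.002                               # learning rate (illustrative choice)
T = rng.normal(size=(r, d)) / np.sqrt(d)  # teacher
W = 1e-3 * rng.normal(size=(p, d))        # tiny init -> long flat plateau

X_test = rng.normal(size=(2000, d))       # held-out set to estimate the population loss
y_test = np.sum((X_test @ T.T) ** 2, axis=1)

def est_loss(W):
    f = np.sum((X_test @ W.T) ** 2, axis=1)
    return 0.5 * np.mean((f - y_test) ** 2)

history = []
for t in range(50_000):                   # one-pass: a fresh sample at every step
    x = rng.normal(size=d)
    delta = np.sum((W @ x) ** 2) - np.sum((T @ x) ** 2)
    W -= eta * 2.0 * delta * np.outer(W @ x, x)
    if t % 100 == 0:
        history.append(est_loss(W))

print(history[0], history[1], history[-1])
# the first two values are nearly identical (the plateau);
# the last is far smaller (after escape)
```

The plateau length here is controlled by how small the initialization is: the gradient is proportional to W itself, so the weights must grow exponentially from near zero before the loss starts to move.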

2. The "Lake of Solutions"

Once the robot finally escapes the foggy plateau, it reaches the "Zero Error" zone. This is where the robot finally makes the perfect dish.

  • The Analogy: In simple problems, there is usually just one perfect spot on the map where the robot can stand to make the dish. But in this complex, overparameterized world, the "perfect spot" isn't a single dot. It's a giant, continuous lake.
  • Why? Because the robot has extra knobs. You can rotate the knobs in many different ways, and as long as the overall shape of the settings remains the same, the dish tastes perfect. It's like having a team of 10 people carrying a table; you can swap who stands where, and the table stays level. There are infinite ways to arrange the team to get the same result.
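The "swap the team, table stays level" degeneracy is rotational invariance: the student's output f(x) = ||W x||² depends on W only through W^T W, so left-multiplying W by any orthogonal matrix R leaves every prediction unchanged. A quick numerical check of this symmetry (a sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

p, d = 5, 3
W = rng.normal(size=(p, d))

# random orthogonal matrix via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(p, p)))

x = rng.normal(size=d)
f_original = np.sum((W @ x) ** 2)
f_rotated = np.sum((R @ W @ x) ** 2)   # ||RWx||^2 = x^T W^T R^T R W x = ||Wx||^2

print(np.isclose(f_original, f_rotated))   # True
```

Every rotation R gives a different weight matrix with identical behavior, which is exactly why the zero-loss solutions form a continuous manifold rather than isolated points.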

3. The "Lazy Traveler" (Implicit Bias)

Here is the most fascinating part. Since there is a whole "lake" of perfect solutions, which one does the robot pick? Does it pick the one closest to the center? The one with the most symmetrical knobs?

  • The Finding: The robot is incredibly lazy. It picks the solution that is closest to where it started.
  • The Analogy: Imagine you are dropped in the middle of a giant, flat lake of perfect solutions. You have a compass that points to "Home" (your starting random position). You don't swim to the far side of the lake just because it looks nicer. You simply walk the shortest distance to the nearest point on the shore that satisfies the "perfect dish" rule.
  • The Science: The paper shows that the learning dynamics (SGD) admit conserved quantities: combinations of the weights that barely change during training. These act like a physical law constraining how far, and in which directions, the weights can drift from where they started, so the robot is mathematically forced to stop at the solution on the manifold closest to its random initialization.
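An illustrative experiment along these lines (toy scalings and hyperparameters of my choosing, not the paper's exact setup): run one-pass SGD to low loss, then compare the distance from the initialization W0 to the solution SGD actually found versus to other, equally valid solutions obtained by rotating it. If SGD is the "lazy traveler", its endpoint should be closer to W0 than typical points on the same manifold.

```python
import numpy as np

rng = np.random.default_rng(3)

d, r, p = 4, 2, 4
eta = 0.002
T = rng.normal(size=(r, d)) / np.sqrt(d)
W0 = rng.normal(size=(p, d)) / np.sqrt(d)   # moderate init (skips the long plateau)
W = W0.copy()

for t in range(30_000):                     # one-pass SGD, fresh sample each step
    x = rng.normal(size=d)
    delta = np.sum((W @ x) ** 2) - np.sum((T @ x) ** 2)
    W -= eta * 2.0 * delta * np.outer(W @ x, x)

d_sgd = np.linalg.norm(W - W0)              # distance from init to SGD's endpoint

# distances from init to other points on the same solution manifold {R W}
d_rot = []
for _ in range(100):
    R, _ = np.linalg.qr(rng.normal(size=(p, p)))
    d_rot.append(np.linalg.norm(R @ W - W0))

print(d_sgd, np.median(d_rot))   # SGD's endpoint stays close to where it started
```

Every rotated matrix R @ W produces the same predictions as W, so all of these are equally "perfect" solutions; SGD nevertheless lands at one that is unusually close to W0.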

4. The "Hill and Valley" Map

The researchers also looked at the "terrain" of the problem using a tool called the Hessian (which measures the steepness and curvature of the ground).

  • The Plateau: They found that the "foggy meadow" the robot gets stuck in isn't just flat; it's a saddle. It's flat in some directions (where the robot wanders) but has a hidden "downhill" slope in other directions that eventually lets the robot escape.
  • The Lake: The "lake" of perfect solutions isn't a deep pit. It's a marginal minimum. It's flat along the surface of the lake (because you can rotate the knobs without changing the taste), but if you try to step off the lake, you immediately go uphill.
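The "flat along the lake, uphill off it" picture can be probed directly. At an exact zero-loss point every per-sample error is zero, so the Hessian of the empirical loss reduces to an average of outer products of output gradients: it is positive semi-definite (no downhill direction), with exact zero eigenvalues along the flat directions. In the toy instance below (a p = 3 student matching a rank-2 teacher via a zero-padded solution, my construction rather than the paper's), one expects 5 zero modes: 3 tangent to the rotation orbit of the solution, plus 2 marginally flat directions of the unused third unit.

```python
import numpy as np

rng = np.random.default_rng(4)

d, r, p = 4, 2, 3
T = rng.normal(size=(r, d))
# an exact zero-loss student: pad the teacher with a zero row, so W^T W = T^T T
W = np.vstack([T, np.zeros((p - r, d))])

n = 500
X = rng.normal(size=(n, d))

# at a zero-loss point the per-sample error vanishes, so the Hessian of the
# empirical loss is the average outer product of per-sample output gradients
H = np.zeros((p * d, p * d))
for x in X:
    g = (2.0 * np.outer(W @ x, x)).ravel()   # d f / d vec(W), with f = ||W x||^2
    H += np.outer(g, g) / n

eigs = np.sort(np.linalg.eigvalsh(H))
print(eigs[:6])
# the 5 smallest eigenvalues are ~0 (flat directions); the 6th is strictly positive
```

No eigenvalue is negative: stepping off the lake never goes downhill, which is what makes this a marginal minimum rather than a saddle or a strict pit.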

Summary of Key Takeaways

  1. More isn't always faster: Giving the student network more capacity (more neurons) doesn't magically solve the "stuck in the fog" problem. It only modestly shortens the plateau; the escape time is governed mainly by the structure of the teacher, not the size of the student.
  2. Infinite Solutions: When the student is bigger than the teacher, there isn't just one right answer. There is a whole continuous family of perfect answers (a manifold).
  3. Initialization is Destiny: The specific random way the robot starts determines exactly which perfect solution it will find. The learning algorithm acts like a magnet, pulling the robot to the closest possible solution to its starting point.
  4. Symmetry Rules: The reason there are so many solutions is due to a hidden symmetry (rotational invariance). The math allows the robot to spin its internal gears in different ways without changing the final output.

In a nutshell: This paper explains that in complex learning scenarios, the path you take is less about the destination and more about where you started. The learning algorithm doesn't search for the "best" solution in a global sense; it simply finds the "closest" solution to your starting point, and it gets stuck in a flat fog for a while before it finally finds the exit.
