Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

This paper demonstrates that the implicit bias of per-sample Adam on separable data can deviate from the full-batch \ell_\infty-max-margin behavior, potentially converging to the \ell_2-max-margin classifier or a data-adaptive Mahalanobis-norm margin depending on the dataset, whereas Signum consistently converges to the \ell_\infty-max-margin classifier regardless of batch size.

Beomhan Baek, Minhak Song, Chulhee Yun

Published 2026-03-05

Imagine you are trying to teach a robot to separate red marbles from blue marbles on a table. You want the robot to draw a line (a decision boundary) that keeps the reds on one side and the blues on the other.

In the world of machine learning, there are many "teachers" (optimizers) that guide the robot. The most famous teacher is Adam. For years, researchers thought Adam had a very specific personality: it always preferred to draw a line that was "square" or "boxy" (mathematically, an \ell_\infty-max-margin solution). It was like Adam always wanted the line to be parallel to the walls of the room, regardless of where the marbles were actually sitting.

However, this paper reveals a surprising twist: Adam's personality changes depending on how you feed it data.

Here is the breakdown of the discovery using simple analogies:

1. The Two Ways of Teaching (Full-Batch vs. Mini-Batch)

Imagine you are showing the robot the marbles.

  • Full-Batch (The Old Way): You show the robot all the marbles at once, calculate the average direction, and then take one step.
    • Result: The robot behaves exactly as expected. It draws that "boxy," square line. It ignores the specific arrangement of the marbles and sticks to its rigid, geometric preference.
  • Mini-Batch / Incremental (The Modern Way): You show the robot one marble at a time (or a tiny handful), and it takes a step immediately after seeing each one. This is how modern AI is actually trained.
    • Result: Adam forgets its rigid personality. Instead of drawing a boxy line, it starts drawing a line that fits the specific shape of the marbles on the table. Sometimes it draws a smooth, round line (an \ell_2-max-margin solution), and sometimes it draws something in between (a data-adaptive Mahalanobis-norm margin).

The Big Discovery: The paper proves that when Adam learns one sample at a time, it stops being "boxy" and becomes "data-dependent." It adapts its shape to the specific problem it is solving, rather than sticking to a rigid rule.
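The "boxy" versus "round" distinction can be made concrete with a small experiment. The dataset and step sizes below are illustrative choices, not from the paper; the experiment relies only on the classical result that, on separable data with logistic loss, plain gradient descent converges in direction to the \ell_2-max-margin classifier, while sign gradient descent tracks the \ell_\infty ("boxy") one:

```python
import numpy as np

# Illustrative toy data (not from the paper): one sample per class.
X = np.array([[3.0, 1.0], [-3.0, -1.0]])
y = np.array([1.0, -1.0])

def grad(w):
    """Gradient of the logistic loss sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return X.T @ (-y / (1.0 + np.exp(margins)))

# Plain gradient descent: direction converges to the l2 ("round") max margin.
w_gd = np.zeros(2)
for _ in range(2000):
    w_gd -= 0.1 * grad(w_gd)

# Sign gradient descent: direction converges to the l_inf ("boxy") max margin.
w_sign = np.zeros(2)
for _ in range(2000):
    w_sign -= 0.01 * np.sign(grad(w_sign))

print(w_gd / np.linalg.norm(w_gd))      # ~ (3, 1)/sqrt(10): round, data-shaped
print(w_sign / np.linalg.norm(w_sign))  # ~ (1, 1)/sqrt(2):  boxy, axis-aligned
```

The two optimizers see the exact same marbles, yet settle on visibly different lines: one aligned with the data, one aligned with the axes of the room.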

2. The "Echo Chamber" Analogy

Why does this happen? Think of Adam as a hiker with a compass that has a slight delay (called momentum).

  • In Full-Batch: The hiker looks at the whole mountain range at once. The compass averages out all the noise, and the hiker walks in a straight, predictable direction (the "boxy" direction).
  • In Mini-Batch: The hiker looks at one rock at a time. Because the compass has a delay, the hiker's path gets "stuck" in a loop of echoes. The path the hiker takes depends entirely on the specific order and shape of the rocks they stepped on first. The "boxy" preference gets washed away by the specific details of the terrain.
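The "one rock at a time" dynamic can be sketched as incremental Adam: one Adam update per sample, cycling through the data in order. Everything below (dataset, hyperparameters) is an illustrative assumption, not the paper's setup; the point is only the update rule, in which the delayed moment estimates m and v carry an "echo" of recently seen samples into every step:

```python
import numpy as np

# Illustrative separable toy data (not from the paper).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def sample_grad(w, i):
    """Logistic-loss gradient for the single sample (X[i], y[i])."""
    m = y[i] * (X[i] @ w)
    return -y[i] * X[i] / (1.0 + np.exp(m))

# Incremental (per-sample) Adam: one update per sample, cycling in order.
w = np.zeros(2)
m_t = np.zeros(2)          # first-moment estimate (the "delayed compass")
v_t = np.zeros(2)          # second-moment estimate
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(4000):
    i = t % len(X)         # which sample ("rock") we look at this step
    g = sample_grad(w, i)
    m_t = beta1 * m_t + (1 - beta1) * g
    v_t = beta2 * v_t + (1 - beta2) * g**2
    w -= lr * m_t / (np.sqrt(v_t) + eps)

print(w / np.linalg.norm(w))   # final direction depends on the data's shape
print(y * (X @ w))             # all margins positive: the data is separated
```

Because m_t and v_t are built sample by sample, each step is shaped by which samples were seen most recently, which is exactly the mechanism the paper identifies as washing away the full-batch "boxy" preference.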

3. The "Signum" Counter-Example

The authors didn't just stop at Adam. They tested a different teacher called Signum (which, like Adam, keeps a running average of gradients, but uses only the sign of each coordinate of that average, discarding the step size).

  • The Finding: Signum is stubborn. No matter if you show it all the marbles at once or one by one, it always draws that "boxy" line.
  • The Lesson: This proves that the "boxy" behavior isn't just a rule of the math; it's a specific quirk of how Adam handles its momentum. Signum is immune to the "mini-batch" confusion that changes Adam's mind.

4. Why Should You Care?

You might ask, "Does it matter if the line is boxy or round?"

Yes! In the real world, the shape of the line determines how well the AI works on new, unseen data (generalization).

  • If Adam is "boxy," it might be great for certain types of data (like text in language models) because it handles noise well.
  • But if Adam is "round" or "data-dependent" (because we are training it one sample at a time), it might behave differently than we expect.

The Takeaway:
For a long time, scientists thought Adam always had a "boxy" bias. This paper says: "Not so fast!" If you train Adam the modern way (one sample at a time), it becomes a chameleon. It changes its shape to fit the data.

This is a crucial warning for AI researchers: You cannot assume Adam will behave the same way in a small experiment (full-batch) as it will in a massive real-world training run (mini-batch). The "recipe" you use to feed the data changes the "taste" of the final model.

Summary in One Sentence

While the popular optimizer Adam used to be thought of as a rigid, box-drawing robot, this paper shows that when trained one step at a time, it actually becomes a flexible shapeshifter that molds itself to the specific data it sees, unlike its stubborn cousin Signum which stays rigid no matter what.