Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

This paper demonstrates that the implicit bias of per-sample Adam on separable data can deviate from the full-batch \ell_\infty-max-margin behavior, potentially converging to the \ell_2-max-margin classifier or a data-adaptive Mahalanobis-norm margin depending on the dataset, whereas Signum consistently converges to the \ell_\infty-max-margin classifier regardless of batch size.

Beomhan Baek, Minhak Song, Chulhee Yun

Published 2026-03-05

Imagine you are trying to teach a robot to separate red marbles from blue marbles on a table. You want the robot to draw a line (a decision boundary) that keeps the reds on one side and the blues on the other.

In the world of machine learning, there are many "teachers" (optimizers) that guide the robot. The most famous teacher is Adam. For years, researchers thought Adam had a very specific personality: it always preferred to draw a line that was "square" or "boxy" (mathematically, an \ell_\infty-max-margin solution). It was like Adam always wanted the line to be parallel to the walls of the room, regardless of where the marbles were actually sitting.

However, this paper reveals a surprising twist: Adam's personality changes depending on how you feed it data.

Here is the breakdown of the discovery using simple analogies:

1. The Two Ways of Teaching (Full-Batch vs. Mini-Batch)

Imagine you are showing the robot the marbles.

  • Full-Batch (The Old Way): You show the robot all the marbles at once, calculate the average direction, and then take one step.
    • Result: The robot behaves exactly as expected. It draws that "boxy," square line. It ignores the specific arrangement of the marbles and sticks to its rigid, geometric preference.
  • Mini-Batch / Incremental (The Modern Way): You show the robot one marble at a time (or a tiny handful), and it takes a step immediately after seeing each one. This is how modern AI is actually trained.
    • Result: Adam forgets its rigid personality. Instead of drawing a boxy line, it starts drawing a line that fits the specific shape of the marbles on the table. Sometimes it draws a smooth, round line (an \ell_2-max-margin solution), and sometimes it draws something in between (a data-adaptive Mahalanobis-norm margin).

The Big Discovery: The paper proves that when Adam learns one sample at a time, it stops being "boxy" and becomes "data-dependent." It adapts its shape to the specific problem it is solving, rather than sticking to a rigid rule.
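The "boxy" versus "round" distinction can be made concrete with a small experiment. The dataset and step sizes below are illustrative choices, not from the paper; the experiment relies only on the classical result that, on separable data with logistic loss, plain gradient descent converges in direction to the \ell_2-max-margin classifier, while sign gradient descent tracks the \ell_\infty ("boxy") one:

```python
import numpy as np

# Illustrative toy data (not from the paper): one sample per class.
X = np.array([[3.0, 1.0], [-3.0, -1.0]])
y = np.array([1.0, -1.0])

def grad(w):
    """Gradient of the logistic loss sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return X.T @ (-y / (1.0 + np.exp(margins)))

# Plain gradient descent: direction converges to the l2 ("round") max margin.
w_gd = np.zeros(2)
for _ in range(2000):
    w_gd -= 0.1 * grad(w_gd)

# Sign gradient descent: direction converges to the l_inf ("boxy") max margin.
w_sign = np.zeros(2)
for _ in range(2000):
    w_sign -= 0.01 * np.sign(grad(w_sign))

print(w_gd / np.linalg.norm(w_gd))      # ~ (3, 1)/sqrt(10): round, data-shaped
print(w_sign / np.linalg.norm(w_sign))  # ~ (1, 1)/sqrt(2):  boxy, axis-aligned
```

The two optimizers see the exact same marbles, yet settle on visibly different lines: one aligned with the data, one aligned with the axes of the room.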

2. The "Echo Chamber" Analogy

Why does this happen? Think of Adam as a hiker with a compass that has a slight delay (called momentum).

  • In Full-Batch: The hiker looks at the whole mountain range at once. The compass averages out all the noise, and the hiker walks in a straight, predictable direction (the "boxy" direction).
  • In Mini-Batch: The hiker looks at one rock at a time. Because the compass has a delay, the hiker's path gets "stuck" in a loop of echoes. The path the hiker takes depends entirely on the specific order and shape of the rocks they stepped on first. The "boxy" preference gets washed away by the specific details of the terrain.
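The "one rock at a time" dynamic can be sketched as incremental Adam: one Adam update per sample, cycling through the data in order. Everything below (dataset, hyperparameters) is an illustrative assumption, not the paper's setup; the point is only the update rule, in which the delayed moment estimates m and v carry an "echo" of recently seen samples into every step:

```python
import numpy as np

# Illustrative separable toy data (not from the paper).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def sample_grad(w, i):
    """Logistic-loss gradient for the single sample (X[i], y[i])."""
    m = y[i] * (X[i] @ w)
    return -y[i] * X[i] / (1.0 + np.exp(m))

# Incremental (per-sample) Adam: one update per sample, cycling in order.
w = np.zeros(2)
m_t = np.zeros(2)          # first-moment estimate (the "delayed compass")
v_t = np.zeros(2)          # second-moment estimate
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(4000):
    i = t % len(X)         # which sample ("rock") we look at this step
    g = sample_grad(w, i)
    m_t = beta1 * m_t + (1 - beta1) * g
    v_t = beta2 * v_t + (1 - beta2) * g**2
    w -= lr * m_t / (np.sqrt(v_t) + eps)

print(w / np.linalg.norm(w))   # final direction depends on the data's shape
print(y * (X @ w))             # all margins positive: the data is separated
```

Because m_t and v_t are built sample by sample, each step is shaped by which samples were seen most recently, which is exactly the mechanism the paper identifies as washing away the full-batch "boxy" preference.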

3. The "Signum" Counter-Example

The authors didn't just stop at Adam. They tested a different teacher called Signum (which, like Adam, keeps a running average of gradients, but uses only the sign of each coordinate of that average, discarding the step size).

  • The Finding: Signum is stubborn. No matter if you show it all the marbles at once or one by one, it always draws that "boxy" line.
  • The Lesson: This proves that the "boxy" behavior isn't just a rule of the math; it's a specific quirk of how Adam handles its momentum. Signum is immune to the "mini-batch" confusion that changes Adam's mind.

4. Why Should You Care?

You might ask, "Does it matter if the line is boxy or round?"

Yes! In the real world, the shape of the line determines how well the AI works on new, unseen data (generalization).

  • If Adam is "boxy," it might be great for certain types of data (like text in language models) because it handles noise well.
  • But if Adam is "round" or "data-dependent" (because we are training it one sample at a time), it might behave differently than we expect.

The Takeaway:
For a long time, scientists thought Adam always had a "boxy" bias. This paper says: "Not so fast!" If you train Adam the modern way (one sample at a time), it becomes a chameleon. It changes its shape to fit the data.

This is a crucial warning for AI researchers: You cannot assume Adam will behave the same way in a small experiment (full-batch) as it will in a massive real-world training run (mini-batch). The "recipe" you use to feed the data changes the "taste" of the final model.

Summary in One Sentence

While the popular optimizer Adam used to be thought of as a rigid, box-drawing robot, this paper shows that when trained one step at a time, it actually becomes a flexible shapeshifter that molds itself to the specific data it sees, unlike its stubborn cousin Signum which stays rigid no matter what.