Per-example gradients: a new frontier for understanding and improving optimizers

This paper demonstrates that per-example gradients can be efficiently computed with minimal overhead in modern deep learning frameworks, enabling new insights into optimizer design such as the optimal placement of sign operations in signSGD and the superiority of mean-dominated preconditioners over variance-dominated ones in Adam.

Vincent Roulet, Atish Agarwala

Published 2026-03-03

The Big Idea: Looking at the Class, Not Just the Average

Imagine you are a teacher grading a class of 30 students.

  • The Old Way (Standard AI): You collect all 30 tests, calculate the average score, and then give the whole class a single piece of advice: "Everyone, study a bit harder on algebra." You throw away the individual tests. You don't know who struggled with fractions or who aced geometry; you only know the class average.
  • The New Way (This Paper): The authors say, "Wait a minute! What if we kept every single test? What if we looked at the distribution of scores? Maybe we could see that while the average is okay, half the class is failing algebra while the other half is bored."

In Deep Learning, the "tests" are the data samples, and the "score" is the gradient (the direction the AI needs to move to get smarter). Usually, AI algorithms throw away the individual directions and only keep the average. This paper argues that keeping the individual directions (per-example gradients) is actually cheap and easy to do, and it unlocks a treasure trove of new ways to make AI train faster and more stably.


1. The Myth: "It's Too Expensive"

For a long time, researchers believed that saving every individual test score was like trying to carry 30 suitcases when you only need to carry one. They thought it would take too much memory (RAM) and too much time.

The Reality Check:
The authors discovered that for modern AI models (like the ones that write text or recognize images), the "suitcases" are actually already being carried by the computer for other reasons.

  • The Analogy: Imagine a factory assembly line. The workers (the computer) are already holding the raw materials (activations) for every single product on the line to build the final item. The authors realized they could just take a quick snapshot of the materials before they are mixed together, without needing to buy new shelves or stop the line.
  • The Result: They showed that with modern tools (like JAX, a Python library for machine learning), you can look at every single data point's contribution at almost zero extra cost. It's like having a superpower to see the details without paying extra for the ticket.
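This "snapshot" is easy to express in JAX. The following is a minimal sketch (not the paper's code): `jax.vmap` maps a single-example gradient function over the batch, so you get one gradient per example instead of only their average. The linear model and the data here are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Squared error of a simple linear model on ONE example.
    return (jnp.dot(w, x) - y) ** 2

# Gradient w.r.t. w for a single example, vectorized over the batch
# axis of (x, y). The weights w are shared, so in_axes=None for them.
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))

w = jnp.array([1.0, -2.0])
xs = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = jnp.array([0.5, 0.0, 1.0])

g = per_example_grads(w, xs, ys)  # shape (3, 2): one gradient per example
g_mean = g.mean(axis=0)           # the usual averaged gradient
```

The averaged gradient is recovered by taking the mean over the batch axis, so nothing is lost: you simply keep the individual "tests" around long enough to inspect them.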

2. The Experiment: Two New Ways to Teach

Once they could see the individual data points, they tried two new ways to teach the AI, comparing them to the old methods.

Experiment A: The "Sign" of the Gradient (signSGD)

Imagine the AI is trying to find the bottom of a valley in the dark.

  • The Old Method: The AI asks the whole class, "Which way is down?" The class shouts, "Down!" (The average). The AI takes a step.
  • The New Question: Does it matter when we ask the question?
    • Option 1: Ask every student individually, "Is it down?", get 30 "Yes/No" answers, average them, then take a step.
    • Option 2: Let all the students shout their answers at once, average them together, and then ask, "Is the average answer 'Down'?"

The Finding: The authors found that Option 2 is much better.

  • The Metaphor: If you ask one person in a noisy crowd, "Is it down?", they might be wrong or confused (noise). If you let the whole crowd answer and average their voices, the noise cancels out and you get a clear signal. But if you reduce each person's answer to a bare Yes/No before averaging, you throw away how strongly each one answered, and a few confident voices can be outvoted by many weak, noisy ones.
  • Conclusion: You should let the AI average all the data first to get a clear signal, and then simplify the direction. Doing it the other way around makes the AI stumble.
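The two orderings above can be contrasted in a toy sketch (an assumed setup for illustration, not the paper's experiment). Here one confident example points the true way, while many weakly noisy examples point the other way:

```python
import jax.numpy as jnp

# Per-example gradients for one parameter. The true direction is negative
# (the mean is below zero), but noise flips the sign of most individual
# examples.
g = jnp.array([-10.0, 1.0, 1.0, 1.0])

# Option 1: take the sign of each example first, then average
# (a majority vote that ignores magnitudes).
sign_then_avg = jnp.sign(g).mean()   # -> 0.5, pointing the WRONG way

# Option 2: average first, then take the sign of the mean.
avg_then_sign = jnp.sign(g.mean())   # -> -1.0, following the true mean
```

Because Option 1 discards each example's magnitude before averaging, the single strong (and correct) example is outvoted; averaging first lets magnitudes cancel the noise before the sign is taken.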

Experiment B: The "Preconditioner" (Adam)

This is the most surprising part. The AI uses a "preconditioner" (a fancy steering wheel) to decide how big of a step to take.

  • The Old Wisdom: The steering wheel is usually tuned based on how much the students' answers vary (the variance). If everyone agrees, take a big step. If they disagree, take a small, cautious step.
  • The New Discovery: The authors looked at the individual data and realized the old wisdom was slightly wrong.
    • They found that the AI actually learns best when the steering wheel is tuned based on the strength of the average answer (the squared mean), not on how much the individual answers disagree (the variance).
    • The Metaphor: Imagine driving a car. The old rule was: "If the road is bumpy (high variance), slow down." The new rule is: "If the road is generally strong and solid (high mean squared), you can drive fast, even if there are some bumps."
    • The Result: They built a new version of the popular "Adam" optimizer that focuses on the strength of the signal rather than the noise. This new version trained slightly faster and more stably than the standard version.

3. Why This Matters

This paper is a "toolkit" paper. It doesn't just invent one new AI; it invents a new way of looking at the problem.

  • Before: We thought looking at individual data points was too expensive, so we only looked at the average. We were flying blind, only seeing the horizon.
  • Now: We have a high-resolution map. We can see the bumps and the smooth roads individually.
  • The Future: Because we can now easily see these details, we can design better algorithms. We can stop guessing how to tune our AI and start engineering it based on the actual behavior of the data.

Summary in One Sentence

The authors showed that it is surprisingly cheap to look at every single piece of data an AI learns from. Doing so reveals that we have been driving our AI cars with the wrong steering rules, focusing on the "noise" instead of the "signal", and fixing this makes them train faster and more stably.
