Per-example gradients: a new frontier for understanding and improving optimizers

This paper demonstrates that per-example gradients can be efficiently computed with minimal overhead in modern deep learning frameworks, enabling new insights into optimizer design such as the optimal placement of sign operations in signSGD and the superiority of mean-dominated preconditioners over variance-dominated ones in Adam.

Vincent Roulet, Atish Agarwala

Published 2026-03-03

The Big Idea: Looking at the Class, Not Just the Average

Imagine you are a teacher grading a class of 30 students.

  • The Old Way (Standard AI): You collect all 30 tests, calculate the average score, and then give the whole class a single piece of advice: "Everyone, study a bit harder on algebra." You throw away the individual tests. You don't know who struggled with fractions or who aced geometry; you only know the class average.
  • The New Way (This Paper): The authors say, "Wait a minute! What if we kept every single test? What if we looked at the distribution of scores? Maybe we could see that while the average is okay, half the class is failing algebra while the other half is bored."

In Deep Learning, the "tests" are the data samples, and the "score" is the gradient (the direction the AI needs to move to get smarter). Usually, AI algorithms throw away the individual directions and only keep the average. This paper argues that keeping the individual directions (per-example gradients) is actually cheap and easy to do, and it unlocks a treasure trove of new ways to make AI train faster and more stably.


1. The Myth: "It's Too Expensive"

For a long time, researchers believed that saving every individual test score was like trying to carry 30 suitcases when you only need to carry one. They thought it would take too much memory (RAM) and too much time.

The Reality Check:
The authors discovered that for modern AI models (like the ones that write text or recognize images), the "suitcases" are actually already being carried by the computer for other reasons.

  • The Analogy: Imagine a factory assembly line. The workers (the computer) are already holding the raw materials (activations) for every single product on the line to build the final item. The authors realized they could just take a quick snapshot of the materials before they are mixed together, without needing to buy new shelves or stop the line.
  • The Result: They showed that with modern tools (like JAX, a Python library for machine learning), you can look at every single data point's contribution at almost zero extra cost. It's like having a superpower to see the details without paying extra for the ticket.
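This "snapshot" is easy to express in JAX. The following is a minimal sketch (not the paper's code): `jax.vmap` maps a single-example gradient function over the batch, so you get one gradient per example instead of only their average. The linear model and the data here are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Squared error of a simple linear model on ONE example.
    return (jnp.dot(w, x) - y) ** 2

# Gradient w.r.t. w for a single example, vectorized over the batch
# axis of (x, y). The weights w are shared, so in_axes=None for them.
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))

w = jnp.array([1.0, -2.0])
xs = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = jnp.array([0.5, 0.0, 1.0])

g = per_example_grads(w, xs, ys)  # shape (3, 2): one gradient per example
g_mean = g.mean(axis=0)           # the usual averaged gradient
```

The averaged gradient is recovered by taking the mean over the batch axis, so nothing is lost: you simply keep the individual "tests" around long enough to inspect them.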

2. The Experiment: Two New Ways to Teach

Once they could see the individual data points, they tried two new ways to teach the AI, comparing them to the old methods.

Experiment A: The "Sign" of the Gradient (signSGD)

Imagine the AI is trying to find the bottom of a valley in the dark.

  • The Old Method: The AI asks the whole class, "Which way is down?" The class shouts, "Down!" (The average). The AI takes a step.
  • The New Question: Does it matter when we ask the question?
    • Option 1: Ask every student individually, "Is it down?", get 30 "Yes/No" answers, average them, then take a step.
    • Option 2: Let all the students shout their answers at once, average them together, and then ask, "Is the average answer 'Down'?"

The Finding: The authors found that Option 2 is much better.

  • The Metaphor: If you ask one person in a noisy crowd, "Is it down?", they might be wrong or confused (noise). If you let the whole crowd answer and average their voices, the noise cancels out and you get a clear signal. But if you reduce each person's answer to a bare Yes/No before averaging, you throw away how strongly each one answered, and a few confident voices can be outvoted by many weak, noisy ones.
  • Conclusion: You should let the AI average all the data first to get a clear signal, and then simplify the direction. Doing it the other way around makes the AI stumble.
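The two orderings above can be contrasted in a toy sketch (an assumed setup for illustration, not the paper's experiment). Here one confident example points the true way, while many weakly noisy examples point the other way:

```python
import jax.numpy as jnp

# Per-example gradients for one parameter. The true direction is negative
# (the mean is below zero), but noise flips the sign of most individual
# examples.
g = jnp.array([-10.0, 1.0, 1.0, 1.0])

# Option 1: take the sign of each example first, then average
# (a majority vote that ignores magnitudes).
sign_then_avg = jnp.sign(g).mean()   # -> 0.5, pointing the WRONG way

# Option 2: average first, then take the sign of the mean.
avg_then_sign = jnp.sign(g.mean())   # -> -1.0, following the true mean
```

Because Option 1 discards each example's magnitude before averaging, the single strong (and correct) example is outvoted; averaging first lets magnitudes cancel the noise before the sign is taken.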

Experiment B: The "Preconditioner" (Adam)

This is the most surprising part. The AI uses a "preconditioner" (a fancy steering wheel) to decide how big of a step to take.

  • The Old Wisdom: The steering wheel is usually tuned based on how much the students' answers vary (the variance). If everyone agrees, take a big step. If they disagree, take a small, cautious step.
  • The New Discovery: The authors looked at the individual data and realized the old wisdom was slightly wrong.
    • They found that the AI actually learns best when the steering wheel is tuned based on the strength of the average answer (the squared mean), not on how much the individual answers disagree (the variance).
    • The Metaphor: Imagine driving a car. The old rule was: "If the road is bumpy (high variance), slow down." The new rule is: "If the road is generally strong and solid (high mean squared), you can drive fast, even if there are some bumps."
    • The Result: They built a new version of the popular "Adam" optimizer that focuses on the strength of the signal rather than the noise. This new version trained slightly faster and more stably than the standard version.

3. Why This Matters

This paper is a "toolkit" paper. It doesn't just invent one new AI; it invents a new way of looking at the problem.

  • Before: We thought looking at individual data points was too expensive, so we only looked at the average. We were flying blind, only seeing the horizon.
  • Now: We have a high-resolution map. We can see the bumps and the smooth roads individually.
  • The Future: Because we can now easily see these details, we can design better algorithms. We can stop guessing how to tune our AI and start engineering it based on the actual behavior of the data.

Summary in One Sentence

The authors showed that it is surprisingly cheap to look at every single piece of data an AI learns from. Doing so reveals that we have been driving our AI cars with the wrong steering rules, focusing on the "noise" instead of the "signal", and fixing this makes them train faster and more stably.
