Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

The paper proposes Generalized Primal Averaging (GPA), a memory-efficient optimizer that unifies and improves upon DiLoCo and Schedule-Free. By decoupling Nesterov's interpolation constants, GPA enables smooth step-wise averaging, achieving faster convergence and lower memory overhead across a range of LLM and vision model training tasks.

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Published 2026-03-02

Imagine you are trying to teach a giant, super-smart robot (a Large Language Model) how to write, code, and reason. The robot learns by reading millions of books and making guesses, then correcting its mistakes. The "optimizer" is the teacher guiding this process, deciding how big of a step the robot should take to get better.

For a long time, the standard teacher was AdamW. It's reliable, but sometimes it's a bit slow or gets stuck in local ruts.

Recently, a new teacher called DiLoCo became popular. It's like a teacher who says, "Let's take a bunch of tiny steps to figure out the direction, then take one giant, confident leap forward." This works great, but it has a weird quirk: it has to stop, calculate, and reset its memory every few steps. It's like a runner who has to stop at every mile marker to check a map before running again. It works, but it's clunky and requires a lot of mental energy (memory) to keep track of all those stops.
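The stop-and-go pattern can be made concrete with a toy sketch. This is illustrative only: the real DiLoCo uses AdamW for the inner steps and Nesterov momentum for the outer step, while this sketch uses plain SGD and heavy-ball momentum; all names and constants here are made up for illustration.

```python
import numpy as np

def diloco_style_sgd(grad, w0, inner_steps=30, outer_rounds=10,
                     inner_lr=0.1, outer_lr=0.5, outer_momentum=0.5):
    """Toy sketch of the stop-and-go pattern: many tiny inner steps,
    then one big momentum-driven outer leap. Illustrative only -- the
    real DiLoCo uses AdamW inner steps and Nesterov outer momentum."""
    w = np.asarray(w0, dtype=float).copy()
    outer_buf = np.zeros_like(w)        # outer momentum: extra state to store
    for _ in range(outer_rounds):
        w_start = w.copy()              # snapshot: a second extra copy of the weights
        for _ in range(inner_steps):    # "a bunch of tiny steps"
            w = w - inner_lr * grad(w)
        pseudo_grad = w_start - w       # net direction the inner run discovered
        outer_buf = outer_momentum * outer_buf + pseudo_grad
        w = w_start - outer_lr * outer_buf   # "one giant, confident leap"
    return w
```

Note the two extra full-size buffers (`outer_buf` and the `w_start` snapshot): that is the "mental energy" the analogy refers to.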

Another new teacher, Schedule-Free, tried to fix this by averaging the robot's past positions to smooth out the path. But it used a "uniform average," which is like giving equal weight to the robot's position from 10 years ago and its position 1 second ago. That doesn't always make sense for a fast-moving robot.

Enter GPA: The "Smooth Operator"

The authors of this paper propose a new teacher called Generalized Primal Averaging (GPA).

Think of GPA as the perfect blend of the previous two methods, but with a few clever upgrades:

  1. The "Smooth" Leap:
    Imagine you are driving a car.

    • Old DiLoCo is like driving, stopping every 30 seconds to calculate the perfect turn, then jerking the wheel. It's effective but choppy.
    • GPA is like having a "smart cruise control" that constantly adjusts the steering wheel smoothly while you drive. It doesn't stop to calculate; it just gently nudges the car in the right direction at every single moment.
  2. The "Decoupled" Steering Wheel:
    The magic of GPA is that it separates two things that were previously tied together:

    • Where you look (Gradient): Where the robot is looking to see what's wrong.
    • Where you go (Update): Where the robot actually moves.
      In older methods, these were locked together. If you wanted to look further ahead, you had to move differently. GPA uses two separate "knobs" (parameters). You can turn one knob to look further ahead without messing up the other knob that controls how you move. This gives the teacher much more flexibility to find the fastest path.
  3. The "Exponential" Memory:
    Instead of remembering the past equally (like Schedule-Free's uniform average), GPA uses an exponential moving average (EMA).

    • Analogy: Imagine you are learning a song.
      • Uniform Average: You remember a note you played 100 notes ago just as clearly as the one you played 1 second ago.
      • GPA (Exponential): You remember the note you played 1 second ago very clearly, the one from 2 seconds ago a little less, and the one from 100 seconds ago is fuzzy. This makes sense because the recent mistakes are usually more important for correcting your current path.
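The difference between the two kinds of memory shows up in the weight each scheme assigns to past positions. A minimal sketch (`beta` is an illustrative smoothing constant, not a value from the paper):

```python
def uniform_weights(n):
    """Running mean: every one of the last n positions gets weight 1/n."""
    return [1.0 / n for _ in range(n)]

def ema_weights(n, beta=0.8):
    """EMA unrolled: the position k steps in the past gets weight
    (1 - beta) * beta**k, so recent positions dominate; the leftover
    mass beta**n sits on the starting point."""
    w = [(1.0 - beta) * beta**k for k in range(n)]
    w[-1] += beta**n   # fold the initialization's mass into the oldest slot
    return w
```

With `n=20`, the uniform scheme gives every position weight 0.05, while the EMA gives the most recent position weight 0.2 and the weights decay geometrically from there, matching the "fuzzy old notes" analogy.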
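Putting the three ideas together, the general shape of one training step can be sketched as follows. This is a hedged reconstruction from the description above, not the paper's exact algorithm: `c_interp` and `beta_avg` stand in for the two decoupled "knobs", and all names and values are illustrative.

```python
import numpy as np

def gpa_style_step(x, z, grad, lr=0.1, c_interp=0.9, beta_avg=0.98):
    """One GPA-flavoured step (illustrative sketch, not the paper's
    exact algorithm). Two separate knobs:
      c_interp -- where we *look*: the interpolation point for the gradient
      beta_avg -- how the *move* is smoothed: EMA weight on recent iterates
    In Schedule-Free these two roles are tied together; here each knob
    turns independently."""
    y = (1.0 - c_interp) * x + c_interp * z   # gradient evaluation point
    z = z - lr * grad(y)                      # fast "base" iterate
    x = beta_avg * x + (1.0 - beta_avg) * z   # smooth averaged iterate (the model you keep)
    return x, z
```

Because the average `x` is nudged a little at every step, there is no stop-and-reset moment: the "giant leap" is spread smoothly across all the tiny steps.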

Why Does This Matter?

The paper shows that GPA is faster, cheaper, and easier to use than the previous best methods.

  • Faster Training: In tests with different-sized AI models (from small to huge), GPA reached the same level of intelligence in fewer steps.

    • For a small model, it was about 8.7% faster.
    • For a medium model, it was 10% faster.
    • For a huge model, it was 9.6% faster.
    • Real-world impact: If training a model usually takes 100 days, GPA might get it done in 90 days, saving millions of dollars in electricity and computer time.
  • Less Memory: DiLoCo needed to store extra copies of the robot's brain to do its "stop-and-check" routine. GPA is more efficient; it needs less memory, which means you can train bigger models on the same hardware.

  • Simpler Tuning: DiLoCo had a lot of knobs to turn (how many steps to take, how fast to move, etc.). GPA has fewer knobs, making it easier for engineers to use without needing a PhD in math to get it working.
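The "real-world impact" figure above is just this arithmetic, under the assumption that time per step is unchanged (the 100-day baseline is a hypothetical round number, not a measurement from the paper):

```python
def projected_days(baseline_days, step_reduction):
    """Wall-clock days if the run needs `step_reduction` (e.g. 0.10 = 10%)
    fewer steps to reach the same loss, with time per step held fixed."""
    return baseline_days * (1.0 - step_reduction)

# The speedups reported above, applied to a hypothetical 100-day run:
projections = {name: projected_days(100, r)
               for name, r in [("small", 0.087), ("medium", 0.10), ("large", 0.096)]}
```

A 10% step reduction on a 100-day run lands at 90 days, matching the figure in the text.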

The Bottom Line

The authors took a complex, stop-and-go training method (DiLoCo) and a slightly rigid averaging method (Schedule-Free) and fused them into a smooth, continuous, and highly efficient training process.

It's like upgrading from a runner who has to stop and check a map every few blocks to a runner with a GPS that gently guides them around every corner in real-time. The result? The AI learns faster, uses less energy, and gets smarter sooner.
