Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

The paper proposes Generalized Primal Averaging (GPA), a memory-efficient optimizer that unifies and improves upon DiLoCo and Schedule-Free. By decoupling Nesterov's interpolation constants, GPA enables smooth step-wise averaging, achieving faster convergence and lower memory overhead across a range of LLM and vision model training tasks.

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Published 2026-03-02

Imagine you are trying to teach a giant, super-smart robot (a Large Language Model) how to write, code, and reason. The robot learns by reading millions of books and making guesses, then correcting its mistakes. The "optimizer" is the teacher guiding this process, deciding how big of a step the robot should take to get better.

For a long time, the standard teacher was AdamW. It's reliable, but sometimes it's a bit slow or gets stuck in local ruts.

Recently, a new teacher called DiLoCo became popular. It's like a teacher who says, "Let's take a bunch of tiny steps to figure out the direction, then take one giant, confident leap forward." This works great, but it has a weird quirk: it has to stop, calculate, and reset its memory every few steps. It's like a runner who has to stop at every mile marker to check a map before running again. It works, but it's clunky and requires a lot of mental energy (memory) to keep track of all those stops.
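The stop-and-go pattern can be made concrete with a toy sketch. This is illustrative only: the real DiLoCo uses AdamW for the inner steps and Nesterov momentum for the outer step, while this sketch uses plain SGD and heavy-ball momentum; all names and constants here are made up for illustration.

```python
import numpy as np

def diloco_style_sgd(grad, w0, inner_steps=30, outer_rounds=10,
                     inner_lr=0.1, outer_lr=0.5, outer_momentum=0.5):
    """Toy sketch of the stop-and-go pattern: many tiny inner steps,
    then one big momentum-driven outer leap. Illustrative only -- the
    real DiLoCo uses AdamW inner steps and Nesterov outer momentum."""
    w = np.asarray(w0, dtype=float).copy()
    outer_buf = np.zeros_like(w)        # outer momentum: extra state to store
    for _ in range(outer_rounds):
        w_start = w.copy()              # snapshot: a second extra copy of the weights
        for _ in range(inner_steps):    # "a bunch of tiny steps"
            w = w - inner_lr * grad(w)
        pseudo_grad = w_start - w       # net direction the inner run discovered
        outer_buf = outer_momentum * outer_buf + pseudo_grad
        w = w_start - outer_lr * outer_buf   # "one giant, confident leap"
    return w
```

Note the two extra full-size buffers (`outer_buf` and the `w_start` snapshot): that is the "mental energy" the analogy refers to.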

Another new teacher, Schedule-Free, tried to fix this by averaging the robot's past positions to smooth out the path. But it used a "uniform average," which is like giving equal weight to the robot's position from 10 years ago and its position 1 second ago. That doesn't always make sense for a fast-moving robot.

Enter GPA: The "Smooth Operator"

The authors of this paper propose a new teacher called Generalized Primal Averaging (GPA).

Think of GPA as the perfect blend of the previous two methods, but with a few clever upgrades:

  1. The "Smooth" Leap:
    Imagine you are driving a car.

    • Old DiLoCo is like driving, stopping every 30 seconds to calculate the perfect turn, then jerking the wheel. It's effective but choppy.
    • GPA is like having a "smart cruise control" that constantly adjusts the steering wheel smoothly while you drive. It doesn't stop to calculate; it just gently nudges the car in the right direction at every single moment.
  2. The "Decoupled" Steering Wheel:
    The magic of GPA is that it separates two things that were previously tied together:

    • Where you look (Gradient): Where the robot is looking to see what's wrong.
    • Where you go (Update): Where the robot actually moves.
      In older methods, these were locked together. If you wanted to look further ahead, you had to move differently. GPA uses two separate "knobs" (parameters). You can turn one knob to look further ahead without messing up the other knob that controls how you move. This gives the teacher much more flexibility to find the fastest path.
  3. The "Exponential" Memory:
    Instead of remembering the past equally (like Schedule-Free's uniform average), GPA uses an exponential moving average (EMA).

    • Analogy: Imagine you are learning a song.
      • Uniform Average: You remember a note you played 100 notes ago just as clearly as the one you played 1 second ago.
      • GPA (Exponential): You remember the note you played 1 second ago very clearly, the one from 2 seconds ago a little less, and the one from 100 seconds ago is fuzzy. This makes sense because the recent mistakes are usually more important for correcting your current path.
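The difference between the two kinds of memory shows up in the weight each scheme assigns to past positions. A minimal sketch (`beta` is an illustrative smoothing constant, not a value from the paper):

```python
def uniform_weights(n):
    """Running mean: every one of the last n positions gets weight 1/n."""
    return [1.0 / n for _ in range(n)]

def ema_weights(n, beta=0.8):
    """EMA unrolled: the position k steps in the past gets weight
    (1 - beta) * beta**k, so recent positions dominate; the leftover
    mass beta**n sits on the starting point."""
    w = [(1.0 - beta) * beta**k for k in range(n)]
    w[-1] += beta**n   # fold the initialization's mass into the oldest slot
    return w
```

With `n=20`, the uniform scheme gives every position weight 0.05, while the EMA gives the most recent position weight 0.2 and the weights decay geometrically from there, matching the "fuzzy old notes" analogy.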
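Putting the three ideas together, the general shape of one training step can be sketched as follows. This is a hedged reconstruction from the description above, not the paper's exact algorithm: `c_interp` and `beta_avg` stand in for the two decoupled "knobs", and all names and values are illustrative.

```python
import numpy as np

def gpa_style_step(x, z, grad, lr=0.1, c_interp=0.9, beta_avg=0.98):
    """One GPA-flavoured step (illustrative sketch, not the paper's
    exact algorithm). Two separate knobs:
      c_interp -- where we *look*: the interpolation point for the gradient
      beta_avg -- how the *move* is smoothed: EMA weight on recent iterates
    In Schedule-Free these two roles are tied together; here each knob
    turns independently."""
    y = (1.0 - c_interp) * x + c_interp * z   # gradient evaluation point
    z = z - lr * grad(y)                      # fast "base" iterate
    x = beta_avg * x + (1.0 - beta_avg) * z   # smooth averaged iterate (the model you keep)
    return x, z
```

Because the average `x` is nudged a little at every step, there is no stop-and-reset moment: the "giant leap" is spread smoothly across all the tiny steps.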

Why Does This Matter?

The paper shows that GPA is faster, cheaper, and easier to use than the previous best methods.

  • Faster Training: In tests with different-sized AI models (from small to huge), GPA reached the same level of intelligence in fewer steps.

    • For a small model, it was about 8.7% faster.
    • For a medium model, it was 10% faster.
    • For a huge model, it was 9.6% faster.
    • Real-world impact: If training a model usually takes 100 days, GPA might get it done in 90 days, saving millions of dollars in electricity and computer time.
  • Less Memory: DiLoCo needed to store extra copies of the robot's brain to do its "stop-and-check" routine. GPA is more efficient; it needs less memory, which means you can train bigger models on the same hardware.

  • Simpler Tuning: DiLoCo had a lot of knobs to turn (how many steps to take, how fast to move, etc.). GPA has fewer knobs, making it easier for engineers to use without needing a PhD in math to get it working.
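The "real-world impact" figure above is just this arithmetic, under the assumption that time per step is unchanged (the 100-day baseline is a hypothetical round number, not a measurement from the paper):

```python
def projected_days(baseline_days, step_reduction):
    """Wall-clock days if the run needs `step_reduction` (e.g. 0.10 = 10%)
    fewer steps to reach the same loss, with time per step held fixed."""
    return baseline_days * (1.0 - step_reduction)

# The speedups reported above, applied to a hypothetical 100-day run:
projections = {name: projected_days(100, r)
               for name, r in [("small", 0.087), ("medium", 0.10), ("large", 0.096)]}
```

A 10% step reduction on a 100-day run lands at 90 days, matching the figure in the text.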

The Bottom Line

The authors took a complex, stop-and-go training method (DiLoCo) and a slightly rigid averaging method (Schedule-Free) and fused them into a smooth, continuous, and highly efficient training process.

It's like upgrading from a runner who has to stop and check a map every few blocks to a runner with a GPS that gently guides them around every corner in real-time. The result? The AI learns faster, uses less energy, and gets smarter sooner.
