This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
The Big Problem: The "Memory Wall"
Imagine you are trying to build a massive, intricate sandcastle (a Large Language Model or LLM) on a tiny beach (your GPU). The sandcastle is so huge that it requires billions of grains of sand (parameters).
The problem is that your beach (the GPU's memory) is too small to hold the entire sandcastle plus all the tools needed to build it (the optimizer state: for Adam, the momentum, the variance, and a full-precision master copy of the parameters).
To solve this, current methods (like DeepSpeed) say: "Okay, let's keep the sandcastle on the beach, but we'll put all the heavy tools in a warehouse across the street (the CPU/Host memory)."
The Bottleneck:
Every time you need to adjust the sandcastle, you have to:
- Run across the street to the warehouse to grab a tool.
- Run back to the beach to use it.
- Run back to the warehouse to put it away.
The street between the warehouse and the beach is a narrow, slow dirt path (the PCIe link). Meanwhile, the beach workers (the GPU) are incredibly fast, but they spend most of their time standing around waiting for the tools to arrive. This is the "Memory Wall."
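To put rough numbers on the dirt path, here is a back-of-envelope sketch. The figures are illustrative assumptions, not the paper's measurements: Adam-style FP32 state of 12 bytes per parameter (4 B master weights + 4 B momentum + 4 B variance) and an effective PCIe 4.0 x16 bandwidth of about 25 GB/s.

```python
# Back-of-envelope cost of shuttling optimizer state over PCIe.
# Assumptions (illustrative, not from the paper): Adam keeps 12 bytes
# of FP32 state per parameter, and the PCIe link sustains ~25 GB/s.

def offload_round_trip_seconds(num_params, bytes_per_param=12,
                               pcie_gbps=25.0):
    """Time to move the full optimizer state host->device and back."""
    one_way_bytes = num_params * bytes_per_param
    one_way_s = one_way_bytes / (pcie_gbps * 1e9)
    return 2 * one_way_s  # fetch the tools, then put them away again

# A 13B-parameter model: ~156 GB of state in each direction.
t = offload_round_trip_seconds(13e9)
print(f"{t:.1f} s per optimizer step just in transfers")  # ~12.5 s
```

Seconds of pure transfer time per step, during which a naive schedule leaves the GPU standing around: that is the memory wall in concrete terms.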
The Old Solution: The "Static Split"
The current best solution (called TwinFlow or ZeRO-Offload) tries to fix this by saying: "Let's keep some tools on the beach and leave the rest in the warehouse."
- The Flaw: It's a static plan. If you decide to keep 20% of the tools on the beach, you keep them there forever.
- The Waste: When the beach workers are busy building, the tools sitting on the beach are often just sitting there unused. Meanwhile, the workers in the warehouse are running back and forth on the slow dirt path, creating traffic jams. The beach is underutilized, and the path is clogged.
The New Solution: "Deep Optimizer States" (The Interleaved Approach)
The authors of this paper propose a smarter way to manage the tools. They call it Deep Optimizer States.
Instead of a static split, they use a dynamic, interleaved dance.
The Analogy: The Busy Kitchen
Imagine a high-end kitchen (the GPU) and a pantry (the CPU).
- The Chef (GPU) is incredibly fast but has a tiny counter space.
- The Prep Cook (CPU) is slower but has a huge pantry.
- The Hallway (PCIe) connects them.
The Old Way: The Chef grabs a few ingredients, cooks, then stops to wait for the Prep Cook to bring more from the pantry. The Chef stands idle. The Prep Cook runs back and forth.
The Deep Optimizer States Way:
The team realizes that while the Chef is chopping vegetables (the Forward Pass) and while the Chef is cooking (the Backward Pass), the hallway and part of the counter sit unused. In training terms: during the forward and backward passes, the PCIe link is mostly idle, which is exactly the window in which optimizer state can be ferried to the GPU before the update step needs it.
They introduce a Smart Scheduler that does three things:
- The "Just-in-Time" Delivery: Instead of waiting for the Chef to finish a whole dish, the scheduler breaks the work into tiny chunks (subgroups). As soon as the Chef finishes one tiny chunk, the Prep Cook immediately starts prepping the next chunk in the pantry while the Chef starts the one after that.
- The "Double-Task" Hallway: The hallway is used in both directions at the same time. While the Prep Cook carries fresh ingredients to the kitchen (Host-to-Device transfers), the Chef is simultaneously sending finished plates back to the pantry (Device-to-Host transfers of updated state). Neither waits for the other; they pass each other in the hallway.
- The "Smart Switch": The system calculates exactly how many chunks the Chef should do versus how many the Prep Cook should do. It's not a fixed 20/80 split. It's a dynamic rhythm. If the hallway is clear, the Chef does more. If the hallway is busy, the Prep Cook does more.
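The payoff of this interleaving can be seen with a toy time model. The per-chunk costs below are hypothetical numbers, not the paper's: each subgroup needs a host-to-device copy, a GPU update, and a device-to-host write-back. A serial schedule pays all three per chunk; the pipelined schedule overlaps the copies for chunk i+1 with compute on chunk i, and the duplex link lets H2D and D2H run together.

```python
# Toy pipeline model of subgroup interleaving (hypothetical timings).

def serial_time(n, h2d, gpu, d2h):
    """No overlap: every chunk pays copy-in + compute + copy-out."""
    return n * (h2d + gpu + d2h)

def pipelined_time(n, h2d, gpu, d2h):
    """Full overlap: steady state is bounded by the slowest stage;
    the other stages are only paid once, to fill and drain the pipe."""
    bottleneck = max(h2d, gpu, d2h)
    return h2d + gpu + d2h + (n - 1) * bottleneck

n, h2d, gpu, d2h = 16, 2.0, 3.0, 2.0       # ms per subgroup
print(serial_time(n, h2d, gpu, d2h))        # 112.0 ms
print(pipelined_time(n, h2d, gpu, d2h))     # 52.0 ms
```

With these made-up numbers the pipelined schedule is already more than 2x faster, and the gap widens as the number of subgroups grows, because the fill/drain overhead is amortized.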
The Secret Sauce: Precision and Timing
The paper also found two other clever tricks:
- Don't Repack the Boxes: Usually, when moving tools from the warehouse to the beach, you have to repackage them from big boxes (FP32) to small bags (FP16) to fit them in the hallway. This takes time. The new method does the repacking inside the warehouse or right at the door, so the tools move through the hallway faster.
- The Performance Model: They built a mathematical "traffic cop" (a performance model) that constantly watches the speed of the Chef, the speed of the Prep Cook, and the traffic in the hallway. It tells the system exactly how to split the work to ensure no one is ever standing idle.
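A minimal version of such a "traffic cop" can be sketched as a load-balancing formula. All costs here are assumed inputs (measured at runtime in a real system, invented here for illustration): once the GPU-side pipeline is full, its per-subgroup cost is dominated by the slower of the PCIe transfer and the GPU compute, while the CPU updates its share in place with no transfer. Balancing the two finish times gives the GPU's fraction of the work.

```python
# A minimal "traffic cop": choose the fraction of subgroups the GPU
# updates so that GPU and CPU finish at the same time.
# Assumed per-subgroup costs (illustrative, not measured):
#   t_transfer - PCIe copy for one subgroup
#   t_gpu      - GPU update for one subgroup
#   t_cpu      - CPU update for one subgroup (stays in host memory)

def gpu_fraction(t_transfer, t_gpu, t_cpu):
    per_gpu_chunk = max(t_transfer, t_gpu)  # overlapped pipeline stage
    # Balance f * per_gpu_chunk == (1 - f) * t_cpu, solve for f.
    return t_cpu / (per_gpu_chunk + t_cpu)

f = gpu_fraction(t_transfer=2.0, t_gpu=1.0, t_cpu=6.0)
print(f"GPU should take {f:.0%} of the subgroups")  # 75%
```

Note how the split reacts to conditions, echoing the "Smart Switch" above: if the hallway clogs (t_transfer grows), f shrinks and the CPU keeps more chunks; if the CPU is slow, f grows and more chunks ride the pipeline.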
The Results: A 2.5x Speed Boost
By using this "Interleaved Offloading" technique, the authors showed that:
- The beach workers (GPU) are almost always busy.
- The hallway (PCIe) is used to its full capacity in both directions.
- The training process runs 2.5 times faster than the current state-of-the-art methods.
Summary
Think of Deep Optimizer States as upgrading a factory from a "stop-and-go" assembly line to a "continuous flow" assembly line. Instead of stopping to wait for parts, the system keeps the workers moving, the trucks moving, and the machines running by perfectly timing when to move parts between the fast machine and the slow warehouse. This allows us to train massive AI models on smaller, cheaper hardware much faster.