Imagine you are trying to build a massive, super-intelligent brain (a Large Language Model) to solve complex problems. In the past, to make this brain smarter, you just made it bigger, like adding more bricks to a wall. But this made the brain incredibly slow and expensive to train and run, because every single thought required every single brick to do work.
Mixture-of-Experts (MoE) is a different way to build this brain. Instead of one giant brain, imagine a team of 100 specialized consultants (Experts). When you ask a question, a Manager (the Router) doesn't wake up all 100 consultants. Instead, it picks the top 2 or 3 experts who are best at that specific topic and asks only them to work.
This is powerful because the brain can be huge (hundreds of experts per layer, adding up to hundreds of billions or even trillions of parameters) but only uses a tiny fraction of them for each question. It's like having a library with a million books, but you only pull out the two you need for the moment.
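Stripping away the analogy, the Manager is just a small learned scoring layer. Here is a minimal sketch of top-k routing in plain NumPy; the scoring matrix, the toy token, and k=2 are illustrative assumptions, not the paper's actual router:

```python
import numpy as np

def route(token, router_weights, k=2):
    """Score every expert for this token, keep only the k best."""
    logits = router_weights @ token           # one score per expert
    topk = np.argsort(logits)[-k:][::-1]      # indices of the k best experts
    # Softmax over just the chosen experts gives their mixing weights.
    p = np.exp(logits[topk] - logits[topk].max())
    return topk, p / p.sum()

# With 4 experts and k=2, only experts 1 and 3 are "woken up" here:
token = np.array([0.1, 3.0, 0.2, 2.0])
experts, gates = route(token, np.eye(4), k=2)
print(experts)   # [1 3]
```

All the other experts stay asleep for this token, which is exactly why the compute cost stays flat even as the total parameter count grows.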
However, training this system is a logistical nightmare. NVIDIA's new paper, "Scalable Training of Mixture-of-Experts Models with Megatron Core," is essentially the ultimate operations manual for running this massive team of consultants efficiently. They call the problems they solved the "Three Walls."
Here is how they broke through those walls, explained simply:
1. The Memory Wall: "The Fitting Room Problem"
The Problem: Even though only a few consultants work at a time, all 100 consultants must be present in the room (the GPU memory) just in case they are needed. If you have hundreds of experts per layer, the room gets so crowded with their resumes and tools that you can't fit them all in.
The Solution:
- Compression: They taught the experts to carry lighter backpacks (storing numbers in FP8 or FP4 precision instead of the usual 16 bits). Instead of carrying heavy, detailed blueprints, they carry simplified sketches that are 50% to 75% smaller but still get the job done.
- The "Do-It-Yourself" Trick (activation recomputation): Instead of storing every single step of the experts' work in the room, they throw the notes away and ask the experts to re-calculate the steps when needed later. Re-calculating costs a little extra time, but it frees a massive amount of space.
- The Attic: For the stuff they don't need right now, they move it to the attic (offloading to CPU memory) and only bring it down when absolutely necessary.
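The "Do-It-Yourself" trick can be sketched in a few lines of pure Python. This is a toy illustration of the store-vs-recompute trade-off, not Megatron Core's implementation; the layer functions are made up for the example:

```python
def forward_store(x, layers):
    """Standard forward pass: cache every intermediate (high memory)."""
    cache = [x]
    for f in layers:
        x = f(x)
        cache.append(x)
    return x, cache  # one cached tensor per layer, plus the input

def forward_recompute(x, layers):
    """Checkpointed forward pass: keep only the input, recompute on demand."""
    saved_input = x
    for f in layers:
        x = f(x)

    def recompute():
        # Re-run the forward pass from the saved input when the
        # intermediates are actually needed (e.g. during backprop).
        y = saved_input
        cache = [y]
        for f in layers:
            y = f(y)
            cache.append(y)
        return cache

    return x, recompute

# Two toy "layers": add one, then double.
layers = [lambda v: v + 1, lambda v: v * 2]
out, cache = forward_store(3, layers)        # out = 8, cache = [3, 4, 8]
out2, recompute = forward_recompute(3, layers)  # stores only the input, 3
```

The second version holds one value instead of a value per layer; the price is running the forward computation twice, which is exactly the "tiny bit more time for a massive amount of space" trade described above.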
2. The Communication Wall: "The Traffic Jam"
The Problem: Since the Manager picks different experts for different questions, the consultants are scattered across different buildings (GPUs). The consultants stay put; it's the questions (tokens) that must be mailed to the right buildings and the answers mailed back. If you have thousands of buildings, the time spent in the mail (communication) becomes longer than the time spent actually working.
The Solution:
- Super-Highways: They built specialized, ultra-fast delivery routes (DeepEP and HybridEP) specifically for shuttling questions to the right experts and collecting their answers. These routes are optimized so that each delivery takes almost no time.
- Multitasking: They figured out how to make the consultants work on one batch of questions while the next batch is still in transit. It's like a chef starting to chop vegetables while the delivery driver is still unloading the groceries. This hides the travel time almost completely.
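A back-of-the-envelope model shows why the multitasking trick works. Assume each chunk of work needs `comm` time-units to deliver and `compute` time-units to process (the numbers below are illustrative, not measurements from the paper):

```python
def serial_time(chunks, comm, compute):
    """No overlap: deliver a chunk, process it, then start the next delivery."""
    return chunks * (comm + compute)

def overlapped_time(chunks, comm, compute):
    """Deliver chunk i+1 while chunk i is being processed.

    After the first delivery, the pipeline advances at the pace of
    whichever stage is slower; the last chunk still has to finish computing.
    """
    return comm + (chunks - 1) * max(comm, compute) + compute

# 4 chunks, 2 units of communication, 3 of compute:
# serial:     4 * (2 + 3)      = 20
# overlapped: 2 + 3 * 3 + 3    = 14
```

As the number of chunks grows, the overlapped cost approaches `chunks * max(comm, compute)`: whenever compute is the slower stage, the communication time effectively disappears from the total.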
3. The Compute Efficiency Wall: "The Idle Workers"
The Problem: Because the experts are so specialized, they often get very small tasks. Imagine a master carpenter being asked to just hammer one nail. They spend most of their time waiting for the next instruction, and the computer's "boss" (the CPU) gets overwhelmed trying to give out thousands of tiny instructions every second. The workers sit idle, and the boss is stressed.
The Solution:
- Batching: Instead of handing each carpenter one nail at a time, they collect every nail destined for a carpenter and hand the whole pile over as one big job (grouped GEMM). This keeps everyone busy.
- The "Script" (CUDA Graphs): Instead of the boss shouting instructions one by one ("Hammer! Sand! Paint!"), they write the instructions down on a script once. Then, the workers just follow the script automatically without waiting for the boss to speak. This removes the "shouting" delay.
- Smart Scheduling: They use a system called Parallel Folding. Imagine a factory where the assembly line for "Painting" (Attention layers) and the assembly line for "Woodworking" (MoE layers) used to be forced to use the same number of workers. Now, they can have a huge team for woodworking and a small team for painting, or vice versa, depending on what the factory needs that day. This ensures no one is ever standing around doing nothing.
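The Batching idea above can be sketched by grouping tokens by expert so each expert runs one matrix multiply instead of many tiny ones. This is a toy NumPy loop with made-up shapes; Megatron Core uses fused grouped-GEMM GPU kernels, not Python loops:

```python
import numpy as np

def per_token_matmuls(tokens, assignments, expert_weights):
    """Naive: one tiny matmul (one kernel launch) per token."""
    return [tokens[i] @ expert_weights[assignments[i]]
            for i in range(len(tokens))]

def grouped_matmuls(tokens, assignments, expert_weights):
    """Grouped: gather all tokens routed to the same expert,
    then run one larger matmul per expert."""
    out = [None] * len(tokens)
    for e, W in enumerate(expert_weights):
        idx = [i for i, a in enumerate(assignments) if a == e]
        if not idx:
            continue  # this expert received no tokens this step
        batch = np.stack([tokens[i] for i in idx])  # (n_tokens_for_e, d)
        result = batch @ W                          # one matmul for the group
        for row, i in enumerate(idx):
            out[i] = result[row]
    return out

# 6 tokens, 3 experts: both paths give identical answers,
# but the grouped path launches 3 matmuls instead of 6.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
expert_weights = rng.standard_normal((3, 4, 4))
assignments = [0, 1, 0, 2, 1, 1]
```

The outputs are identical either way; the win is purely in how the work is packaged, which is why the carpenters stop idling without anyone's answer changing.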
The Result: A Super-Factory
By fixing these three problems, NVIDIA created a system that can train models with hundreds of billions to trillions of parameters (like DeepSeek-V3 and Qwen3) incredibly fast.
- On the newest super-computers (GB300/GB200): They report record-setting training throughput, processing data so fast it's like watching a movie in fast-forward.
- The "Secret Sauce": The paper emphasizes that you can't just fix one problem. If you fix the memory but ignore the traffic, you still fail. You have to fix all three at the same time, like tuning a race car's engine, tires, and aerodynamics simultaneously.
In a nutshell: This paper is the blueprint for how to build a massive, distributed team of AI experts without them getting lost in traffic, running out of office space, or sitting around waiting for instructions. It turns a chaotic, slow process into a sleek, high-speed machine.