Imagine you are trying to build a massive, super-smart library (a Large Language Model) that can answer any question. To make this library efficient, you don't have one giant librarian who knows everything; instead, you hire thousands of specialized experts (like a historian, a coder, a poet, and a chef). This is called a Mixture-of-Experts (MoE) model.
However, there's a problem: each question has to be carried across the building to whichever experts it needs, and their notes carried back and combined. This creates a huge traffic jam in the library's hallways (communication) and requires a massive amount of desk space to hold all those notes (memory).
The Problem: The "Hopper" Library is Old (But Powerful)
Your library is built on Hopper GPUs, which are incredibly fast computers. But they have a specific rule: they are great at handling "FP8" (a standard size for notes) and "BF16" (a very large, safe size), but they don't have a special machine to handle "FP4" (tiny, compressed notes).
Newer computers (like Blackwell) have a special machine for FP4, but most people are still using Hopper. Without that special machine, trying to use FP4 notes usually means:
- Writing the note in FP4.
- Expanding it back to a huge BF16 note just to read it.
- Shrinking it back to FP8 to do the math.
- Expanding it again.
This "round-trip" is like packing a suitcase, unpacking it to put on a scale, repacking it, and then unpacking it again just to walk through a door. It's slow and wastes energy.
The Solution: A Smart "Compression" Trick
The authors of this paper figured out how to use FP4 notes on Hopper computers without that slow round-trip. They created a new "training recipe" that acts like a masterful logistics manager.
Here is how they did it, using simple analogies:
1. The "Backpack" Strategy (Memory Savings)
Imagine the experts are writing their notes on giant whiteboards (Memory).
- Old Way: They write in big, clear letters (FP8). The whiteboards get full quickly, so you can't fit many experts at once.
- New Way: The authors invented a way to write the notes in tiny, compressed shorthand (FP4) only when the notes are being passed between experts or stored for later.
- The Magic: They don't expand the notes to read them. Instead, they built a special "translator" (a software kernel) that can read the tiny shorthand directly and convert it into a format the computer's math engine understands, skipping the messy middle steps.
- Result: You can fit 50% more notes on the same whiteboard. This means the library can handle bigger questions or more experts without running out of space.
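The whiteboard savings above come down to simple arithmetic: FP4 notes take half the bits of FP8 notes, at the small cost of storing a scale factor per block of values. The sketch below is back-of-the-envelope only; the per-16-value scale layout is a common microscaling choice, an assumption on my part rather than something stated here.

```python
# Illustrative arithmetic (not the paper's numbers): bytes needed to hold
# one activation tensor at different precisions. The FP4 figure assumes one
# 1-byte scale per 16-value block -- a common microscaling layout, assumed
# here for illustration.
def activation_bytes(num_values, bits_per_value, scale_block=None):
    data = num_values * bits_per_value // 8
    scales = num_values // scale_block if scale_block else 0  # 1 byte each
    return data + scales

n = 4096 * 8192                    # one activation tensor: tokens x hidden
bf16 = activation_bytes(n, 16)     # 64 MiB
fp8  = activation_bytes(n, 8)      # 32 MiB
fp4  = activation_bytes(n, 4, 16)  # 18 MiB, even after counting the scales
```

Even with the scale overhead, FP4 storage stays well under the FP8 footprint, which is where the extra whiteboard room comes from.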
2. The "One-Way Street" (Forward vs. Backward)
In training a model, there are two directions:
- Forward Pass (The Delivery): Sending the question to the experts.
- Backward Pass (The Correction): Checking the answers and fixing mistakes.
The authors realized that for the Forward Pass, using the tiny FP4 notes saves so much time and space that it's worth the effort. But for the Backward Pass, the "translation" cost was too high. So, they made a smart compromise:
- Forward: Use the tiny, compressed FP4 notes (Super fast!).
- Backward: Stick to the standard, slightly larger FP8 notes (Safe and stable).
This is like using a bicycle to deliver mail (fast, efficient) but using a truck to return the empty boxes (safe, reliable). This "hybrid" approach gave them the best of both worlds.
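The forward/backward split can be sketched as a fake-quantization scheme: snap activations to the FP4 value grid on the way forward, but hand the backward pass the higher-precision copies. Everything below (the function names, the max-based scale rule) is illustrative, not the paper's actual recipe.

```python
# The 8 non-negative values representable in FP4 (E2M1), plus negatives.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID += [-v for v in FP4_GRID[1:]]

def snap_to_fp4(x, scale):
    """Round x/scale to the nearest value on the FP4 grid (fake quantization)."""
    return scale * min(FP4_GRID, key=lambda v: abs(v - x / scale))

def linear_forward(xs, w):
    """Forward pass: compute with FP4-quantized activations (the fast path)."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map the largest |x| to 6
    xq = [snap_to_fp4(x, scale) for x in xs]
    y = sum(x * wi for x, wi in zip(xq, w))
    return y, xs  # save the *unquantized* activations for backward

def linear_backward(saved_xs, grad_y):
    """Backward pass: use the saved higher-precision activations (the safe path)."""
    return [grad_y * x for x in saved_xs]
```

The key design choice mirrors the bicycle-and-truck compromise: the quantization error only enters the forward computation, while gradients flow through the untouched values.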
3. The "Direct Translation" (No Middleman)
The biggest technical hurdle was converting the FP4 notes to FP8 math without using the "BF16" middleman.
- The Old Way: FP4 → BF16 → FP8. (Like translating French to English, then English to Spanish.)
- The New Way: FP4 → FP8. (Direct translation.)
They wrote a custom "dictionary" (a bitwise conversion algorithm) that maps the tiny FP4 bits directly to the FP8 bits. It's like having a secret code where you can instantly swap a "1" for a "2" without writing out the whole word first. This saved a massive amount of time.
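A direct translation like this is possible because every FP4 (E2M1) value is exactly representable in FP8 (E4M3), so a few bit shifts re-bias the exponent and widen the mantissa with no rounding. The sketch below implements that standard bit mapping; it is my own minimal illustration, not the paper's actual kernel.

```python
def decode_fp4(bits):
    """Decode a 4-bit E2M1 value: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:                       # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

def fp4_to_fp8_bits(bits):
    """Map an FP4 bit pattern straight to an FP8 E4M3 pattern
    (1 sign, 4 exponent bits with bias 7, 3 mantissa bits) -- no BF16 middleman."""
    sign = (bits >> 3) & 1
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:
        if man == 0:                   # +/- zero
            return sign << 7
        return (sign << 7) | (6 << 3)  # 0.5 = 2^-1 -> biased exponent 6
    # re-bias the exponent (1 -> 7) and widen the mantissa (1 bit -> 3 bits)
    return (sign << 7) | ((exp + 6) << 3) | (man << 2)

def decode_fp8_e4m3(bits):
    """Decode an 8-bit E4M3 value, for checking the conversion."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0b1111
    man = bits & 0b111
    if exp == 0:                       # subnormal
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Every one of the 16 FP4 bit patterns survives the direct conversion exactly.
for pattern in range(16):
    assert decode_fp4(pattern) == decode_fp8_e4m3(fp4_to_fp8_bits(pattern))
```

Because the whole mapping is shift-and-mask on 4-bit inputs, it can also be precomputed as a 16-entry lookup table, which is exactly the "secret code" flavor of swap the analogy describes.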
The Results: A Faster, Bigger Library
When they tested this on a massive model with 671 billion parameters (think of it as a library with 671 billion books):
- Memory: They saved 14.8% of the memory space. This is like finding an extra room in a crowded house without building an addition.
- Speed: They trained 12.5% faster. The library could process more questions per second.
- Quality: The model learned just as well as the standard methods. It didn't get "confused" by the tiny notes.
The Bottom Line
This paper shows that you don't need to wait for brand-new, expensive hardware to get the benefits of ultra-efficient computing. By being clever with software—creating smart translators, using compression only where it helps, and skipping unnecessary steps—you can make current, powerful computers (Hopper GPUs) run massive AI models faster and cheaper.
It's a reminder that sometimes, the best way to move faster isn't to buy a faster car, but to take a smarter route.