Pretraining Large Language Models with NVFP4

This paper introduces an NVFP4 training framework that combines Random Hadamard transforms, 2D block scaling, stochastic rounding, and selective high-precision layers to pretrain a 12-billion-parameter model on 10 trillion tokens with accuracy comparable to an FP8 baseline, demonstrating that stable 4-bit pretraining of large language models is viable.

NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Muya Chang, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu

Published 2026-03-06

Imagine you are trying to teach a super-smart robot (a Large Language Model, or LLM) to understand the entire internet, write poetry, solve math problems, and code software. To do this, you need to show it trillions of examples.

In the past, teaching this robot required a massive, energy-hungry "brain" that used 8-bit precision (like a 4K camera). It was powerful, but it was also expensive, slow, and ate up a lot of electricity.

NVIDIA's new paper introduces a way to teach this robot using 4-bit precision (NVFP4). Think of this as switching from a 4K camera to a highly optimized, ultra-efficient 1080p camera. You might think, "Won't the picture look blurry? Won't the robot make mistakes?"

The answer is: Not if you use the right tricks.

Here is the simple breakdown of how they made this work, using some everyday analogies.

1. The Problem: The "Tiny Box" Issue

Imagine you are trying to pack a giant, heavy suitcase (the data) into a tiny backpack (the 4-bit format).

  • The Old Way (MXFP4): You try to force everything in. If the suitcase has a few really heavy items (outliers), they crush the rest of the clothes, or the backpack rips. The robot gets confused because the "heavy" numbers get squished into zero or the maximum limit.
  • The New Way (NVFP4): NVIDIA designed a smarter backpack. Instead of one big compartment, they broke the backpack into smaller, specialized pockets.
    • Smaller Pockets: They use smaller blocks of data (16 items instead of 32). This means the "heavy" items don't crush as many "light" items.
    • Better Labels: They use a more precise label system (E4M3, an 8-bit floating-point format) to describe how heavy the items are, instead of rounding every label to the nearest power of two the way MXFP4 does. This lets the backpack fit the data snugly without tearing.
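The "smaller pockets" idea can be sketched in a few lines. This is a toy illustration, not NVIDIA's kernels: the 4-bit value grid below is the standard E2M1 magnitude set, and the per-block scale is kept as an ordinary float rather than a true E4M3 value. Note how, even with a block of 16, a single heavy outlier still squashes its neighbors toward zero; that is exactly why the outlier-spreading tricks in the next section matter.

```python
# Toy block quantizer: blocks of 16 values, one scale per block,
# each element snapped to an FP4-like (E2M1) magnitude grid.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block(block):
    """Quantize a 1-D block: the scale maps the largest |value| to 6.0 (FP4 max)."""
    scale = np.abs(block).max() / 6.0
    if scale == 0.0:
        return np.zeros_like(block), 0.0
    scaled = block / scale
    # snap each magnitude to the nearest grid point, keep the sign
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx], scale

def dequantize(q, scale):
    return q * scale

x = np.full(16, 0.2)   # sixteen "light" items...
x[3] = 5.0             # ...and one heavy outlier
q, s = quantize_block(x)
x_hat = dequantize(q, s)
# The outlier survives, but every light item is crushed to zero.
```

A smaller block means fewer neighbors get crushed per outlier, which is the whole point of shrinking the pockets from 32 elements to 16.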

2. The Four Magic Tricks

Just having a better backpack isn't enough. To make the robot learn as well as it did with the big 8-bit camera, they added four specific "training techniques":

A. The "Safety Net" (Mixed Precision)

Imagine you are teaching a student to do complex math. You let them do 90% of the work with a cheap calculator (4-bit), but for the final, most critical steps (like the last few pages of a thesis), you let them use a super-precise, expensive calculator (BF16/FP32).

  • Why? The beginning and middle of the learning process are robust, but the very end is delicate. If you use the cheap calculator for the final steps, the student might get the answer slightly wrong. By keeping the last few layers in high precision, the robot gets the best of both worlds: speed and accuracy.
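As a sketch of the idea (the function name and the choice of four high-precision tail layers are illustrative, not taken from the paper):

```python
# Toy "selective high precision" plan: the bulk of the network runs in 4-bit,
# while the last few, most delicate layers keep a high-precision format.
def precision_plan(num_layers, num_high_precision_tail=4):
    """Return the numeric format assigned to each transformer layer."""
    plan = []
    for i in range(num_layers):
        if i >= num_layers - num_high_precision_tail:
            plan.append("bf16")   # final layers: the expensive calculator
        else:
            plan.append("nvfp4")  # everything else: the cheap calculator
    return plan

plan = precision_plan(48)
```

The design choice here is that the cost is tiny: a handful of high-precision layers out of dozens barely dents the speed and memory savings, but protects the most error-sensitive part of the network.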

B. The "Shuffle" (Random Hadamard Transforms)

Imagine you have a deck of cards where a few cards are "Jokers" (outliers) that are huge and weird. If you deal them out, they mess up the game.

  • The Trick: Before dealing the cards, you shuffle the deck using a special mathematical "shuffle" (Hadamard transform). This spreads the "Jokers" out so they aren't clumped together. Now, when the robot looks at the data, the weird numbers look like normal numbers, and the 4-bit format can handle them easily.
  • Note: They found this shuffle is only needed for the "backward pass" (when the robot checks its mistakes), not the "forward pass" (when it makes predictions).
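The "shuffle" can be demonstrated directly. The sketch below (a simplified stand-in for the paper's transform) builds a Hadamard matrix by Sylvester's construction and combines it with random sign flips. Because the transform is orthonormal, it preserves the vector's total energy and can be inverted, but it smears one giant outlier evenly across every coordinate:

```python
# Random Hadamard transform sketch: random sign flips, then an orthonormal
# Hadamard matrix. One huge outlier becomes many modest, same-sized values.
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x, rng):
    n = x.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)  # random diagonal sign matrix
    H = hadamard(n) / np.sqrt(n)             # orthonormal: H @ H.T == I
    return H @ (signs * x), signs

rng = np.random.default_rng(0)
x = np.zeros(16)
x[3] = 100.0                                 # one huge "Joker" outlier
y, signs = random_hadamard_transform(x, rng)
# Every entry of y now has magnitude 100/4 = 25: the Joker has been
# spread evenly, and the vector's total energy (norm) is unchanged.
```

After the transform, the 4-bit quantizer sees sixteen ordinary values instead of fifteen zeros and one monster, so far less information is lost.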

C. The "Consistent Map" (2D Scaling)

Imagine you are giving directions to a friend.

  • The Problem: In the morning, you tell them to turn "Left" based on the street name. In the evening, when they retrace their steps, you tell them to turn "Right" based on the building number. If the map changes between morning and night, your friend gets lost.
  • The Fix: In AI, the "forward pass" and "backward pass" look at data differently (rows vs. columns). If the 4-bit format treats them differently, the robot gets confused. NVIDIA created a 2D block scaling method. It's like drawing a grid on the map so that no matter which way you look at it, the "Left" and "Right" instructions remain consistent. This keeps the robot from getting lost during learning.
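The map analogy can be made concrete with a toy experiment (the quantizer below is a crude stand-in, not NVFP4 itself). If the forward pass scales a weight matrix along rows and the backward pass scales the same matrix along columns, the two passes see different quantized values; one shared scale per square 2D block makes both views identical:

```python
# Why 2-D block scaling matters: 1-D row scales and 1-D column scales
# disagree on the same matrix, while one scale per 2-D block cannot.
import numpy as np

def quantize(x, scale, levels=7):
    """Crude symmetric integer quantizer, for illustration only."""
    return np.round(np.clip(x / scale, -levels, levels)) * scale

W = np.array([[0.1, 8.0],
              [0.2, 0.3]])

# 1-D scaling: forward uses per-row scales, backward per-column scales.
row_scales = np.abs(W).max(axis=1, keepdims=True) / 7
col_scales = np.abs(W).max(axis=0, keepdims=True) / 7
fwd_view = quantize(W, row_scales)  # what the forward pass sees
bwd_view = quantize(W, col_scales)  # what the backward pass sees

# 2-D scaling: one scale for the whole 2x2 block, identical in both passes.
block_scale = np.abs(W).max() / 7
fwd_2d = quantize(W, block_scale)
bwd_2d = quantize(W, block_scale)
```

With the row/column scales, the outlier 8.0 dominates a different group of values in each view, so the forward and backward passes effectively train slightly different networks; the shared 2D scale removes that mismatch.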

D. The "Fair Coin Flip" (Stochastic Rounding)

Imagine you have a jar of marbles, and you need to round the number of marbles to the nearest whole number.

  • The Problem: If you always snap every count to the nearest whole number, a value like 1.4 becomes 1 every single time. Those small, one-sided errors never cancel out, and over millions of calculations the robot drifts in one direction.
  • The Fix: Instead of always rounding down, you flip a coin. If you have 1.4, you round down 60% of the time and up 40% of the time. This "randomness" ensures that, on average, the robot stays perfectly centered and doesn't drift off course. This is crucial for the "gradient" (the error signal) that tells the robot how to learn.
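The coin flip is easy to verify numerically. This small demo (a generic stochastic-rounding routine, not the paper's hardware implementation) rounds 1.4 up with probability 0.4 and down with probability 0.6, so the average of the rounded values comes back to 1.4:

```python
# Stochastic rounding: round up with probability equal to the fractional part,
# so the expected value of the rounded number equals the original number.
import numpy as np

def stochastic_round(x, rng):
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(np.shape(x)) < frac)  # up with prob = frac

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 1.4), rng)
mean = samples.mean()
# Each sample is 1.0 or 2.0, but their average is ~1.4: unbiased on average.
```

Round-to-nearest would return 1.0 for every one of those 100,000 samples, a systematic error of 0.4 per step; stochastic rounding trades that bias for harmless zero-mean noise, which is exactly what gradients need.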

3. The Results: Did it Work?

NVIDIA tested this on a massive 12-billion-parameter model, training it on 10 trillion tokens (a huge amount of data).

  • The Comparison: They compared their 4-bit robot to a standard 8-bit robot.
  • The Outcome: The 4-bit robot performed almost identically to the 8-bit one.
    • On a test called MMLU-Pro (a hard exam for AI), the 8-bit robot got 62.62%. The 4-bit robot got 62.58%.
    • The loss curve (a measure of how well the robot is learning) was almost a perfect match.

The Bottom Line

NVIDIA has figured out how to shrink the "brain" of a super-intelligent AI by half (from 8-bit to 4-bit) without making it dumber.

  • Why does this matter? It means we can train the next generation of AI models twice as fast, use half the memory, and save a massive amount of energy.
  • The Analogy: It's like discovering a way to fly a jumbo jet using half the fuel, without losing any speed or safety.

This isn't just a small tweak; it's a major leap forward that makes the future of AI faster, cheaper, and more accessible.