Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

This paper introduces Micro-Rotated-GPTQ (MR-GPTQ), a specialized quantization algorithm that overcomes the accuracy limitations of current FP4 formats (MXFP4 and NVFP4) through block-wise Hadamard transforms and format-specific optimizations, achieving significant speedups on modern GPUs while matching or exceeding state-of-the-art performance.

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Published 2026-03-04

Imagine you have a massive, incredibly detailed library of knowledge (a Large Language Model, or LLM) that can write stories, solve math problems, and chat with you. This library is currently written in FP16, which is like using high-quality, heavy, gold-plated bricks. It's accurate, but it's heavy, expensive to store, and slow to move around.

To make this library easier to carry and faster to use, researchers have been trying to replace those gold bricks with tiny, lightweight bricks. For a long time, the best "lightweight" bricks were INT4 (4-bit integers). They were small and fast, but sometimes the library lost a bit of its "soul" (accuracy) because the bricks were too rigid.

Recently, hardware giants like NVIDIA and AMD introduced a new type of brick called Microscaling FP4. Think of these as "smart" lightweight bricks. Instead of just being a number, they come in little groups (like a box of 16 or 32 bricks) that share a single "size tag" (a scale). The promise was that these smart bricks would be lighter than gold but just as accurate.
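To make the "box of bricks with one size tag" concrete, here is a minimal sketch of block-wise FP4 quantization: a small group of values shares a single scale, and each value snaps to the nearest magnitude representable in FP4 (E2M1). This is an illustrative simplification, not the paper's code; the block size, scale choice, and rounding are assumptions for readability.

```python
# Illustrative sketch of microscaling quantization (NOT the paper's code):
# one block of values shares a single scale, and each value is rounded
# to the nearest FP4 (E2M1) magnitude. Real formats also encode the
# scale itself (power-of-two for MXFP4, FP8 for NVFP4).

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def quantize_block(block):
    """Quantize a block of floats using one shared scale (simplified)."""
    amax = max(abs(x) for x in block)
    # Simple scale choice: map the block's largest magnitude to 6.0,
    # the largest FP4 value. (The paper optimizes this choice per block.)
    scale = amax / 6.0 if amax > 0 else 1.0

    def to_fp4(x):
        # Snap |x| / scale to the nearest representable FP4 magnitude.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        return mag * scale * (1.0 if x >= 0 else -1.0)

    return [to_fp4(x) for x in block]

block = [0.1, -0.4, 2.5, 0.05, -1.2, 0.3, 0.8, -0.02]
print(quantize_block(block))  # small values collapse to 0, large ones survive
```

Note how the shared scale ties the whole block's fate to its largest member: one big outlier stretches the scale and crushes the small values toward zero, which is exactly the failure mode the paper's rotation trick addresses.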

The Problem: The Promise vs. The Reality
The authors of this paper went into the warehouse to test these new "smart bricks" (MXFP4 and NVFP4). They found a disappointing gap between the marketing hype and reality:

  1. The "NVFP4" Box (16 bricks): It's a bit more precise, but the standard way of packing these bricks (using simple rounding) actually made the library worse at certain tasks. It was like trying to fit a square peg in a round hole; the standard tools just didn't work well with this specific box shape.
  2. The "MXFP4" Box (32 bricks): This one is even lighter, but it uses a very rough "size tag" (rounding sizes to powers of two). This caused a lot of "crumbling" in the library's accuracy. It was like trying to measure a delicate cake with a ruler that only has inch marks—you lose all the fine details.
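The "rough size tag" of MXFP4 can be seen in a few lines. Below is a hedged sketch of the difference between an ideal per-block scale and one rounded to a power of two (a stand-in for MXFP4's E8M0 scale encoding; real implementations may pick the power of two differently, e.g. by flooring):

```python
# Hedged illustration of MXFP4's coarse scales: the ideal block scale is
# rounded to the nearest power of two. The exact rounding rule used in
# real MXFP4 implementations may differ; this is a simplified assumption.
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def ideal_scale(amax):
    """Fine-grained scale: map the block max exactly onto FP4's max value."""
    return amax / FP4_MAX

def pow2_scale(amax):
    """Power-of-two scale, as MXFP4's E8M0 encoding requires."""
    return 2.0 ** round(math.log2(ideal_scale(amax)))

amax = 5.0
print(ideal_scale(amax))  # fine scale, ~0.83
print(pow2_scale(amax))   # forced to 1.0, ~20% coarser than ideal
```

NVFP4's FP8 scales sit much closer to the ideal value; MXFP4's power-of-two constraint is the "ruler with only inch marks" from the analogy above.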

Basically, the new hardware was ready, but the software (the algorithms) to use it was broken. If you just plugged these new formats into existing tools, your AI would start making silly mistakes.

The Solution: MR-GPTQ (The Master Packer)
The authors realized that to make these new bricks work, you can't just use the old packing methods. You need a new strategy. They invented a new algorithm called MR-GPTQ (Micro-Rotated-GPTQ).

Here is the analogy for how it works:

  • The "Rotation" Trick: Imagine the data in the AI model is a messy pile of sand. Some grains are huge (outliers), and most are tiny. The new "smart bricks" (especially the MXFP4 ones) struggle with the huge grains.

    • The authors use a mathematical trick called a Hadamard Transform (think of it as a magical blender). They "blend" the data so that the huge grains get chopped up and mixed evenly with the tiny ones. Now, instead of having a few giant boulders and a pile of sand, you have a uniform pile of gravel.
    • This makes the "smart bricks" much happier because they don't have to deal with extreme outliers anymore.
  • The "Custom Fitting": They also tweaked the "size tags" (scales) specifically for these new formats. Instead of using a generic tag, they calculated the perfect tag for every single box of bricks to minimize waste.

The Result: Speed and Accuracy
They didn't just write the algorithm; they built the actual "machinery" (GPU kernels) to run it on the newest NVIDIA Blackwell chips (like the B200 and RTX 5090).

  • Speed: Because the new bricks are so small, the AI can process them much faster. They achieved speedups of roughly 3.6x to 6x over the FP16 baseline, depending on the GPU, with almost no extra cost for the "blending" trick.
  • Accuracy: By using their new "Master Packer" (MR-GPTQ), they fixed the accuracy issues. The AI using these new lightweight bricks now performs almost as well as the heavy gold version, and in some cases, even better than the old lightweight methods.

In a Nutshell
The paper says: "The new 4-bit floating-point hardware is amazing, but the software to run it was broken. We fixed the software by 'blending' the data and custom-fitting the packaging. Now, you can get massive speed boosts without losing the intelligence of the AI."

It's like taking a new, super-lightweight electric car engine that was too jittery to drive, tuning the suspension and fuel injection, and suddenly realizing it's not only faster but smoother than the old gas engine.
