Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

This paper introduces Micro-Rotated-GPTQ (MR-GPTQ), a specialized quantization algorithm that overcomes the accuracy limitations of current FP4 formats (MXFP4 and NVFP4) through block-wise Hadamard transforms and format-specific optimizations, achieving significant speedups on modern GPUs while matching or exceeding state-of-the-art performance.

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Published 2026-03-04

Imagine you have a massive, incredibly detailed library of knowledge (a Large Language Model, or LLM) that can write stories, solve math problems, and chat with you. This library is currently written in FP16, which is like using high-quality, heavy, gold-plated bricks. It's accurate, but it's heavy, expensive to store, and slow to move around.

To make this library easier to carry and faster to use, researchers have been trying to replace those gold bricks with tiny, lightweight bricks. For a long time, the best "lightweight" bricks were INT4 (4-bit integers). They were small and fast, but sometimes the library lost a bit of its "soul" (accuracy) because the bricks were too rigid.

Recently, hardware giants like NVIDIA and AMD introduced a new type of brick called Microscaling FP4. Think of these as "smart" lightweight bricks. Instead of just being a number, they come in little groups (like a box of 16 or 32 bricks) that share a single "size tag" (a scale). The promise was that these smart bricks would be lighter than gold but just as accurate.
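To make the "box of bricks with one size tag" concrete, here is a minimal sketch of block-wise FP4 quantization: a small group of values shares a single scale, and each value snaps to the nearest magnitude representable in FP4 (E2M1). This is an illustrative simplification, not the paper's code; the block size, scale choice, and rounding are assumptions for readability.

```python
# Illustrative sketch of microscaling quantization (NOT the paper's code):
# one block of values shares a single scale, and each value is rounded
# to the nearest FP4 (E2M1) magnitude. Real formats also encode the
# scale itself (power-of-two for MXFP4, FP8 for NVFP4).

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def quantize_block(block):
    """Quantize a block of floats using one shared scale (simplified)."""
    amax = max(abs(x) for x in block)
    # Simple scale choice: map the block's largest magnitude to 6.0,
    # the largest FP4 value. (The paper optimizes this choice per block.)
    scale = amax / 6.0 if amax > 0 else 1.0

    def to_fp4(x):
        # Snap |x| / scale to the nearest representable FP4 magnitude.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        return mag * scale * (1.0 if x >= 0 else -1.0)

    return [to_fp4(x) for x in block]

block = [0.1, -0.4, 2.5, 0.05, -1.2, 0.3, 0.8, -0.02]
print(quantize_block(block))  # small values collapse to 0, large ones survive
```

Note how the shared scale ties the whole block's fate to its largest member: one big outlier stretches the scale and crushes the small values toward zero, which is exactly the failure mode the paper's rotation trick addresses.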

The Problem: The Promise vs. The Reality
The authors of this paper went into the warehouse to test these new "smart bricks" (MXFP4 and NVFP4). They found a disappointing gap between the marketing hype and reality:

  1. The "NVFP4" Box (16 bricks): It's a bit more precise, but the standard way of packing these bricks (using simple rounding) actually made the library worse at certain tasks. It was like trying to fit a square peg in a round hole; the standard tools just didn't work well with this specific box shape.
  2. The "MXFP4" Box (32 bricks): This one is even lighter, but it uses a very rough "size tag" (rounding sizes to powers of two). This caused a lot of "crumbling" in the library's accuracy. It was like trying to measure a delicate cake with a ruler that only has inch marks—you lose all the fine details.
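The "rough size tag" of MXFP4 can be seen in a few lines. Below is a hedged sketch of the difference between an ideal per-block scale and one rounded to a power of two (a stand-in for MXFP4's E8M0 scale encoding; real implementations may pick the power of two differently, e.g. by flooring):

```python
# Hedged illustration of MXFP4's coarse scales: the ideal block scale is
# rounded to the nearest power of two. The exact rounding rule used in
# real MXFP4 implementations may differ; this is a simplified assumption.
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def ideal_scale(amax):
    """Fine-grained scale: map the block max exactly onto FP4's max value."""
    return amax / FP4_MAX

def pow2_scale(amax):
    """Power-of-two scale, as MXFP4's E8M0 encoding requires."""
    return 2.0 ** round(math.log2(ideal_scale(amax)))

amax = 5.0
print(ideal_scale(amax))  # fine scale, ~0.83
print(pow2_scale(amax))   # forced to 1.0, ~20% coarser than ideal
```

NVFP4's FP8 scales sit much closer to the ideal value; MXFP4's power-of-two constraint is the "ruler with only inch marks" from the analogy above.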

Basically, the new hardware was ready, but the software (the algorithms) to use it was broken. If you just plugged these new formats into existing tools, your AI would start making silly mistakes.

The Solution: MR-GPTQ (The Master Packer)
The authors realized that to make these new bricks work, you can't just use the old packing methods. You need a new strategy. They invented a new algorithm called MR-GPTQ (Micro-Rotated-GPTQ).

Here is the analogy for how it works:

  • The "Rotation" Trick: Imagine the data in the AI model is a messy pile of sand. Some grains are huge (outliers), and most are tiny. The new "smart bricks" (especially the MXFP4 ones) struggle with the huge grains.

    • The authors use a mathematical trick called a Hadamard Transform (think of it as a magical blender). They "blend" the data so that the huge grains get chopped up and mixed evenly with the tiny ones. Now, instead of having a few giant boulders and a pile of sand, you have a uniform pile of gravel.
    • This makes the "smart bricks" much happier because they don't have to deal with extreme outliers anymore.
  • The "Custom Fitting": They also tweaked the "size tags" (scales) specifically for these new formats. Instead of using a generic tag, they calculated the perfect tag for every single box of bricks to minimize waste.

The Result: Speed and Accuracy
They didn't just write the algorithm; they built the actual "machinery" (GPU kernels) to run it on the newest NVIDIA Blackwell chips (like the B200 and RTX 5090).

  • Speed: Because the new bricks are so small, the AI can process them much faster. They achieved speedups of roughly 3.6x to 6x over the FP16 baseline, depending on the GPU, with almost no extra cost for the "blending" trick.
  • Accuracy: By using their new "Master Packer" (MR-GPTQ), they fixed the accuracy issues. The AI using these new lightweight bricks now performs almost as well as the heavy gold version, and in some cases, even better than the old lightweight methods.

In a Nutshell
The paper says: "The new 4-bit floating-point hardware is amazing, but the software to run it was broken. We fixed the software by 'blending' the data and custom-fitting the packaging. Now, you can get massive speed boosts without losing the intelligence of the AI."

It's like taking a new, super-lightweight electric car engine that was too jittery to drive, tuning the suspension and fuel injection, and suddenly realizing it's not only faster but smoother than the old gas engine.
