Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

This paper presents a systematic layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 quantization across three Qwen2.5 model scales, revealing that MLP up- and down-projection layers are the most sensitive components while sensitivity patterns vary by format and model depth rather than being confined to final blocks.

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

Published Wed, 11 Ma

Imagine you have a massive, incredibly detailed library of knowledge (a Large Language Model, or LLM) that can write stories, solve math problems, and chat like a human. The problem is, this library is so huge it requires a warehouse full of supercomputers to run, costing a fortune in electricity and time.

To make this library portable, scientists are trying to shrink the books. They do this through quantization—a fancy word for "simplifying the numbers" inside the model. Instead of using high-definition 32-bit or 16-bit numbers (like a 4K movie), they use 4-bit numbers (like a low-resolution thumbnail).

This paper tests two specific ways of creating these "thumbnails" (called NVFP4 and MXFP4) to see which parts of the library break when you shrink them.
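To make "shrinking the numbers" concrete, here is a minimal, illustrative sketch of block-wise 4-bit quantization (not the authors' code). Both NVFP4 and MXFP4 store 4-bit (E2M1) values plus one shared scale per small block of weights; they differ mainly in block size (16 vs 32 elements) and in how the scale itself is encoded (MXFP4 restricts it to a power of two). The block sizes and grid below follow the public format descriptions; everything else is a simplification.

```python
import numpy as np

# The eight non-negative values representable in FP4 (E2M1).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, power_of_two_scale=False):
    """Quantize one block to FP4 with a shared scale; return the dequantized block."""
    scale = np.max(np.abs(x)) / E2M1_GRID[-1]   # map the block's max onto 6.0
    if scale == 0:
        return np.zeros_like(x)
    if power_of_two_scale:                       # MXFP4-style power-of-two scale
        scale = 2.0 ** np.ceil(np.log2(scale))
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(x) * E2M1_GRID[idx] * scale

def quantize(x, block=16, power_of_two_scale=False):
    """Quantize a 1-D array block by block (block=16 ~ NVFP4, block=32 ~ MXFP4)."""
    x = np.asarray(x, dtype=np.float64)
    return np.concatenate([
        quantize_block(x[i:i + block], power_of_two_scale)
        for i in range(0, len(x), block)
    ])

weights = np.random.default_rng(0).normal(size=64)
err_nv = np.mean((weights - quantize(weights, block=16)) ** 2)
err_mx = np.mean((weights - quantize(weights, block=32, power_of_two_scale=True)) ** 2)
print("NVFP4-style error:", err_nv, "MXFP4-style error:", err_mx)
```

The round-trip error printed at the end is the "blurriness" the paper measures layer by layer: the smaller the block and the finer the scale, the less information a block loses.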

Here is a breakdown of their findings, using simple analogies:

1. The Experiment: The "One-Thing-at-a-Time" Test

Imagine the AI model is a giant orchestra. To see which musicians are most sensitive to playing on cheap, low-quality instruments, the researchers didn't just swap the whole orchestra's gear at once. Instead, they did a controlled experiment:

  • They kept 6 out of 7 sections of the orchestra playing on high-quality instruments.
  • They forced only one section to play on the cheap, 4-bit instruments.
  • They listened to see how much the music (the AI's answer) sounded "off."

They did this for different sections (the "MLP" parts and the "Attention" parts) and for different sizes of orchestras (small, medium, and huge models).
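The protocol above can be sketched in a few lines. The toy two-dimensional "model" and the `fake_quant` rounding below are illustrative stand-ins for a real transformer and a real FP4 quantizer, but the loop itself is the technique: quantize exactly one layer type, keep everything else in full precision, and measure how far the output drifts from the full-precision baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitudes

def fake_quant(w, block=16):
    """Round each block of weights to the nearest FP4-style grid point."""
    flat = w.ravel().copy()
    for i in range(0, flat.size, block):
        b = flat[i:i + block]
        s = np.abs(b).max() / GRID[-1] or 1.0
        mags = np.abs(b) / s
        b[:] = np.sign(b) * GRID[np.argmin(np.abs(mags[:, None] - GRID), axis=1)] * s
    return flat.reshape(w.shape)

# Seven layer types per block, as in the paper's setup (toy weights here).
layers = {name: rng.normal(size=(32, 32)) for name in
          ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]}
x = rng.normal(size=(8, 32))

def forward(weights):
    h = x
    for w in weights.values():
        h = np.tanh(h @ w)   # stand-in for the real transformer block
    return h

baseline = forward(layers)
for target in layers:        # quantize one layer type at a time
    perturbed = {n: (fake_quant(w) if n == target else w) for n, w in layers.items()}
    err = np.mean((forward(perturbed) - baseline) ** 2)
    print(f"{target}: output MSE = {err:.3e}")
```

In the real study the "music sounding off" is measured with standard accuracy and perplexity metrics rather than output MSE, but the one-variable-at-a-time structure is the same.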

2. The Big Discovery: The "Heavy Lifters" Break First

The researchers found that not all parts of the AI are equally fragile.

  • The MLP Layers (The Muscle): These are the parts of the AI that do the heavy lifting—processing information and making decisions.
    • The Finding: The "Up" and "Down" projection layers (the muscles that push and pull data) are extremely sensitive. If you put these on the cheap 4-bit format, the music sounds terrible.
    • The Analogy: Think of these as the engine of a car. If you try to run a Ferrari engine on low-grade, cheap fuel, it sputters and stops. You must keep the engine on high-quality fuel (high precision) for the car to work.
  • The Attention Layers (The Eyes): These parts help the AI focus on the right words in a sentence.
    • The Finding: These are surprisingly tough. They can handle the cheap 4-bit format much better than the muscles.
    • The Analogy: These are like the windshield. You can put a slightly cheaper, lower-resolution glass on the windshield, and the driver can still see the road just fine.

Key Takeaway: You don't need to treat the whole AI the same. You can use cheap, low-precision formats for the "eyes" (Attention) but must keep the "engine" (MLP) on high precision to avoid errors.

3. The Surprise: It's Not Just the End That Matters

For a long time, people thought that in a deep neural network, only the final layers (the very end of the process) mattered most. If the end was wrong, the whole answer was wrong.

  • The Finding: This paper shows that isn't the whole story. While the end is important, the beginning (early blocks) can be just as sensitive, especially under one of the formats (MXFP4).
  • The Analogy: Imagine building a house. Everyone thought only the roof mattered because if the roof leaks, the house is ruined. But this study found that if you build the foundation with cheap, weak bricks, the whole house collapses, even if the roof is perfect. Sometimes, the first few layers need to be built with high-quality materials, not just the last few.

4. The "Outlier" Mystery

The researchers looked at the data to see why the "Down" projection layer was so sensitive. They expected it to be because of "outliers"—extreme, crazy numbers that pop up occasionally (like a sudden, loud scream in a quiet room).

  • The Finding: Yes, the "Down" layer has crazy loud screams (outliers), which helps explain why it breaks easily. But the "Up" layer was just as sensitive, even though it didn't have those loud screams.
  • The Analogy: It's like finding one car engine overheating and blaming the driver for stomping on the gas (the outliers). But then a second car overheats too, even though its driver is driving calmly. That points to a deeper mechanical issue we don't fully understand yet: it's not just about the "loud" numbers; the structure of the layer itself is fragile.
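One simple, hedged way to put a number on "loud screams" is to compare a layer's largest activation to its typical (RMS) activation. The distributions below are synthetic illustrations, not measurements from the paper, but the metric shows why a few extreme values matter: block-wise scales must stretch to cover the loudest value, blurring everything quieter.

```python
import numpy as np

def outlier_ratio(acts):
    """Max-to-RMS ratio: large values suggest a few extreme activations."""
    acts = np.asarray(acts, dtype=np.float64)
    rms = np.sqrt(np.mean(acts ** 2))
    return float(np.max(np.abs(acts)) / rms)

rng = np.random.default_rng(0)
calm = rng.normal(size=10_000)   # well-behaved activations
spiky = calm.copy()
spiky[:10] *= 50.0               # inject a few "loud screams"

print("calm:", outlier_ratio(calm), "spiky:", outlier_ratio(spiky))
```

The paper's surprise, in these terms, is that a layer can score low on a metric like this and still break badly under 4-bit quantization.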

5. Does Size Matter?

They tested small (0.5B), medium (7B), and huge (14B) models.

  • The Finding: Bigger models are generally more sensitive to errors (the music gets worse faster), but the order of sensitivity stays the same. The "muscles" are always the most fragile, and the "eyes" are always the most robust, regardless of how big the orchestra is.

Summary: What Does This Mean for the Future?

This paper is a diagnostic manual for building cheaper, faster AI.

Instead of trying to shrink the entire AI model to 4-bit (which breaks the engine), engineers can now use a hybrid approach:

  1. Keep the "muscle" parts (MLP) on high-quality precision.
  2. Shrink the "eye" parts (Attention) to the cheap 4-bit format.
  3. Pay extra attention to the beginning of the model, not just the end.
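The three-step recipe above amounts to a per-layer precision policy. Here is a minimal sketch of what such a policy could look like; the layer-name patterns and the "protect the first two blocks" rule are illustrative assumptions, not a configuration from the paper.

```python
import re

# Illustrative rules: keep the sensitive MLP projections in high precision,
# quantize the more robust attention projections to 4-bit.
PRECISION_RULES = [
    (r"(up_proj|down_proj|gate_proj)$", "bf16"),  # keep the "engine" high precision
    (r"(q_proj|k_proj|v_proj|o_proj)$", "fp4"),   # the "eyes" tolerate 4-bit
]

def precision_for(layer_name, first_blocks_high=2):
    """Pick a precision for a layer, keeping the earliest blocks in high precision."""
    m = re.search(r"layers\.(\d+)\.", layer_name)
    if m and int(m.group(1)) < first_blocks_high:
        return "bf16"                              # protect the "foundation" too
    for pattern, precision in PRECISION_RULES:
        if re.search(pattern, layer_name):
            return precision
    return "bf16"                                  # default to safety

for name in ["model.layers.0.self_attn.q_proj",
             "model.layers.5.self_attn.q_proj",
             "model.layers.5.mlp.down_proj"]:
    print(name, "->", precision_for(name))
```

In practice the thresholds and patterns would come from a sensitivity sweep like the one in this paper, run on the specific model being deployed.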

This allows us to run powerful AI on cheaper hardware without losing the ability to write good stories or solve hard problems. It's the difference between trying to fit a whole library into a shoebox (impossible) and carefully packing the most important books in a backpack while leaving the rest on a shelf (smart and efficient).