Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

This paper presents a systematic layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 quantization across three Qwen2.5 model scales, revealing that MLP up- and down-projection layers are the most sensitive components while sensitivity patterns vary by format and model depth rather than being confined to final blocks.

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

Published Wed, 11 Ma

Imagine you have a massive, incredibly detailed library of knowledge (a Large Language Model, or LLM) that can write stories, solve math problems, and chat like a human. The problem is, this library is so huge it requires a warehouse full of supercomputers to run, costing a fortune in electricity and time.

To make this library portable, scientists are trying to shrink the books. They do this through quantization—a fancy word for "simplifying the numbers" inside the model. Instead of using high-definition 32-bit or 16-bit numbers (like a 4K movie), they use 4-bit numbers (like a low-resolution thumbnail).

This paper tests two specific ways of creating these "thumbnails" (called NVFP4 and MXFP4) to see which parts of the library break when you shrink them.
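To make "shrinking the numbers" concrete, here is a minimal, illustrative sketch of block-wise 4-bit quantization (not the authors' code). Both NVFP4 and MXFP4 store 4-bit (E2M1) values plus one shared scale per small block of weights; they differ mainly in block size (16 vs 32 elements) and in how the scale itself is encoded (MXFP4 restricts it to a power of two). The block sizes and grid below follow the public format descriptions; everything else is a simplification.

```python
import numpy as np

# The eight non-negative values representable in FP4 (E2M1).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, power_of_two_scale=False):
    """Quantize one block to FP4 with a shared scale; return the dequantized block."""
    scale = np.max(np.abs(x)) / E2M1_GRID[-1]   # map the block's max onto 6.0
    if scale == 0:
        return np.zeros_like(x)
    if power_of_two_scale:                       # MXFP4-style power-of-two scale
        scale = 2.0 ** np.ceil(np.log2(scale))
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(x) * E2M1_GRID[idx] * scale

def quantize(x, block=16, power_of_two_scale=False):
    """Quantize a 1-D array block by block (block=16 ~ NVFP4, block=32 ~ MXFP4)."""
    x = np.asarray(x, dtype=np.float64)
    return np.concatenate([
        quantize_block(x[i:i + block], power_of_two_scale)
        for i in range(0, len(x), block)
    ])

weights = np.random.default_rng(0).normal(size=64)
err_nv = np.mean((weights - quantize(weights, block=16)) ** 2)
err_mx = np.mean((weights - quantize(weights, block=32, power_of_two_scale=True)) ** 2)
print("NVFP4-style error:", err_nv, "MXFP4-style error:", err_mx)
```

The round-trip error printed at the end is the "blurriness" the paper measures layer by layer: the smaller the block and the finer the scale, the less information a block loses.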

Here is a breakdown of their findings, using simple analogies:

1. The Experiment: The "One-Thing-at-a-Time" Test

Imagine the AI model is a giant orchestra. To see which musicians are most sensitive to playing on cheap, low-quality instruments, the researchers didn't just swap the whole orchestra's gear at once. Instead, they did a controlled experiment:

  • They kept 6 out of 7 sections of the orchestra playing on high-quality instruments.
  • They forced only one section to play on the cheap, 4-bit instruments.
  • They listened to see how much the music (the AI's answer) sounded "off."

They did this for different sections (the "MLP" parts and the "Attention" parts) and for different sizes of orchestras (small, medium, and huge models).
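The protocol above can be sketched in a few lines. The toy two-dimensional "model" and the `fake_quant` rounding below are illustrative stand-ins for a real transformer and a real FP4 quantizer, but the loop itself is the technique: quantize exactly one layer type, keep everything else in full precision, and measure how far the output drifts from the full-precision baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitudes

def fake_quant(w, block=16):
    """Round each block of weights to the nearest FP4-style grid point."""
    flat = w.ravel().copy()
    for i in range(0, flat.size, block):
        b = flat[i:i + block]
        s = np.abs(b).max() / GRID[-1] or 1.0
        mags = np.abs(b) / s
        b[:] = np.sign(b) * GRID[np.argmin(np.abs(mags[:, None] - GRID), axis=1)] * s
    return flat.reshape(w.shape)

# Seven layer types per block, as in the paper's setup (toy weights here).
layers = {name: rng.normal(size=(32, 32)) for name in
          ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]}
x = rng.normal(size=(8, 32))

def forward(weights):
    h = x
    for w in weights.values():
        h = np.tanh(h @ w)   # stand-in for the real transformer block
    return h

baseline = forward(layers)
for target in layers:        # quantize one layer type at a time
    perturbed = {n: (fake_quant(w) if n == target else w) for n, w in layers.items()}
    err = np.mean((forward(perturbed) - baseline) ** 2)
    print(f"{target}: output MSE = {err:.3e}")
```

In the real study the "music sounding off" is measured with standard accuracy and perplexity metrics rather than output MSE, but the one-variable-at-a-time structure is the same.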

2. The Big Discovery: The "Heavy Lifters" Break First

The researchers found that not all parts of the AI are equally fragile.

  • The MLP Layers (The Muscle): These are the parts of the AI that do the heavy lifting—processing information and making decisions.
    • The Finding: The "Up" and "Down" projection layers (the muscles that push and pull data) are extremely sensitive. If you put these on the cheap 4-bit format, the music sounds terrible.
    • The Analogy: Think of these as the engine of a car. If you try to run a Ferrari engine on low-grade, cheap fuel, it sputters and stops. You must keep the engine on high-quality fuel (high precision) for the car to work.
  • The Attention Layers (The Eyes): These parts help the AI focus on the right words in a sentence.
    • The Finding: These are surprisingly tough. They can handle the cheap 4-bit format much better than the muscles.
    • The Analogy: These are like the windshield. You can put a slightly cheaper, lower-resolution glass on the windshield, and the driver can still see the road just fine.

Key Takeaway: You don't need to treat the whole AI the same. You can use cheap, low-precision formats for the "eyes" (Attention) but must keep the "engine" (MLP) on high precision to avoid errors.

3. The Surprise: It's Not Just the End That Matters

For a long time, people thought that in a deep neural network, only the final layers (the very end of the process) mattered most. If the end was wrong, the whole answer was wrong.

  • The Finding: This paper shows that isn't the whole story. While the end is important, the beginning (early blocks) can be just as sensitive, especially under one of the formats (MXFP4).
  • The Analogy: Imagine building a house. Everyone thought only the roof mattered because if the roof leaks, the house is ruined. But this study found that if you build the foundation with cheap, weak bricks, the whole house collapses, even if the roof is perfect. Sometimes, the first few layers need to be built with high-quality materials, not just the last few.

4. The "Outlier" Mystery

The researchers looked at the data to see why the "Down" projection layer was so sensitive. They expected it to be because of "outliers"—extreme, crazy numbers that pop up occasionally (like a sudden, loud scream in a quiet room).

  • The Finding: Yes, the "Down" layer has crazy loud screams (outliers), which helps explain why it breaks easily. But the "Up" layer was just as sensitive, even though it didn't have those loud screams.
  • The Analogy: It's like finding one car engine overheating and blaming the driver for stomping on the gas (the outliers). But then a second car overheats too, even though its driver is driving calmly. That points to a deeper mechanical issue we don't fully understand yet: it's not just about the "loud" numbers; the structure of the layer itself is fragile.
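One simple, hedged way to put a number on "loud screams" is to compare a layer's largest activation to its typical (RMS) activation. The distributions below are synthetic illustrations, not measurements from the paper, but the metric shows why a few extreme values matter: block-wise scales must stretch to cover the loudest value, blurring everything quieter.

```python
import numpy as np

def outlier_ratio(acts):
    """Max-to-RMS ratio: large values suggest a few extreme activations."""
    acts = np.asarray(acts, dtype=np.float64)
    rms = np.sqrt(np.mean(acts ** 2))
    return float(np.max(np.abs(acts)) / rms)

rng = np.random.default_rng(0)
calm = rng.normal(size=10_000)   # well-behaved activations
spiky = calm.copy()
spiky[:10] *= 50.0               # inject a few "loud screams"

print("calm:", outlier_ratio(calm), "spiky:", outlier_ratio(spiky))
```

The paper's surprise, in these terms, is that a layer can score low on a metric like this and still break badly under 4-bit quantization.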

5. Does Size Matter?

They tested small (0.5B), medium (7B), and huge (14B) models.

  • The Finding: Bigger models are generally more sensitive to errors (the music gets worse faster), but the order of sensitivity stays the same. The "muscles" are always the most fragile, and the "eyes" are always the most robust, regardless of how big the orchestra is.

Summary: What Does This Mean for the Future?

This paper is a diagnostic manual for building cheaper, faster AI.

Instead of trying to shrink the entire AI model to 4-bit (which breaks the engine), engineers can now use a hybrid approach:

  1. Keep the "muscle" parts (MLP) on high-quality precision.
  2. Shrink the "eye" parts (Attention) to the cheap 4-bit format.
  3. Pay extra attention to the beginning of the model, not just the end.
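The three-step recipe above amounts to a per-layer precision policy. Here is a minimal sketch of what such a policy could look like; the layer-name patterns and the "protect the first two blocks" rule are illustrative assumptions, not a configuration from the paper.

```python
import re

# Illustrative rules: keep the sensitive MLP projections in high precision,
# quantize the more robust attention projections to 4-bit.
PRECISION_RULES = [
    (r"(up_proj|down_proj|gate_proj)$", "bf16"),  # keep the "engine" high precision
    (r"(q_proj|k_proj|v_proj|o_proj)$", "fp4"),   # the "eyes" tolerate 4-bit
]

def precision_for(layer_name, first_blocks_high=2):
    """Pick a precision for a layer, keeping the earliest blocks in high precision."""
    m = re.search(r"layers\.(\d+)\.", layer_name)
    if m and int(m.group(1)) < first_blocks_high:
        return "bf16"                              # protect the "foundation" too
    for pattern, precision in PRECISION_RULES:
        if re.search(pattern, layer_name):
            return precision
    return "bf16"                                  # default to safety

for name in ["model.layers.0.self_attn.q_proj",
             "model.layers.5.self_attn.q_proj",
             "model.layers.5.mlp.down_proj"]:
    print(name, "->", precision_for(name))
```

In practice the thresholds and patterns would come from a sensitivity sweep like the one in this paper, run on the specific model being deployed.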

This allows us to run powerful AI on cheaper hardware without losing the ability to write good stories or solve hard problems. It's the difference between trying to fit a whole library into a shoebox (impossible) and carefully packing the most important books in a backpack while leaving the rest on a shelf (smart and efficient).