The Big Problem: The "High-Res" vs. "Low-Bandwidth" Dilemma
Imagine you have a Master Chef (the AI model) who can cook incredible, complex dishes. This chef has a massive kitchen with every tool imaginable (High Precision/BF16). They can taste a dish and adjust the seasoning with microscopic precision.
However, you want to send this chef to a tiny, remote campsite where the entire kitchen is a small portable stove and a handful of tools (Low Precision/NVFP4).
If you just tell the chef, "Go cook on this tiny stove," the food comes out tasting bland or burnt. The chef is too used to the big kitchen; the tiny stove changes how they cook, and the quality drops. This is what happens when we shrink AI models to save memory and speed them up: they lose their "taste" (accuracy).
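To make the "lost taste" concrete, here is a minimal numpy sketch of what shrinking a model's numbers does. This is not the paper's actual NVFP4 format; it is a simplified stand-in (round-to-nearest on a symmetric 4-bit grid with one scale per tensor) just to show that low precision introduces rounding error:

```python
import numpy as np

def fake_quantize(x, num_bits=4):
    """Simulate low-precision storage: snap each value to the nearest
    level of a symmetric grid, then map back to float (a round trip)."""
    levels = 2 ** (num_bits - 1) - 1       # 7 positive levels for 4 bits
    scale = np.max(np.abs(x)) / levels     # one scale per tensor (a simplification)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)  # stand-in for model weights
quantized = fake_quantize(weights)

# Every weight has been nudged off its original value: the "bland food" effect.
error = np.mean(np.abs(weights - quantized))
print(f"mean absolute rounding error: {error:.4f}")
```

Real formats like NVFP4 use per-block scaling and a floating-point grid, but the core issue is the same: fewer representable values means every weight gets nudged, and those nudges add up across billions of parameters.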
The Old Solution: "Re-Learning" (Quantization-Aware Training)
For a long time, the solution was to make the chef re-learn how to cook on the tiny stove from scratch.
- The Analogy: You give the chef a cookbook and say, "Practice cooking on this small stove until you get it right."
- The Problem: This is hard.
- Complex Recipes: Modern AI models have gone through many "training stages" (learning math, then coding, then being polite, then learning to reason). Trying to re-train them on a small stove often messes up the skills they already learned. It's like trying to teach a master pianist to play jazz on a toy piano; they might forget how to play classical music.
- Missing Ingredients: Sometimes, you don't have the original cookbook (training data) anymore. You only have a few scraps of paper.
The New Solution: "The Shadow Chef" (Quantization-Aware Distillation - QAD)
This paper introduces a smarter way called Quantization-Aware Distillation (QAD). Instead of making the student chef re-learn from a textbook, you pair them with a Shadow Chef.
Here is how it works:
- The Setup: You have the Master Chef (the original, high-quality AI) and the Student Chef (the tiny, compressed AI).
- The Task: You give them both the same ingredients (a prompt).
- The Magic: The Master Chef doesn't just say, "Make a burger." Instead, the Master Chef whispers the exact flavor profile, the texture, and the perfect seasoning ratio into the Student Chef's ear.
- The Goal: The Student Chef doesn't try to guess the answer from a textbook. They just try to mimic the Master Chef's output perfectly.
In technical terms, the paper uses something called KL Divergence. Think of this as a "Mimicry Score." The computer measures how closely the Student Chef's "flavor profile" matches the Master Chef's. The goal is to make that score zero.
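The "Mimicry Score" idea can be sketched in a few lines of numpy. The logits below are toy numbers, not from the paper; the point is only that KL divergence is large when the Student's distribution disagrees with the Master's, small when it is close, and exactly zero on a perfect match:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(teacher_logits, student_logits):
    """Mimicry score: KL(teacher || student) over next-token distributions.
    Zero means the Student's 'flavor profile' matches the Master's exactly."""
    p = softmax(teacher_logits)   # Master Chef's distribution
    q = softmax(student_logits)   # Student Chef's distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher       = np.array([2.0, 1.0, 0.1])  # toy next-token logits
student_far   = np.array([0.1, 1.0, 2.0])  # preferences reversed
student_close = np.array([1.9, 1.1, 0.2])  # nearly matching

print(kl_divergence(teacher, student_far))    # large score: poor mimicry
print(kl_divergence(teacher, student_close))  # small score: close mimicry
print(kl_divergence(teacher, teacher))        # 0.0: perfect match
```

During QAD training, this score would be computed over every token position and driven toward zero by gradient descent on the quantized Student's weights.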
Why This is a Game-Changer
The paper found three amazing things about this "Shadow Chef" method:
1. It Works Even for "Multi-Stage" Chefs
Modern AI models are like chefs who have taken classes in French cuisine, then Japanese, then Molecular Gastronomy.
- Old Way (QAT): Trying to re-train the whole thing on a small stove often breaks the complex skills they learned in the later stages (like Reinforcement Learning).
- New Way (QAD): Because the Student Chef is just copying the Master Chef's current output, it doesn't matter how complex the Master's training was. The Student just copies the final result. It's stable and reliable.
2. It Works Even with "Bad" Ingredients (Data Robustness)
Usually, to teach a chef, you need a perfect library of recipes.
- The Surprise: The paper found that you can teach the Student Chef using random scraps of paper, or even just fake recipes generated by the Master Chef.
- The Analogy: Even if you give the Student Chef a list of random words or a few math problems, as long as the Master Chef is whispering the "correct flavor" for those words, the Student learns the style of the Master. It doesn't need the whole library; it just needs to hear the Master's voice.
3. It Transfers Knowledge Across Fields
Imagine the Master Chef is famous for both Baking and Grilling.
- You only give the Student Chef ingredients for Grilling.
- The Magic: Because the Student is copying the Master's mindset and flavor profile, the Student accidentally gets really good at Baking too, even though they never practiced it!
- The Result: The paper showed that a model trained only on code data could still solve math problems perfectly, because it was mimicking a Master Chef who knew both.
The "Secret Sauce" (Technical Details Simplified)
- The Loss Function (The Scorecard): The paper shows that measuring "how close the flavors are" (KL Divergence against the Master's full output distribution) works much better than measuring "how close the final dish is to a textbook recipe" (Cross-Entropy against the single correct answer). It's about capturing the soul of the dish, not just the ingredients.
- The Learning Rate (The Pace): You have to teach the Student Chef at the right speed. Too fast and they get confused (training destabilizes); too slow and they barely improve. The paper found that some models need a very gentle pace (a tiny learning rate), while others do best with a slightly faster one.
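The loss-function point above can be made concrete with a toy comparison (illustrative numbers, not the paper's data). Two students can put the same probability on the "correct" token, so cross-entropy against a one-hot label can't tell them apart; only KL divergence notices that one of them also copied the Master's full flavor profile:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two probability distributions."""
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(q, label):
    """Loss against a one-hot 'textbook recipe': only the correct token counts."""
    return float(-np.log(q[label]))

teacher   = np.array([0.6, 0.3, 0.1])  # the Master's full flavor profile
student_a = np.array([0.6, 0.3, 0.1])  # copies the whole profile
student_b = np.array([0.6, 0.1, 0.3])  # right answer, wrong profile

# Both students score identically against the one-hot label (token 0)...
print(cross_entropy(student_a, 0), cross_entropy(student_b, 0))
# ...but KL sees that only student_a captured the Master's distribution.
print(kl(teacher, student_a), kl(teacher, student_b))
```

This is why distillation's "soft" targets carry more signal than "hard" labels: the Master's second and third choices encode knowledge that a one-hot recipe throws away.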
The Bottom Line
This paper is like a guidebook for downsizing a luxury car into a compact car without losing the engine's power.
Instead of trying to rebuild the engine from scratch on a smaller frame (which is hard and risky), they built a "ghost engine" inside the compact car that perfectly copies the movements of the luxury engine.
The Result: You get a tiny, fast, energy-efficient AI (NVFP4) that tastes exactly like the giant, expensive one (BF16), and you can do it even if you don't have all the original training data. It's a "cheat code" for making AI faster without making it dumber.