Imagine you are trying to fit a massive, 397-billion-piece Lego castle (a giant AI model) into a small backpack (a standard workstation computer).

Normally, to make the castle fit, you try to smash the bricks down into tiny, uniform dust (standard compression). But this ruins the castle; the details get blurry, and the AI starts making mistakes.

XFP is a new, smarter way to pack that castle. Instead of smashing everything, it uses a "Quality-First" approach. Here is how it works, using simple analogies:

1. The "Quality Floor" Instead of "Bit Width"

In the old way, you tell the computer: "Pack this into 4 bits per brick." The computer does its best, but if the bricks are weird shapes, the result is bad.

XFP flips this. You tell the computer: "I need the castle to look at least 96% like the original."
The computer then figures out the rest on its own. It asks, "Okay, to keep 96% quality, how many bits do I actually need for this specific part?"

For some parts, it might only need 2 bits.
For other parts, it might need 4 bits.
It doesn't force a one-size-fits-all rule.

2. The "Outlier" Problem (The Giant Bricks)

Imagine that in your Lego castle, 99% of the bricks are small and normal, but 1% are giant, heavy, weird-shaped bricks.

Old Method: You try to squash the giant bricks to fit the small box. This distorts the whole box, making the small bricks look bad too.
XFP Method: It says, "Let's take those giant, weird bricks out and put them in a special, separate pocket."
- The Special Pocket holds the giant bricks in their original, high-quality form (using a bit more space, but only for the few that need it).
- The Main Box only holds the 99% of normal bricks, which are now easy to compress because they are all similar.

3. The "Custom Dictionary" (The Codebook)

Instead of using one fixed, off-the-shelf dictionary for the whole model, XFP learns a custom dictionary for every single layer of the AI.

It looks at the bricks in a specific room of the castle and learns exactly what shapes appear most often.
It creates a tiny list (a "codebook") of just those shapes.
Then, instead of storing the whole brick, it just stores a tiny number (an index) pointing to that shape in the list.
Because the list is custom-made for that specific room, it's incredibly efficient.

4. The Two Rules (Strict vs. Lazy)

The AI has different types of rooms:

The "Strict" Rooms (like the attention mechanism, which has to track fine details): XFP is very careful here. It uses a high-quality dictionary and keeps more giant bricks in the special pocket.
The "Lazy" Rooms (like the "routed experts," which are specialized workers): These rooms are more flexible. XFP uses a smaller, cruder dictionary here because these parts of the AI can tolerate being squished more without breaking.

This allows XFP to save massive amounts of space without ruining the AI's brain.

5. The "H-Process" (The Tight Squeeze)

The paper describes a specific challenge: fitting a 397-billion-parameter model onto two 96 GB workstation graphics cards (which normally can't hold it).

They used a process called the H-Process.
Think of it like a game of "Goldilocks." They kept adjusting the "Strict" and "Lazy" rules.
- If they made the rules too strict, the model wouldn't fit in the backpack (Out of Memory).
- If they made the rules too loose, the AI started talking nonsense (Garbage Output).
They found the "Just Right" setting (called H1.5) where the model fits perfectly, runs fast, and still gives good answers.

The Results

Speed: On a high-end workstation computer, XFP runs 49% faster than the current best standard method (Marlin INT4) for a 122-billion-parameter model.
Quality: It keeps the AI's accuracy on par with standard 4-bit methods. The 122-billion-parameter model scores 94.49% on the GSM8K math benchmark.
Accessibility: It allows researchers to run massive AI models on standard, powerful desktop computers without needing expensive, data-center supercomputers.

In short: XFP is a smart packing system that doesn't force everything into a uniform box. It separates the weird, heavy items, learns custom dictionaries for the rest, and lets you decide how much quality you want to keep, automatically figuring out the most efficient way to fit it all in.

Technical Summary: XFP – Quality-Targeted Adaptive Codebook Quantization

1. Problem Statement

The paper addresses the limitations of current Large Language Model (LLM) quantization strategies when deploying massive Mixture-of-Experts (MoE) models on workstation-class hardware (e.g., NVIDIA RTX PRO 6000 Blackwell). The authors identify three structural problems in existing approaches:

The Outlier Problem: Transformer weight matrices exhibit heavy-tailed distributions. Uniform linear quantization (INT-N) or fixed logarithmic codebooks (like NVFP4) must accommodate extreme outliers, forcing a single scale factor that degrades resolution for the dense central bulk of weights.
The MoE Heterogeneity Problem: Different components within MoE models (attention, shared experts, routed experts) possess distinct weight distributions. Routed experts are empirically more tolerant of coarse quantization than attention layers. A single global quality floor or bit-width fails to exploit this asymmetry.
The Codebook-Storage Problem: Per-channel learned codebooks consume significant Shared Memory (SMEM) on GPUs. On workstation-class GPUs (99 KB SMEM/CTA), this limits the activation-row cache size, forcing "split-K" operations that increase latency and limiting the maximum model size deployable in a single stream.

Furthermore, the paper notes that while NVIDIA's NVFP4 format offers hardware-accelerated throughput on datacenter Blackwell chips (SM100+), this path is unavailable or incomplete on workstation hardware (SM120/121), leaving operators with formats that ignore specific weight distributions.
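
The Outlier Problem described above can be illustrated with a small sketch. It is not taken from the paper: the symmetric INT4-style quantizer, the toy weight values, and the function names are all assumptions chosen to show how one extreme weight stretches a shared scale factor until the dense bulk collapses onto a single level.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization with ONE scale factor for the whole group."""
    levels = 2 ** (bits - 1) - 1                 # 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy data (assumed): a dense bulk in [-0.5, 0.5] and a single extreme weight.
bulk = [i / 100.0 for i in range(-50, 51)]
with_outlier = bulk + [8.0]

# Error on the bulk when the scale is set by the bulk itself.
err_clean = mean_abs_error(bulk, quantize_uniform(bulk, bits=4))

# Error on the same bulk when the outlier dictates the scale.
restored = quantize_uniform(with_outlier, bits=4)
err_with_outlier = mean_abs_error(bulk, restored[:len(bulk)])

print(f"bulk error without outlier: {err_clean:.4f}")
print(f"bulk error with outlier:    {err_with_outlier:.4f}")
```

In this toy case the outlier widens the quantization step so much that every bulk weight rounds to zero, which is the resolution loss the paper attributes to a single shared scale.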

2. Methodology: The XFP Framework

XFP inverts the conventional quantization workflow. Instead of the operator selecting a target bit-width and accepting the resulting quality loss, the operator specifies reconstruction quality floors based on per-channel cosine similarity. XFP then automatically determines the necessary codebook size, outlier budget, and packing strategy for each layer.

Core Mechanisms

Sparse Outlier Separation:
Weights are decomposed into a sparse high-precision residual and a dense bulk.
- Outliers are defined as weights where $|w - \mu| > k\sigma$ (default $k=4.0$, capped at 2% of weights).
- These outliers are stored as a sparse fp16 matrix at their original positions.
- In the remaining "bulk", the outlier positions are filled with the group mean, and the bulk is then quantized.
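
A minimal sketch of this separation step, assuming one flat list of weights per group. The function names are hypothetical, and two details are assumptions rather than facts from the paper: the 2% cap is enforced by keeping the most extreme candidates, and outlier slots are filled with the mean of the non-outlier weights.

```python
import statistics


def separate_outliers(group, k=4.0, max_fraction=0.02):
    """Split a weight group into a dense bulk and a sparse {position: value} map."""
    mu = statistics.fmean(group)
    sigma = statistics.pstdev(group)
    # Candidates lie more than k standard deviations from the group mean.
    candidates = [i for i, w in enumerate(group) if abs(w - mu) > k * sigma]
    # Assumption: the cap keeps only the most extreme candidates.
    budget = int(max_fraction * len(group))
    candidates.sort(key=lambda i: abs(group[i] - mu), reverse=True)
    outliers = {i: group[i] for i in sorted(candidates[:budget])}
    # Assumption: outlier slots are filled with the mean of the remaining weights.
    kept = [w for i, w in enumerate(group) if i not in outliers]
    fill = statistics.fmean(kept)
    bulk = [fill if i in outliers else w for i, w in enumerate(group)]
    return bulk, outliers


def restore(bulk, outliers):
    """Scatter the sparse outliers back onto their original positions."""
    row = list(bulk)
    for position, value in outliers.items():
        row[position] = value
    return row


# Toy group (assumed): 199 small weights and one extreme value.
group = [((i % 21) - 10) / 100 for i in range(199)] + [50.0]
bulk, outliers = separate_outliers(group)
print(outliers)
```

After separation the bulk spans a narrow range, so a small codebook can cover it, while the sparse map preserves the extreme value exactly.
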
Adaptive Learned Codebooks:
The bulk distribution is quantized using a per-group learned codebook via Lloyd iteration.
- V2 Mode (Per-Channel): Stores one independent fp16 codebook per output channel. Supports bit widths $N \in \{2, 3, 4, 5, 6\}$.
- V2a Mode (Shared Library): Stores a small library of $L$ codebooks (default $L=32$) per layer. Each weight group is assigned the best-fitting library entry. This reduces SMEM usage from $O(\text{output\_dim})$ to a constant $\sim 1$ KB, enabling larger context windows on workstation GPUs.
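
The two modes can be sketched with a scalar Lloyd iteration followed by a library lookup. This is an illustration under stated assumptions, not the paper's implementation: the quantile initialisation, the fixed iteration count, the squared-error fit criterion, and all function names are hypothetical.

```python
def nearest(value, codebook):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda c: abs(value - codebook[c]))


def lloyd_codebook(values, bits, iterations=20):
    """Learn a 2**bits-entry scalar codebook with Lloyd's algorithm."""
    size = 2 ** bits
    ordered = sorted(values)
    # Assumption: initialise centroids at evenly spaced quantiles.
    codebook = [ordered[(2 * j + 1) * len(ordered) // (2 * size)] for j in range(size)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        buckets = [[] for _ in range(size)]
        for v in values:
            buckets[nearest(v, codebook)].append(v)
        # Update step: each centroid moves to the mean of its bucket.
        codebook = [sum(b) / len(b) if b else codebook[j] for j, b in enumerate(buckets)]
    return codebook


def encode(values, codebook):
    """Replace each weight with the index of its nearest codebook entry."""
    return [nearest(v, codebook) for v in values]


def pick_library_entry(group, library):
    """V2a-style assignment: the library codebook with the lowest squared error."""
    def squared_error(codebook):
        return sum(min((v - c) ** 2 for c in codebook) for v in group)
    return min(range(len(library)), key=lambda i: squared_error(library[i]))


values = [-1.0, -0.98, -1.02, -0.3, -0.31, -0.29, 0.3, 0.31, 0.29, 1.0, 1.02, 0.98]
codebook = lloyd_codebook(values, bits=2)
indices = encode(values, codebook)
print(codebook, indices)
```

As a rough consistency check on the storage claim: if each of the 32 library codebooks held 16 fp16 entries (the $N=4$ case), the library would occupy 32 × 16 × 2 = 1024 bytes, which matches the $\sim 1$ KB figure. That breakdown is an inference, not a number stated in the source.
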
Quality-Targeted Auto-Selection (Two-Threshold):
The system uses two distinct cosine similarity thresholds to drive bit-width selection:
- Strict Floor ($\tau_{strict}$): Applied to attention and shared-expert projections.
- Lazy Floor ($\tau_{lazy}$): Applied to routed-expert MoE paths.
- The algorithm iterates through candidate bit widths ($N$) and selects the minimum $N$ that satisfies the relevant threshold. This allows routed experts to use coarser quantization (e.g., $N=2$) while attention layers use finer quantization (e.g., $N=4$) without manual intervention.
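
The selection loop can be sketched as follows. Several parts are assumptions made for illustration: the threshold values are invented, the floor is checked against the worst channel, and `quantize_stub` is a plain uniform quantizer standing in for XFP's learned codebooks.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 1.0


def quantize_stub(channel, bits):
    """Stand-in quantizer (uniform over the min/max range), NOT the XFP codebook."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        return list(channel)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in channel]


# Hypothetical threshold values; the source does not state the actual floors.
THRESHOLDS = {"strict": 0.999, "lazy": 0.98}


def select_bits(channels, kind, candidates=(2, 3, 4, 5, 6)):
    """Return the smallest N whose worst per-channel cosine meets the floor."""
    tau = THRESHOLDS[kind]
    for bits in candidates:
        worst = min(cosine_similarity(ch, quantize_stub(ch, bits)) for ch in channels)
        if worst >= tau:
            return bits
    return candidates[-1]        # fall back to the widest candidate


# Toy layer (assumed): two channels of smooth synthetic weights.
layer = [[math.sin(i) for i in range(64)], [math.cos(0.5 * i) for i in range(64)]]
attention_bits = select_bits(layer, "strict")
expert_bits = select_bits(layer, "lazy")
print(attention_bits, expert_bits)
```

Because the candidates are tried in ascending order, a layer judged against the lazy floor can never receive more bits than the same layer judged against the strict floor.
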
Fused Decode Kernel:
A custom kernel fuses four operations: depacking indices, gathering from the codebook, scattering sparse outliers, and performing the GEMM. It operates natively in bf16 and caches activation rows in SMEM to minimize global memory bandwidth usage.
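
An unfused reference version of those four steps helps show what the kernel computes. This is only a sketch of the arithmetic: the bit-packing layout (a little-endian stream held in one Python integer), the function names, and the use of a matrix-vector product in place of a full GEMM are assumptions, and nothing here reflects the actual GPU kernel, bf16 handling, or SMEM caching.

```python
def pack_indices(indices, bits):
    """Pack small integers into one bitstream (assumed little-endian layout)."""
    packed = 0
    for position, index in enumerate(indices):
        packed |= index << (position * bits)
    return packed


def unpack_indices(packed, bits, count):
    mask = (1 << bits) - 1
    return [(packed >> (position * bits)) & mask for position in range(count)]


def decode_row(packed, bits, count, codebook, outliers):
    indices = unpack_indices(packed, bits, count)   # 1. depack indices
    row = [codebook[i] for i in indices]            # 2. gather from the codebook
    for position, value in outliers.items():        # 3. scatter sparse outliers
        row[position] = value
    return row


def matvec(packed_rows, bits, codebooks, outlier_maps, x):
    """4. multiply: one dot product per output channel."""
    result = []
    for packed, codebook, outliers in zip(packed_rows, codebooks, outlier_maps):
        row = decode_row(packed, bits, len(x), codebook, outliers)
        result.append(sum(w * xi for w, xi in zip(row, x)))
    return result


codebook = [-1.0, -0.5, 0.5, 1.0]
packed = pack_indices([0, 3, 2, 1], bits=2)
y = matvec([packed], 2, [codebook], [{1: 8.0}], [1.0, 1.0, 1.0, 1.0])
print(y)
```

Fusing these steps means the decoded row never has to be written back to global memory, which is where the bandwidth saving described above comes from.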

3. Key Contributions

The paper outlines three primary contributions:

(C1) The XFP Quantizer: A dynamic, quality-targeted quantizer that converges to effective bit widths of $\sim 3.97$ on Qwen3.5-122B and $\sim 3.4$ on Qwen3.5-397B under the "H-Process" (a constrained compression iteration). It requires no Hessian computation, no calibration data, and no manual bit-width selection.
(C2) Dual Storage Modes with Unified Engine: Two storage modes (V2 and V2a) share a single auto-select frontend and fused decode kernel. This covers the 30B–397B parameter range on workstation GPUs without relying on NVFP4 Tensor Cores.
(C3) The H-Process: A quality-driven iteration over the two cosine thresholds that successfully fits the 397B-parameter Qwen3.5-397B-A17B model (512 routed experts/layer) into 2×96 GB of VRAM. It achieves this at $\sim 3.4$ effective bits while retaining the full expert population, avoiding the need for expert pruning.
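
The memory arithmetic behind (C3), and the general shape of a threshold search, can be sketched as below. This is not the paper's procedure. The candidate threshold pairs, the table mapping them to effective bits, and the quality check are all invented placeholders; only the parameter count, the 2×96 GB budget, and the $\sim 3.4$ effective bits come from the summary above. The sketch treats 1 GB as $10^9$ bytes and ignores KV cache and activation memory.

```python
def weight_bytes(params, effective_bits):
    """Storage needed for the weights alone at a given effective bit width."""
    return params * effective_bits / 8


def h_process(candidates, params, budget_bytes, measure_bits, passes_quality):
    """Walk (tau_strict, tau_lazy) pairs from strictest to loosest and return
    the first one that both fits the budget and passes the quality check."""
    for tau_strict, tau_lazy in candidates:
        bits = measure_bits(tau_strict, tau_lazy)
        if weight_bytes(params, bits) > budget_bytes:
            continue                      # too strict: weights do not fit
        if passes_quality(tau_strict, tau_lazy):
            return tau_strict, tau_lazy, bits
    return None                           # nothing fits with acceptable quality


PARAMS = 397e9
BUDGET = 2 * 96e9                         # two 96 GB GPUs, weights only

# Invented placeholder table: threshold pair -> resulting effective bits.
TOY_BITS = {(0.999, 0.99): 4.0, (0.999, 0.98): 3.4, (0.99, 0.95): 2.9}
candidates = list(TOY_BITS)

result = h_process(
    candidates,
    PARAMS,
    BUDGET,
    measure_bits=lambda s, l: TOY_BITS[(s, l)],
    passes_quality=lambda s, l: l >= 0.98,   # placeholder quality gate
)
print(result, round(weight_bytes(PARAMS, 3.4) / 1e9, 1), "GB")
```

At 3.4 effective bits the weights come to roughly 168.7 GB, which leaves headroom inside 192 GB, whereas 4.0 bits would need about 198.5 GB for the weights alone.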

4. Experimental Results

Experiments were conducted on workstation-class hardware (RTX PRO 6000 Blackwell, DGX Spark) using Qwen3.5 and GLM-4.7-Flash models.

Qwen3.5-122B-A10B (V2 Mode):
- Achieved 138 tokens/s single-stream decode at TP=2 (tensor parallelism across two GPUs).
- 49% faster than Marlin INT4 (AutoRound) at TP=1.
- Maintained 94.49% GSM8K strict-match (3 seeds), comparable to 4-bit baselines.
- Effective bit width: $\sim 3.97$.
Qwen3.5-397B-A17B (H-Process, V2 Mode):
- Successfully deployed on 2×96 GB GPUs.
- Achieved 100.9 tokens/s on long-output decode.
- Achieved 66.72% GSM8K strict-match on the full 1,319-problem set (single seed).
- Effective bit width: $\sim 3.4$.
- Comparison with REAP (expert pruning): XFP+H retained all experts and achieved higher accuracy (66.72% vs 33–44% for pruned INT4) with a lower or comparable memory footprint.
V2a Mode Probes:
- Demonstrated a throughput gain of 47% over Marlin INT4 on Qwen3.5-122B, validating the shared-library approach for memory-constrained scenarios.

5. Significance and Claims

The paper positions XFP not merely as a quantization format, but as a mechanism for appliance-class deployment. Its significance lies in:

Inverting the Workflow: Shifting the operator's role from selecting bit-widths to defining quality constraints, allowing the system to adapt automatically to model heterogeneity.
Workstation Viability: Enabling the deployment of massive MoE models (up to 397B parameters) on consumer/workstation hardware (96 GB GPUs) without datacenter-specific Tensor Core support (NVFP4) or expert pruning.
Quality-Driven Compression: Demonstrating that "quality" (cosine similarity) is a more effective steering metric than fixed bit-widths, particularly for MoE models where different components have vastly different quantization tolerances.
Hardware Agnosticism: Providing a software-defined throughput advantage on hardware where dedicated quantization paths (NVFP4) are unavailable or incomplete, outperforming the most optimized INT4 kernels (Marlin) by reading fewer bytes per token.

The authors state explicitly that XFP's costs are per-group codebook overhead and encoding-time Lloyd iterations. They argue that this trade-off favors XFP on the workstation tier for memory-bound workloads, but they present the choice as a deployment decision rather than a universal benchmark verdict.

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference