Imagine you have a massive, incredibly detailed library (a Large Language Model) that knows everything from how to fix a car to how to write a poem. This library is so huge that it takes up an entire warehouse of space and requires a giant power plant to run.
To make this library fit into a small backpack (like a smartphone or a laptop) and run on battery power, you need to compress it. This is called Quantization. It's like taking a high-resolution 4K movie and compressing it into a smaller, lower-resolution file so it downloads faster.
However, there's a catch. When you compress these "smart" models too much, they start making silly mistakes. They might hallucinate facts or fail at simple math. This happens because of "Outliers"—weird, extreme numbers in the data that don't fit the pattern, like a single giant elephant in a room full of mice.
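Both ideas — quantization and why a single outlier breaks it — can be seen in a few lines of generic 4-bit uniform quantization. This is an illustrative sketch, not the paper's actual MXFP4 format; the function name and toy numbers are invented for the example:

```python
import numpy as np

def fake_quant_4bit(x):
    """Map floats to 16 signed 4-bit levels and back -- a generic sketch."""
    scale = np.abs(x).max() / 7.0            # the largest value sets the step size
    return np.clip(np.round(x / scale), -8, 7) * scale

mice = np.array([0.01, -0.02, 0.03, 0.02])
with_elephant = np.append(mice, 10.0)        # add one extreme outlier

clean = fake_quant_4bit(mice)                # small scale: mice are kept faithfully
dirty = fake_quant_4bit(with_elephant)       # scale = 10/7: every mouse rounds to 0
```

With the elephant present, the quantization grid is stretched to cover it, and all four small values collapse to exactly zero — the information in the "mice" is gone.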
The Problem: The "Global Rotation" Mistake
Scientists tried to fix this by using a technique called Rotation. Imagine you have a room full of people (the data) standing in a grid. Some people are huge (outliers), and most are tiny. To make the room fit better, you try to spin the whole room 45 degrees.
- The Goal: By spinning the room, you hope to spread the "huge people" out so they don't crowd one spot.
- The Failure: In the new, ultra-efficient format the paper uses (called MXFP4), spinning the whole room actually makes things worse. It accidentally drags the "huge people" from one corner of the room into a corner that was previously empty and calm. Now, that quiet corner is suddenly crowded, and the new compression format can't handle it. It's like trying to pour a bucket of water into a cup; if you tilt the bucket too far, you spill water everywhere instead of filling the cup.
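The "spilling" effect can be demonstrated with a toy Hadamard rotation — the standard tool behind rotation-based methods. The vector and the two-element block split are made up for illustration:

```python
import numpy as np

# Two 2-element "blocks": one loud (holds a 10.0 outlier), one quiet.
x = np.array([10.0, 0.01, 0.02, -0.01])

# Orthonormal 4x4 Hadamard rotation (H @ H.T == identity).
H = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]]) / 2.0

y = H @ x
# Before the rotation, the quiet block x[2:] is near zero; after it,
# every coordinate of y is roughly 5 -- the outlier has spilled into
# the block that used to be calm, forcing a large scale there too.
```

With per-block scaling (as in MX-style formats), the formerly quiet block now needs just as coarse a grid as the loud one, which is exactly the failure described above.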
The Solution: BATQuant (The "Local Fixer")
The authors of this paper, BATQuant, realized that spinning the whole room was the wrong move. Instead, they proposed a smarter, more localized approach.
Here is how BATQuant works, using simple analogies:
1. The "Block-by-Block" Strategy (No Spilling)
Instead of spinning the entire library at once, BATQuant divides the data into small, manageable blocks (like chapters in a book).
- The Old Way: If one chapter has a giant elephant, the old method tried to move that elephant to a different chapter to balance things out. This ruined the second chapter.
- The BATQuant Way: BATQuant says, "Let's keep the elephant in its own chapter." It applies a special transformation only to that specific block. It reshapes the data inside that block so the elephant fits perfectly without disturbing the mice in the next chapter. This prevents the "energy" (or data) from spilling over and ruining other parts.
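A minimal sketch of the block-wise idea, in the spirit of MXFP4's shared per-block scale (real MXFP4 uses power-of-two scales and 32-element blocks; this simplified version uses arbitrary float scales and 4-element blocks):

```python
import numpy as np

def blockwise_quant(x, block=4):
    """Quantize each block with its own 4-bit scale (simplified MX-style
    sharing; real MXFP4 restricts scales to powers of two)."""
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        scale = max(np.abs(chunk).max() / 7.0, 1e-12)
        out[i:i + block] = np.clip(np.round(chunk / scale), -8, 7) * scale
    return out

x = np.array([10.0, 0.5, -0.3, 0.2,      # block with an "elephant"
              0.02, -0.01, 0.03, 0.01])  # quiet block of "mice"
x_hat = blockwise_quant(x)
# The elephant coarsens only its own block's grid; the quiet block
# keeps a tiny scale, so the mice survive almost unchanged.
```

The elephant still costs precision inside its own block, but the damage no longer leaks into the neighbors.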
2. The "Global & Private" Toolkit (GPK)
Learning a new way to reshape every single block is expensive and takes up too much memory (like needing a unique, custom-made tool for every single book in the library).
- The Innovation: BATQuant introduces a clever trick called Global and Private Kronecker (GPK).
- The Analogy: Imagine you have a Master Toolkit (Global) that everyone shares, and then a Personal Tool (Private) for each specific book.
- The Master Toolkit handles the general shape of the data (the big picture).
- The Personal Tool handles the tiny, specific quirks of that one block.
- The Result: You get the precision of having a custom tool for every block, but you only have to store one Master Toolkit and a few small personal tools. This saves a massive amount of space and makes the system run fast.
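The memory argument behind any Kronecker factorization can be checked directly: a large transform built as the Kronecker product of two small matrices is never stored in full. This is a generic illustration — the factor sizes and the "global"/"private" labels here are assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_g, d_p = 8, 16                      # hypothetical factor sizes
G = rng.standard_normal((d_g, d_g))   # one shared ("global") factor
P = rng.standard_normal((d_p, d_p))   # one small per-block ("private") factor

x = rng.standard_normal(d_g * d_p)    # a 128-dimensional input

# Applying kron(G, P) never requires materializing the 128x128 matrix:
# with X = x reshaped row-major to (d_g, d_p),
#   kron(G, P) @ x == (G @ X @ P.T).flatten()
X = x.reshape(d_g, d_p)
y_fast = (G @ X @ P.T).flatten()      # uses only the two small factors
y_full = np.kron(G, P) @ x            # expensive reference computation

# Storage: 8*8 + 16*16 = 320 numbers instead of 128*128 = 16384.
```

The two results agree to floating-point precision, while the factored form stores about 50x fewer numbers — the "Master Toolkit plus small Personal Tool" saving in matrix form.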
3. The "Smart Clipper" (Learnable Clipping)
Sometimes, even after reshaping, there are still a few numbers that are just too big for the tiny backpack.
- The Fix: BATQuant uses a Learnable Clipper. Think of this as a smart bouncer at a club. Instead of just cutting off the tails of the crowd (which loses information), the bouncer learns exactly how big the crowd is right now and adjusts the door size dynamically. If the crowd is small, the door is small; if it's huge, the door opens just enough. This ensures no important data gets cut off, while nothing oversized breaks the door.
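The paper learns its clipping thresholds during optimization; as a stand-in, a simple grid search over clip values shows why an adaptive threshold beats keeping the full range (all names and numbers here are illustrative, not the paper's method):

```python
import numpy as np

def fake_quant(x, clip):
    """Clip to [-clip, clip], then 4-bit uniform quantize/dequantize."""
    scale = clip / 7.0
    return np.clip(np.round(np.clip(x, -clip, clip) / scale), -8, 7) * scale

def search_clip(x, fractions=np.linspace(0.1, 1.0, 50)):
    """Grid-search the clip value (as a fraction of max|x|) that minimizes
    reconstruction error -- a stand-in for a gradient-learned threshold."""
    cmax = np.abs(x).max()
    errors = [np.mean((x - fake_quant(x, f * cmax)) ** 2) for f in fractions]
    return fractions[int(np.argmin(errors))] * cmax

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[0] = 20.0                            # one extreme outlier

clip = search_clip(x)
mse_clipped = np.mean((x - fake_quant(x, clip)) ** 2)
mse_unclipped = np.mean((x - fake_quant(x, np.abs(x).max())) ** 2)
# Sacrificing the lone outlier shrinks the step size for everything
# else, so overall error drops compared with keeping the full range.
```

Deliberately clipping one extreme value hurts that value but helps the other 1023, which is why a tuned (or learned) threshold wins overall.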
Why Does This Matter?
The paper tested this on some of the smartest AI models available today (like Qwen3).
- The Result: When they tried to compress the models to a tiny 4-bit size (the "aggressive" setting), other methods failed miserably. The models became "dumb," losing up to 20-30% of their benchmark performance.
- BATQuant's Win: BATQuant kept the models almost as smart as the original giant version. On complex tasks like math and reasoning, it recovered 96% to 99% of the original performance.
The Bottom Line
BATQuant is like a master packer who knows exactly how to fold clothes.
- Old methods tried to shake the whole suitcase to fit everything in, which resulted in wrinkled clothes and broken zippers.
- BATQuant carefully folds each item (block) individually, uses a shared set of folding rules (Global) with a few custom tweaks (Private), and trims the edges perfectly (Clipping).
This allows us to run super-smart AI models on small devices without them losing their "brain," making advanced AI accessible to everyone, everywhere.