Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

This paper evaluates HiFloat formats (HiF8 and HiF4) on Ascend NPUs, demonstrating their superior performance over integer formats in handling high-variance data and preventing accuracy collapse in 4-bit regimes while maintaining compatibility with state-of-the-art quantization frameworks for efficient LLM inference.

Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Manyi Zhang, Yuanyong Luo, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong

Published 2026-03-03

Imagine you have a massive library of books (these are Large Language Models or LLMs, like the ones powering advanced AI). These books are so huge that they don't fit on a single bookshelf; they require a whole warehouse. To make them easier to carry and read quickly, we need to shrink them down. This process is called Quantization.

Think of quantization like compressing a high-resolution photo into a smaller file size. You want to make it small enough to fit in your pocket without losing the details that make the picture recognizable.

For a long time, the standard way to shrink these AI models was to use Integers (whole numbers like 1, 2, 3). It's like trying to describe a painting using only a limited palette of solid blocks of color. It works well for simple shapes, but when the painting has subtle gradients, shadows, or sudden bright flashes (what AI researchers call "outliers"), the integer method gets blurry.

The Problem: The "One-Size-Fits-All" Trap

The paper argues that as AI models become more capable, their internal data becomes more chaotic. It swings wildly: mostly very small numbers, punctuated by occasional huge spikes.

  • Integers (INT8/INT4) are like a ruler with evenly spaced marks. If you try to measure a tiny ant and a giant elephant with the same ruler, you either lose the detail of the ant or run out of room for the elephant.
  • Floating-Point (FP) formats are like a zoomable lens. They can focus on tiny details or zoom out to see huge things, but they can be inefficient if you don't need that much zoom.
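The "ruler" problem is easy to see in a few lines of code. Below is a toy illustration of plain symmetric integer quantization — not anything from the paper, and the numbers are made up — showing how one outlier forces a coarse step size that erases all the small values:

```python
import numpy as np

def quantize_int(x, bits=8):
    """Uniform 'ruler' quantization: one shared step size for all values."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for symmetric 8-bit
    scale = np.abs(x).max() / qmax              # step size set by the largest value
    q = np.round(x / scale).clip(-qmax, qmax)
    return q * scale

# A tensor of mostly tiny values ("ants") plus one outlier ("elephant")
x = np.array([0.01, 0.02, -0.015, 0.03, 50.0])

xq = quantize_int(x)
# The outlier forces a step of roughly 50/127 ≈ 0.39, so every
# small value rounds to zero: the ants vanish from the measurement.
print(xq)
```

Only the elephant survives the round trip; the four small values all collapse to zero, which is exactly the failure mode floating-point formats avoid by adapting their precision to the magnitude.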

The Solution: HiFloat (The "Smart Shrinker")

The authors from Huawei introduce a new family of formats called HiFloat (specifically HiF8 and HiF4). Think of HiFloat as a smart, adaptive suitcase designed specifically for the Huawei Ascend NPU chips (the hardware the AI runs on).

Here is how it works, broken down by the paper's findings:

1. The 8-Bit Version (HiF8): The "Specialized Tool"

When we shrink the model to 8 bits (a moderate compression):

  • Weights (The static knowledge): The paper found that for the "static" parts of the AI (the weights, which are like the facts the AI has memorized), the old-school Integer method is actually better. Why? Because the weights are usually clustered in a narrow range. Using a fancy zoom lens (floating point) here wastes precision: bits get spent covering a huge dynamic range that the narrow weight distribution never uses.
  • Activations (The dynamic thoughts): When the AI is actually thinking and processing a sentence, the numbers swing wildly. Here, HiF8 shines. Its "zoom lens" ability handles the sudden spikes in data much better than the rigid integer ruler.

Analogy: Imagine packing for a trip.

  • Weights are your clothes: They are predictable. A standard suitcase (Integer) works best.
  • Activations are your luggage during a chaotic airport rush: Things are flying everywhere. You need a flexible, expandable bag (HiF8) that can stretch to fit the chaos.

2. The 4-Bit Version (HiF4): The "Hierarchical Masterpiece"

This is the paper's biggest breakthrough. When we try to shrink the model extremely (down to 4 bits), the old methods fail spectacularly.

  • The Failure: Standard 4-bit integers are like trying to describe a complex landscape with only 16 colors. The picture turns into a muddy mess. The AI stops making sense.
  • The HiF4 Fix: HiF4 uses a three-level hierarchy. Imagine you are organizing a massive crowd of people:
    • Level 1: You group them into 64-person blocks and give the whole block a general size estimate.
    • Level 2: Inside that block, you split them into 8 smaller groups and refine the estimate.
    • Level 3: Finally, you look at 4 people at a time and give them a precise size.

This "Russian Nesting Doll" approach allows HiF4 to handle the "outliers" (the weird, huge numbers) without ruining the precision for the normal numbers. It's like having a map that shows the whole country, then zooms into the city, then zooms into the street, all at once.
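The three-level idea can be sketched in code. This is a conceptual toy, not the actual HiF4 bit layout or encoding; the group sizes (64 / 8 / 4) follow the description above, and in the real format the finer levels are stored as refinements of the coarser ones, a detail simplified away here. The sketch just shows why per-micro-group scales contain outlier damage:

```python
import numpy as np

def flat_quant(block, bits=4):
    """Baseline: one shared scale for the whole 64-element block."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for symmetric 4-bit
    scale = np.abs(block).max() / qmax
    return np.round(block / scale).clip(-qmax, qmax) * scale

def hier_quant(block, bits=4):
    """Toy hierarchical scheme: the finest level (micro-groups of 4)
    sets the actual step size. In the scheme described above, the
    64-block and 8-element sub-group levels let those fine scales be
    stored cheaply as successive refinements (omitted in this toy)."""
    qmax = 2 ** (bits - 1) - 1
    micros = block.reshape(16, 4)              # level 3: groups of 4
    l3 = np.abs(micros).max(axis=1)            # local magnitude per group
    scales = np.where(l3 > 0, l3 / qmax, 1.0)[:, None]
    q = np.round(micros / scales).clip(-qmax, qmax)
    return (q * scales).reshape(-1)

def rel_err(x, xq):
    return np.linalg.norm(x - xq) / np.linalg.norm(x)

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, 64)                # 64 "normal" values
block[7] = 3.0                                 # plus one huge outlier

err_flat = rel_err(block, flat_quant(block))   # outlier stretches all 64 steps
err_hier = rel_err(block, hier_quant(block))   # damage confined to one group of 4
print(f"flat: {err_flat:.4f}  hierarchical: {err_hier:.4f}")
```

With a single shared scale, the outlier flattens the other 63 values to zero; with per-micro-group scales, only the outlier's own 4-element group pays the price, which is the "nesting doll" effect described above.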

The Results: Why Should You Care?

The researchers tested this on real AI models (Qwen3-8B and openPangu-7B) and found:

  1. No Crash: When they tried to run the AI at 4-bit precision, the old methods (Integers) caused the AI to "collapse" and give nonsense answers. HiF4 kept the AI working almost as well as the full, uncompressed version.
  2. Speed & Efficiency: Because HiF8 and HiF4 are built specifically for Huawei's Ascend chips, they don't just save space; they run faster and use less energy.
  3. Compatibility: It plays nicely with other existing tools that help AI run efficiently.

The Bottom Line

This paper is like discovering a new, super-efficient packing technique for a moving truck.

  • If you are moving standard furniture (Weights), use the standard boxes (Integers).
  • If you are moving fragile, oddly shaped art (Activations), use the custom foam inserts (HiF8).
  • If you are trying to fit a whole mansion into a tiny van (4-bit inference), you need the HiF4 hierarchical system. It's the only way to keep everything from breaking while fitting it all in.

For the future of AI, especially on specialized hardware like Huawei's chips, HiFloat offers a way to make these giant brains smaller, faster, and cheaper to run without losing their intelligence.
