Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

This paper evaluates HiFloat formats (HiF8 and HiF4) on Ascend NPUs, demonstrating their superior performance over integer formats in handling high-variance data and preventing accuracy collapse in 4-bit regimes while maintaining compatibility with state-of-the-art quantization frameworks for efficient LLM inference.

Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Manyi Zhang, Yuanyong Luo, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong

Published 2026-03-03

Imagine you have a massive library of books (these are Large Language Models or LLMs, like the ones powering advanced AI). These books are so huge that they don't fit on a single bookshelf; they require a whole warehouse. To make them easier to carry and read quickly, we need to shrink them down. This process is called Quantization.

Think of quantization like compressing a high-resolution photo into a smaller file size. You want to make it small enough to fit in your pocket without losing the details that make the picture recognizable.

For a long time, the standard way to shrink these AI models was to use Integers (whole numbers like 1, 2, 3). It's like trying to describe a painting using only a limited palette of solid blocks of color. It works well for simple shapes, but when the painting has subtle gradients, shadows, or sudden bright flashes (what AI researchers call "outliers"), the integer method gets blurry.

The Problem: The "One-Size-Fits-All" Trap

The paper argues that as AI models become more capable, their internal data becomes more chaotic. It swings wildly: mostly very small numbers, punctuated by occasional huge spikes.

  • Integers (INT8/INT4) are like a ruler with evenly spaced marks. If you try to measure a tiny ant and a giant elephant with the same ruler, you either lose the detail of the ant or run out of room for the elephant.
  • Floating-Point (FP) formats are like a zoomable lens. They can focus on tiny details or zoom out to see huge things, but they can be inefficient if you don't need that much zoom.
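The "ruler" problem is easy to see in a few lines of code. Below is a toy illustration of plain symmetric integer quantization — not anything from the paper, and the numbers are made up — showing how one outlier forces a coarse step size that erases all the small values:

```python
import numpy as np

def quantize_int(x, bits=8):
    """Uniform 'ruler' quantization: one shared step size for all values."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for symmetric 8-bit
    scale = np.abs(x).max() / qmax              # step size set by the largest value
    q = np.round(x / scale).clip(-qmax, qmax)
    return q * scale

# A tensor of mostly tiny values ("ants") plus one outlier ("elephant")
x = np.array([0.01, 0.02, -0.015, 0.03, 50.0])

xq = quantize_int(x)
# The outlier forces a step of roughly 50/127 ≈ 0.39, so every
# small value rounds to zero: the ants vanish from the measurement.
print(xq)
```

Only the elephant survives the round trip; the four small values all collapse to zero, which is exactly the failure mode floating-point formats avoid by adapting their precision to the magnitude.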

The Solution: HiFloat (The "Smart Shrinker")

The authors from Huawei introduce a new family of formats called HiFloat (specifically HiF8 and HiF4). Think of HiFloat as a smart, adaptive suitcase designed specifically for the Huawei Ascend NPU chips (the hardware the AI runs on).

Here is how it works, broken down by the paper's findings:

1. The 8-Bit Version (HiF8): The "Specialized Tool"

When we shrink the model to 8 bits (a moderate compression):

  • Weights (The static knowledge): The paper found that for the "static" parts of the AI (the weights, which are like the facts the AI has memorized), the old-school Integer method is actually better. Why? Because the weights are usually clustered in a narrow range. Using a fancy zoom lens (floating point) here wastes precision: bits get spent covering a huge dynamic range that the narrow weight distribution never uses.
  • Activations (The dynamic thoughts): When the AI is actually thinking and processing a sentence, the numbers swing wildly. Here, HiF8 shines. Its "zoom lens" ability handles the sudden spikes in data much better than the rigid integer ruler.

Analogy: Imagine packing for a trip.

  • Weights are your clothes: They are predictable. A standard suitcase (Integer) works best.
  • Activations are your luggage during a chaotic airport rush: Things are flying everywhere. You need a flexible, expandable bag (HiF8) that can stretch to fit the chaos.

2. The 4-Bit Version (HiF4): The "Hierarchical Masterpiece"

This is the paper's biggest breakthrough. When we try to shrink the model extremely (down to 4 bits), the old methods fail spectacularly.

  • The Failure: Standard 4-bit integers are like trying to describe a complex landscape with only 16 colors. The picture turns into a muddy mess. The AI stops making sense.
  • The HiF4 Fix: HiF4 uses a three-level hierarchy. Imagine you are organizing a massive crowd of people:
    • Level 1: You group them into 64-person blocks and give the whole block a general size estimate.
    • Level 2: Inside that block, you split them into 8 smaller groups and refine the estimate.
    • Level 3: Finally, you look at 4 people at a time and give them a precise size.

This "Russian Nesting Doll" approach allows HiF4 to handle the "outliers" (the weird, huge numbers) without ruining the precision for the normal numbers. It's like having a map that shows the whole country, then zooms into the city, then zooms into the street, all at once.
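The three-level idea can be sketched in code. This is a conceptual toy, not the actual HiF4 bit layout or encoding; the group sizes (64 / 8 / 4) follow the description above, and in the real format the finer levels are stored as refinements of the coarser ones, a detail simplified away here. The sketch just shows why per-micro-group scales contain outlier damage:

```python
import numpy as np

def flat_quant(block, bits=4):
    """Baseline: one shared scale for the whole 64-element block."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for symmetric 4-bit
    scale = np.abs(block).max() / qmax
    return np.round(block / scale).clip(-qmax, qmax) * scale

def hier_quant(block, bits=4):
    """Toy hierarchical scheme: the finest level (micro-groups of 4)
    sets the actual step size. In the scheme described above, the
    64-block and 8-element sub-group levels let those fine scales be
    stored cheaply as successive refinements (omitted in this toy)."""
    qmax = 2 ** (bits - 1) - 1
    micros = block.reshape(16, 4)              # level 3: groups of 4
    l3 = np.abs(micros).max(axis=1)            # local magnitude per group
    scales = np.where(l3 > 0, l3 / qmax, 1.0)[:, None]
    q = np.round(micros / scales).clip(-qmax, qmax)
    return (q * scales).reshape(-1)

def rel_err(x, xq):
    return np.linalg.norm(x - xq) / np.linalg.norm(x)

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, 64)                # 64 "normal" values
block[7] = 3.0                                 # plus one huge outlier

err_flat = rel_err(block, flat_quant(block))   # outlier stretches all 64 steps
err_hier = rel_err(block, hier_quant(block))   # damage confined to one group of 4
print(f"flat: {err_flat:.4f}  hierarchical: {err_hier:.4f}")
```

With a single shared scale, the outlier flattens the other 63 values to zero; with per-micro-group scales, only the outlier's own 4-element group pays the price, which is the "nesting doll" effect described above.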

The Results: Why Should You Care?

The researchers tested this on real AI models (Qwen3-8B and openPangu-7B) and found:

  1. No Crash: When they tried to run the AI at 4-bit precision, the old methods (Integers) caused the AI to "collapse" and give nonsense answers. HiF4 kept the AI working almost as well as the full, uncompressed version.
  2. Speed & Efficiency: Because HiF8 and HiF4 are built specifically for Huawei's Ascend chips, they don't just save space; they run faster and use less energy.
  3. Compatibility: It plays nicely with other existing tools that help AI run efficiently.

The Bottom Line

This paper is like discovering a new, super-efficient packing technique for a moving truck.

  • If you are moving standard furniture (Weights), use the standard boxes (Integers).
  • If you are moving fragile, oddly shaped art (Activations), use the custom foam inserts (HiF8).
  • If you are trying to fit a whole mansion into a tiny van (4-bit inference), you need the HiF4 hierarchical system. It's the only way to keep everything from breaking while fitting it all in.

For the future of AI, especially on specialized hardware like Huawei's chips, HiFloat offers a way to make these giant brains smaller, faster, and cheaper to run without losing their intelligence.
