HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

HEAPr is a novel pruning algorithm for Mixture-of-Experts models that decomposes experts into atomic units and leverages second-order output information to achieve nearly lossless compression with reduced computational complexity, significantly outperforming existing expert-level pruning methods.

Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

Published 2026-03-03

Imagine you have a massive, super-smart library of experts (a "Mixture of Experts" or MoE model) that helps a computer think and write. This library is huge. It has thousands of specialized experts, but every time the computer needs to answer a question, it only calls upon a few of them.

The Problem:
Even though the computer only uses a few experts at a time, it has to keep all of them in its memory (like keeping every single book on a shelf, even if you only read one). This makes the library so heavy and expensive to run that it's hard to put on regular devices like phones or laptops.

The Old Way (Expert Pruning):
Previously, people tried to shrink this library by throwing out entire experts. Imagine looking at a team of 100 chefs and saying, "Chef #42 is bad, fire him!" You remove the whole person.

  • The downside: It's too blunt. You might fire a chef who is great at making desserts but terrible at soups. If you remove them entirely, you lose that specific skill, and the library's quality drops.

The New Solution: HEAPr (The "Atomic" Approach)
The authors of HEAPr realized that an "Expert" isn't one indivisible person. It's actually a team of tiny, atomic specialists working together.

Think of an Expert like a Swiss Army Knife.

  • Old Method: If you want to make the knife lighter, you throw away the whole knife because the screwdriver part is rarely used. Now you have no knife at all.
  • HEAPr Method: You realize the knife is made of separate, tiny blades. You can carefully unscrew and remove just the screwdriver blade and the can-opener blade, while keeping the knife blade and scissors. The tool is lighter, but it still cuts perfectly.
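To make the Swiss Army Knife picture concrete, here is a minimal NumPy sketch. It assumes a gated feed-forward expert (the kind commonly used in MoE models); the function names, weight names, and sizes are hypothetical and are not taken from the paper. It shows that the expert's output is just the sum of its "blades" (one per hidden unit), so removing a blade means dropping one row from two matrices and one column from the third.

```python
import numpy as np

def silu(z):
    # SiLU activation, assumed here as the expert's gate nonlinearity.
    return z / (1.0 + np.exp(-z))

def expert_forward(x, w_gate, w_up, w_down):
    # x: (d_model,), w_gate and w_up: (d_ff, d_model), w_down: (d_model, d_ff)
    hidden = silu(w_gate @ x) * (w_up @ x)
    return w_down @ hidden

def atomic_outputs(x, w_gate, w_up, w_down):
    # Column i of the result is the contribution of atomic unit i:
    # w_down[:, i] scaled by the i-th hidden activation.
    hidden = silu(w_gate @ x) * (w_up @ x)
    return w_down * hidden

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=d_model)
w_gate = rng.normal(size=(d_ff, d_model))
w_up = rng.normal(size=(d_ff, d_model))
w_down = rng.normal(size=(d_model, d_ff))

full = expert_forward(x, w_gate, w_up, w_down)
parts = atomic_outputs(x, w_gate, w_up, w_down)

# Remove two "blades" (units 1 and 5) and keep the rest.
keep = np.array([i for i in range(d_ff) if i not in (1, 5)])
pruned = expert_forward(x, w_gate[keep], w_up[keep], w_down[:, keep])
```

In this sketch, `parts.sum(axis=1)` reproduces `full`, and `pruned` equals the sum of the kept columns of `parts`. The expert survives with fewer blades instead of being thrown away whole.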

How Does HEAPr Know What to Cut? (The "Brain Surgeon" Analogy)

To decide which tiny blades to remove without hurting the knife's performance, HEAPr uses a concept from the 1990s called Optimal Brain Surgeon.

Imagine a master surgeon trying to remove a tiny, useless nerve from a patient's brain without causing any damage.

  1. The Challenge: You can't just guess. You need to know exactly how much the brain's function will change if you cut a specific nerve.
  2. The Math Problem: Usually, calculating this "damage" requires a massive, complex map of the entire brain (mathematically, this is called a "Hessian matrix"). For a giant AI model, this map is so huge it would crash your computer's memory.
  3. HEAPr's Trick:
    • Step 1: They realized that the tiny "atomic" parts of an expert don't really talk to each other. They are independent. So, you don't need a map of the whole brain; you just need a tiny map for each specific nerve.
    • Step 2: Instead of looking at the nerve itself (the parameters), they look at what the nerve produces (the output). It's like judging a worker not by how they sit at their desk, but by the quality of the box they pack.
    • The Result: This simplifies the math so much that the computer can calculate the "importance" of every single tiny blade in a flash, using just a few test questions.
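The scoring idea in the steps above can be sketched roughly as follows. This is a simplified illustration, not the paper's exact formula. It assumes the output-space "map" (the Hessian) is approximated by the outer product of the loss gradient at the expert's output, so the estimated damage from removing one atomic unit becomes half the squared dot product between that gradient and the unit's output, averaged over a few calibration samples. The function names, array shapes, and this approximation are all assumptions made for illustration.

```python
import numpy as np

def output_space_scores(atom_out, grads):
    # atom_out: (n_samples, d_model, n_units), output of each atomic unit per sample
    # grads:    (n_samples, d_model), loss gradient at the expert's output
    # Removing unit k changes the output by -atom_out[n, :, k]. With the
    # assumed approximation H ~ g g^T, the second-order damage estimate is
    # 0.5 * (g . atom_out[n, :, k])**2, averaged over samples.
    proj = np.einsum("nd,ndk->nk", grads, atom_out)
    return 0.5 * (proj ** 2).mean(axis=0)

def units_to_prune(scores, n_prune):
    # Indices of the n_prune lowest-scoring (least important) units.
    return np.argsort(scores)[:n_prune]

# Tiny hand-made example: 2 calibration samples, 2 output dims, 3 atomic units.
grads = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
atom_out = np.array([
    [[1.0, 0.0, 2.0],
     [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0],
     [1.0, 0.0, 2.0]],
])

scores = output_space_scores(atom_out, grads)
pruned_units = units_to_prune(scores, n_prune=1)
```

Here the middle unit never contributes to the output, so it gets a score of zero and is the first to be removed. Each unit is scored separately, so no giant whole-model map is ever built.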

The Results: A Lighter Library, Same Quality

The team tested this on some of the smartest AI models in the world (like Qwen and DeepSeek).

  • The Magic Zone: They were able to cut 20% to 25% of the model's size (removing those useless tiny blades) with almost no drop in the AI's performance. The compression was nearly "lossless."
  • The Speed: Because they removed so much weight, the AI runs about 20% faster and uses less energy.
  • Comparison: Other methods that tried to cut whole "Experts" or merge them together often made the AI dumber. HEAPr kept the AI just as smart but much lighter.

Summary

HEAPr is like a master sculptor who doesn't just smash chunks off a statue (the old method). Instead, the sculptor carefully chips away the tiny, unnecessary bits from the surface, making the statue lighter and easier to carry while keeping the beautiful face and hands intact. It allows us to run super-smart AI on smaller devices while losing almost none of its genius.
