HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Imagine you have a massive, super-smart library of experts (a "Mixture of Experts" or MoE model) that helps a computer think and write. This library is huge. It has thousands of specialized experts, but every time the computer needs to answer a question, it only calls upon a few of them.

The Problem:
Even though the computer only uses a few experts at a time, it has to keep all of them in its memory (like keeping every single book on a shelf, even if you only read one). This makes the library so heavy and expensive to run that it's hard to put on regular devices like phones or laptops.

The Old Way (Expert Pruning):
Previously, people tried to shrink this library by throwing out entire experts. Imagine looking at a team of 100 chefs and saying, "Chef #42 is bad, fire him!" You remove the whole person.

The downside: It's too blunt. You might fire a chef who is great at making desserts but terrible at soups. If you remove them entirely, you lose that specific skill, and the library's quality drops.

The New Solution: HEAPr (The "Atomic" Approach)
The authors of this paper, HEAPr, realized that an "Expert" isn't just one indivisible person. It's actually a team of tiny, atomic specialists working together.

Think of an Expert like a Swiss Army Knife.

Old Method: If you want to make the knife lighter, you throw away the whole knife because the screwdriver part is rarely used. Now you have no knife at all.
HEAPr Method: You realize the knife is made of separate, tiny blades. You can carefully unscrew and remove just the screwdriver blade and the can-opener blade, while keeping the knife blade and scissors. The tool is lighter, but it still cuts perfectly.

How Does HEAPr Know What to Cut? (The "Brain Surgeon" Analogy)

To decide which tiny blades to remove without hurting the knife's performance, HEAPr uses a concept from the 1990s called Optimal Brain Surgeon.

Imagine a master surgeon trying to remove a tiny, useless nerve from a patient's brain without causing any damage.

The Challenge: You can't just guess. You need to know exactly how much the brain's function will change if you cut a specific nerve.
The Math Problem: Usually, calculating this "damage" requires a massive, complex map of the entire brain (mathematically, this is called a "Hessian matrix"). For a giant AI model, this map is so huge it would crash your computer's memory.
HEAPr's Trick:
- Step 1: They realized that the tiny "atomic" parts of an expert don't really talk to each other. They are independent. So, you don't need a map of the whole brain; you just need a tiny map for each specific nerve.
- Step 2: Instead of looking at the nerve itself (the parameters), they look at what the nerve produces (the output). It's like judging a worker not by how they sit at their desk, but by the quality of the box they pack.
- The Result: This simplifies the math so much that the computer can calculate the "importance" of every single tiny blade in a flash, using just a few test questions.

The Results: A Lighter Library, Same Quality

The team tested this on some of the smartest AI models in the world (like Qwen and DeepSeek).

The Magic Zone: They were able to cut 20% to 25% of the model's size (removing those useless tiny blades) while the AI's performance barely changed. It was nearly "lossless."
The Speed: Because they removed so much weight, the AI needs about 20% less computation to produce an answer, which makes it cheaper to run.
Comparison: Other methods that tried to cut whole "Experts" or merge them together often made the AI dumber. HEAPr kept the AI just as smart but much lighter.

Summary

HEAPr is like a master sculptor who doesn't just smash chunks off a statue (old method). Instead, they carefully chip away the tiny, unnecessary dust from the surface, making the statue lighter and easier to carry, while keeping the beautiful face and hands intact. It helps us run super-smart AI on smaller devices while losing almost none of its genius.

1. Problem Statement

Context: Mixture-of-Experts (MoE) models have become the standard for scaling Large Language Models (LLMs) due to their ability to match dense model performance while activating only a fraction of parameters during inference. However, despite sparse activation, the total parameter count (including inactive experts) must be stored in GPU memory, creating a prohibitive memory bottleneck for deployment.

Limitations of Existing Methods:

Coarse Granularity: Existing pruning methods primarily operate at the expert level (removing entire experts). This coarse granularity often leads to significant accuracy degradation because it discards valuable, complementary knowledge contained within specific parts of an expert.
Fine-Grained Trade-offs: While fine-grained pruning (e.g., weight sparsification) preserves accuracy, it often fails to provide hardware acceleration benefits due to irregular memory access patterns.
Second-Order Complexity: Methods based on second-order information (like Optimal Brain Surgeon) are theoretically sound for identifying important parameters but are computationally infeasible for MoE models due to the massive size of the Hessian matrix ($O((3d_{model} \cdot d_{inter})^2)$).
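
To give a feel for the scale of that Hessian, the short calculation below counts its entries for a single expert. The dimensions `d_model = 2048` and `d_inter = 1408` are hypothetical values chosen for illustration only; they are not taken from the paper.

```python
# Hypothetical expert dimensions (assumptions for illustration only).
d_model = 2048
d_inter = 1408

# An expert holds three weight matrices (W_up, W_gate, W_down),
# each with d_model * d_inter entries.
params_per_expert = 3 * d_model * d_inter

# A dense parameter-space Hessian has one entry per pair of parameters.
hessian_entries = params_per_expert ** 2
hessian_bytes_fp32 = 4 * hessian_entries  # 4 bytes per float32 entry

print(f"parameters per expert: {params_per_expert:,}")
print(f"Hessian entries:       {hessian_entries:.3e}")
print(f"Hessian size (fp32):   {hessian_bytes_fp32 / 2**40:.1f} TiB")
```

Even under these modest assumed sizes, the dense Hessian of one expert would occupy hundreds of tebibytes, which is why a parameter-space second-order method is impractical here.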

Goal: Develop a pruning method that offers fine-grained flexibility (to preserve accuracy) while enabling direct hardware acceleration (by reducing model dimensions), all without incurring prohibitive computational costs during the pruning process.

2. Methodology: HEAPr

The authors propose HEAPr (Hessian-based Efficient Atomic Expert Pruning in Output Space), a framework that decomposes experts into smaller, indivisible units called Atomic Experts and uses a highly optimized second-order approximation to rank them.

A. Atomic Expert Decomposition

Instead of treating an expert $E_i$ as a monolithic block, the authors decompose it into $d_{inter}$ atomic experts.

An expert consists of $W_{up}$, $W_{gate}$, and $W_{down}$.
The $j$-th atomic expert corresponds to the $j$-th column of $W_{up}$ and $W_{gate}$, and the $j$-th row of $W_{down}$.
The output of a full expert is the sum of its atomic experts: $E_i(x) = \sum_{j=1}^{d_{inter}} e^{(j)}_i(x)$.
Benefit: Pruning an atomic expert removes a specific dimension from the intermediate representation, directly reducing FLOPs and memory, unlike removing a whole expert which might leave the remaining structure inefficient.
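
The decomposition above can be checked numerically. The following is a minimal NumPy sketch, assuming a gated expert of the form `(act(x W_gate) * (x W_up)) W_down` with a SiLU activation; the activation choice, the function names, and the toy sizes are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def silu(z):
    """SiLU activation (an assumption; the summary does not name the activation)."""
    return z / (1.0 + np.exp(-z))

def expert_forward(x, W_gate, W_up, W_down):
    """Full gated expert applied to a batch of tokens x of shape (n_tokens, d_model)."""
    hidden = silu(x @ W_gate) * (x @ W_up)   # (n_tokens, d_inter)
    return hidden @ W_down                   # (n_tokens, d_model)

def atomic_expert_forward(x, W_gate, W_up, W_down, j):
    """j-th atomic expert: column j of W_gate and W_up, row j of W_down."""
    h_j = silu(x @ W_gate[:, j]) * (x @ W_up[:, j])  # (n_tokens,)
    return np.outer(h_j, W_down[j, :])               # (n_tokens, d_model)

# Toy sizes, chosen only to keep the example fast.
rng = np.random.default_rng(0)
n_tokens, d_model, d_inter = 5, 8, 16
x = rng.normal(size=(n_tokens, d_model))
W_gate = rng.normal(size=(d_model, d_inter))
W_up = rng.normal(size=(d_model, d_inter))
W_down = rng.normal(size=(d_inter, d_model))

full = expert_forward(x, W_gate, W_up, W_down)
summed = sum(atomic_expert_forward(x, W_gate, W_up, W_down, j)
             for j in range(d_inter))

# The expert output should equal the sum of its atomic experts.
print(np.allclose(full, summed))
```

Because each atomic expert touches only one column of `W_gate` and `W_up` and one row of `W_down`, dropping it amounts to deleting that column and row, which shrinks the matrices rather than leaving zeros behind.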

B. Theoretical Foundation: Output-Space Hessian Approximation

To determine which atomic experts to prune, HEAPr adapts the Optimal Brain Surgeon (OBS) theory, which selects the parameters whose removal causes the smallest increase in loss ($\Delta \mathcal{L}$).

Decoupling Property: The authors observe that parameters of different atomic experts within the same expert are decoupled. The cross-Hessian terms between different atomic experts are zero. This reduces the complexity of the Hessian from a dense block to a block-diagonal structure.
Shift to Output Space: Computing the Hessian with respect to parameters is still too expensive. The authors reformulate the problem:
- Instead of constraining the parameters to zero, they constrain the output of the atomic expert to zero for a given input token.
- Using a first-order Taylor expansion of the atomic expert function, the parameter perturbation $\delta \Theta$ is related to the output perturbation via the Jacobian.
Fisher Information Approximation:
- The expected Hessian is approximated by the Fisher Information Matrix (FIM).
- Crucially, because the expert output is the sum of its atomic outputs, the gradient of the loss with respect to the output of every atomic experts within a single expert is identical ($\frac{\partial \mathcal{L}}{\partial e^{(j)}} = \frac{\partial \mathcal{L}}{\partial E}$ for all $j$).
- This allows the computation of a single shared gradient covariance matrix per expert, rather than one per atomic expert.
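
As a rough illustration of how these observations might combine, here is a sketch of the output-space argument. It is a simplified reconstruction under stated assumptions, not the paper's full derivation: $g$ denotes the shared gradient $\frac{\partial \mathcal{L}}{\partial E}$, $H_E$ is a symbol introduced here for the Hessian of the loss with respect to the expert output, and the first-order term is assumed negligible for a trained model, as in classical OBS. Pruning an atomic expert $e_P$ forces its output to zero, so the expert output is perturbed by $\delta e = -e_P(x)$, and a second-order expansion of the loss gives:

$\Delta \mathcal{L} \approx g^\top \delta e + \frac{1}{2} \delta e^\top H_E \, \delta e \approx \frac{1}{2} e_P(x)^\top H_E \, e_P(x)$

Replacing $H_E$ with the Fisher approximation $\bar{G} = \mathbb{E}\left[ g g^\top \right]$ and averaging over calibration tokens yields:

$\Delta \mathcal{L} \approx \mathbb{E}_{x \sim D} \left[ \frac{1}{2} e_P(x)^\top \bar{G} \, e_P(x) \right]$

This has the same form as the importance score given in Section C.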

C. The Importance Metric

The importance score $s$ for an atomic expert $e_P$ is derived as:
$s \approx \mathbb{E}_{x \sim D} \left[ \frac{1}{2} e_P(x)^\top \bar{G} e_P(x) \right]$
Where:

$e_P(x)$ is the output of the atomic expert.
$\bar{G}$ is the shared gradient covariance matrix (FIM) for the parent expert.
A lower score indicates the atomic expert contributes less to the loss and is a candidate for pruning.
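
The score can be computed with a few array operations. The sketch below is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, random arrays stand in for real activations and gradients, and averaging $\bar{G}$ over tokens is an assumed normalization. It also shows a factored form that follows algebraically if each atomic output is written as a scalar activation `h_j(x)` times row `j` of `W_down`; whether the paper computes it this way is not stated.

```python
import numpy as np

def shared_gradient_covariance(output_grads):
    """G_bar = E[g g^T] from per-token gradients dL/dE of shape (n_tokens, d_model)."""
    n_tokens = output_grads.shape[0]
    return output_grads.T @ output_grads / n_tokens  # (d_model, d_model)

def atomic_importance(atomic_outputs, G_bar):
    """s_j = E_x[0.5 * e_j(x)^T G_bar e_j(x)].

    atomic_outputs has shape (n_tokens, d_inter, d_model).
    """
    quad = np.einsum("njd,de,nje->nj", atomic_outputs, G_bar, atomic_outputs)
    return 0.5 * quad.mean(axis=0)  # (d_inter,)

def atomic_importance_factored(h, W_down, G_bar):
    """Same score when e_j(x) = h_j(x) * W_down[j]:
    s_j = 0.5 * E[h_j(x)^2] * (W_down[j]^T G_bar W_down[j]).
    Avoids materializing the (n_tokens, d_inter, d_model) tensor.
    """
    row_energy = np.einsum("jd,de,je->j", W_down, G_bar, W_down)  # (d_inter,)
    return 0.5 * (h ** 2).mean(axis=0) * row_energy

# Random stand-ins for calibration data (assumptions for illustration).
rng = np.random.default_rng(0)
n_tokens, d_model, d_inter = 32, 8, 6
h = rng.normal(size=(n_tokens, d_inter))        # intermediate activations
W_down = rng.normal(size=(d_inter, d_model))    # down-projection rows
grads = rng.normal(size=(n_tokens, d_model))    # dL/dE per token

atomic_outputs = h[:, :, None] * W_down[None, :, :]
G_bar = shared_gradient_covariance(grads)
scores = atomic_importance(atomic_outputs, G_bar)
scores_factored = atomic_importance_factored(h, W_down, G_bar)

print(np.argsort(scores))  # atomic experts ordered from least to most important
```

Note that only one `d_model x d_model` matrix is stored per expert, regardless of how many atomic experts it contains.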

D. Algorithm Efficiency

Complexity Reduction: The space complexity for second-order information is reduced from $O((3d_{model} \cdot d_{inter})^2)$ to $O(d_{model}^2)$ per expert.
Computation Cost: Requires only two forward passes and one backward pass on a small calibration set.
Global Ranking: Unlike layer-wise pruning, HEAPr performs a global ranking of all atomic experts across the entire model, ensuring the most critical units are preserved regardless of their layer.
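
The global ranking step can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the dictionary keys, and the toy scores are hypothetical, and the paper's actual selection procedure may differ in details such as tie-breaking. All scores are pooled into one list, the lowest-scoring fraction is marked for removal, and each expert's matrices are then sliced to the surviving atomic experts.

```python
import numpy as np

def global_keep_masks(scores_per_expert, prune_ratio):
    """Pool scores from all experts and prune the globally lowest fraction.

    scores_per_expert: dict mapping an expert name to a (d_inter,) score array.
    Returns a dict mapping each name to a boolean keep mask.
    """
    all_scores = np.concatenate(list(scores_per_expert.values()))
    n_prune = int(prune_ratio * all_scores.size)

    # Mark the n_prune smallest scores across the whole model.
    order = np.argsort(all_scores, kind="stable")
    prune_flat = np.zeros(all_scores.size, dtype=bool)
    prune_flat[order[:n_prune]] = True

    # Split the flat mask back into per-expert masks (dicts keep insertion order).
    masks, offset = {}, 0
    for name, s in scores_per_expert.items():
        masks[name] = ~prune_flat[offset:offset + s.size]
        offset += s.size
    return masks

def slice_expert(W_gate, W_up, W_down, keep):
    """Physically remove pruned atomic experts from the weight matrices."""
    return W_gate[:, keep], W_up[:, keep], W_down[keep, :]

# Toy scores for two experts in different layers (hypothetical values).
scores = {
    "layer0.expert0": np.array([0.1, 5.0, 0.2, 3.0]),
    "layer1.expert0": np.array([4.0, 0.05, 6.0, 7.0]),
}
masks = global_keep_masks(scores, prune_ratio=0.375)  # prune 3 of 8

rng = np.random.default_rng(0)
d_model = 8
W_gate = rng.normal(size=(d_model, 4))
W_up = rng.normal(size=(d_model, 4))
W_down = rng.normal(size=(4, d_model))
W_gate_p, W_up_p, W_down_p = slice_expert(W_gate, W_up, W_down, masks["layer0.expert0"])

print({name: int(m.sum()) for name, m in masks.items()})
print(W_gate_p.shape, W_up_p.shape, W_down_p.shape)
```

In this toy case the first expert loses two atomic experts and the second loses one, which illustrates how a global ranking can prune experts unevenly, whereas a layer-wise scheme would remove the same fraction from each layer.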

3. Key Contributions

Atomic Expert Concept: Introduced a novel decomposition of MoE experts into indivisible atomic units, enabling a pruning granularity that balances accuracy preservation with hardware acceleration.
Efficient Second-Order Approximation: Developed a method to transform second-order information from the parameter space to the output space. By leveraging the shared gradient property of atomic experts within an expert, they reduced the space complexity of Hessian estimation from $O((3d_{model} \cdot d_{inter})^2)$, roughly $O(d^4)$, to $O(d_{model}^2)$.
HEAPr Algorithm: Proposed a scalable pruning algorithm that requires minimal calibration data and computational overhead (2 forward passes, 1 backward pass) to rank and prune atomic experts globally across the model.
State-of-the-Art Performance: Demonstrated that HEAPr outperforms existing expert-level pruning (dropping/merging) and decomposition methods across diverse MoE architectures.

4. Experimental Results

The method was evaluated on DeepSeekMoE-16B, Qwen1.5-MoE, Qwen2-57B-A14B, and Qwen3-30B-A3B across seven zero-shot benchmarks (e.g., MMLU, ARC, HellaSwag).

Near-Lossless Compression:
- 20%–25% Pruning: Achieved performance nearly identical to the original models on DeepSeekMoE-16B and Qwen1.5-MoE.
- 40% Pruning: On Qwen2-57B-A14B, performance remained almost identical to the original model.
- Qwen3-30B-A3B: At 25% pruning, average accuracy dropped by only 0.03.
Efficiency Gains: Achieved nearly 20% reduction in FLOPs at 20–25% pruning ratios, a significant improvement over expert-level pruning which often yields minimal FLOPs reduction due to hardware constraints.
Comparison with SOTA: Outperformed methods like NAEE, MoE-I2, D2-MoE, and Sub-MoE.
Comparison with CAMERA-P: HEAPr outperformed the concurrent work CAMERA-P (which uses decoding-time energy) by 1.2% average accuracy at 20% pruning, attributed to HEAPr's global, second-order importance metric versus CAMERA-P's local, heuristic approach.
Ablation Studies:
- Global vs. Layer-wise: Global pruning (HEAPr-G) consistently outperformed layer-wise pruning (HEAPr-L), validating the metric's consistency across layers.
- Granularity: Atomic-level pruning significantly outperformed expert-level pruning in both accuracy and FLOPs reduction.
- Robustness: Performance was stable across different calibration datasets (WikiText-2 vs. C4) and sample sizes.

5. Significance

Bridging the Gap: HEAPr bridges the gap between fine-grained accuracy preservation and coarse-grained hardware acceleration. It shows that MoE models can be compressed significantly without the "all-or-nothing" trade-off of expert dropping.
Scalability: By reducing the space complexity of the second-order information to $O(d_{model}^2)$ per expert, the paper makes Hessian-based pruning feasible for massive LLMs, a task previously considered computationally prohibitive.
Deployment Impact: The method enables the deployment of large MoE models on resource-constrained devices (e.g., edge GPUs) by reducing memory footprint and inference costs without requiring retraining or complex distillation.
Theoretical Insight: The work provides a deeper understanding of MoE redundancy, showing that "atomic" components within experts are often the primary source of redundancy, rather than the experts themselves.
