Here is an explanation of the paper "REAP the Experts" in simple, everyday language, using analogies to make the concepts clear.
The Big Picture: The "Too Many Chefs" Problem
Imagine you have a massive, super-smart kitchen (a Large Language Model) designed to write code, tell stories, and solve math problems. To make this kitchen incredibly powerful, the architects built it with Mixture-of-Experts (MoE).
Think of this kitchen as having hundreds of specialized chefs (the "experts").
- One chef is amazing at baking.
- Another is a master at grilling.
- Another is a genius at chopping vegetables.
There is also a Head Chef (the "Router"). When you order a meal, the Head Chef doesn't ask all the chefs to cook. Instead, they look at the order and say, "Okay, for this steak, I need the Griller and the Sauce Chef. Ignore the Bakers." This makes the kitchen fast and efficient because only a few chefs are working at any given time.
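The Head Chef's selection step can be sketched in a few lines of code. This is a minimal, generic illustration of top-k routing; the expert count, the top-k value, and the dimensions are made-up numbers for the sketch, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # the "chefs" (illustrative count)
TOP_K = 2         # how many chefs cook each order
HIDDEN = 16       # size of each incoming "order" (token vector)

router_weights = rng.normal(size=(HIDDEN, NUM_EXPERTS))
token = rng.normal(size=HIDDEN)  # one incoming order

# The Head Chef scores every expert for this token...
logits = token @ router_weights
# ...but only the top-k highest-scoring experts actually cook.
top_k_idx = np.argsort(logits)[-TOP_K:]

# Gate values: a softmax over just the chosen experts' scores,
# telling us how much weight each chosen chef's dish gets.
gates = np.exp(logits[top_k_idx] - logits[top_k_idx].max())
gates /= gates.sum()

print("experts chosen:", top_k_idx, "gate values:", gates)
```

Every token gets its own top-k selection, so different orders activate different chefs; that sparsity is what makes the kitchen cheap to run per meal.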
The Problem: Even though only a few chefs work at once, you still have to pay for, house, and feed all of them. The kitchen is huge, expensive, and takes up too much memory. You want to shrink the kitchen to fit in a smaller space (like a home computer or a phone) without losing the quality of the food.
The Two Ways to Shrink the Kitchen
The paper compares two ways to cut the kitchen down by 50%:
1. The "Smoothie" Method (Expert Merging)
This is what previous researchers tried. They decided to take two chefs who seemed similar (say, a Griller and a BBQ Chef) and blend them into one new "Super-Chef."
- How it works: They mix their recipes, average their skills, and hope the new chef can do both jobs well.
- The Flaw: The paper argues this is like making a smoothie. You lose the distinct flavor of the individual ingredients.
- If the Head Chef needs pure grilling skills, the "Super-Chef" might be a bit too focused on BBQ sauce.
- The Head Chef loses the ability to say, "I need exactly the Griller right now." They are forced to use the blended version.
- Result: For simple multiple-choice questions (like "Is this steak done?"), the smoothie works fine. But for open-ended, generative tasks (like writing a novel or debugging complex code), the smoothie tastes "meh." The kitchen loses its nuance.
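A toy sketch of the blending step, assuming the simplest possible recipe: averaging two experts' weight matrices into one. Real merging methods are more elaborate, but the "stuck with the average" effect is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN = 16  # illustrative dimension

expert_a = rng.normal(size=(HIDDEN, HIDDEN))  # the Griller
expert_b = rng.normal(size=(HIDDEN, HIDDEN))  # the BBQ Chef

# The "Super-Chef" smoothie: a plain average of the two weight matrices.
merged = 0.5 * (expert_a + expert_b)

token = rng.normal(size=HIDDEN)
out_a = token @ expert_a        # what the pure Griller would have cooked
out_merged = token @ merged     # what the Super-Chef actually cooks

# The merged expert's output drifts away from the pure expert's output:
# the router can never recover expert_a's exact behavior again.
print("drift from pure expert:", np.linalg.norm(out_a - out_merged))
```

The drift is the smoothie problem in one number: whenever the Head Chef wanted exactly `expert_a`, they now get something in between.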
2. The "Firing" Method (Expert Pruning)
This is the method the paper proposes, called REAP (Router-weighted Expert Activation Pruning). Instead of blending chefs, they simply fire the ones who contribute the least and let the remaining chefs keep their unique skills.
- The Old Way of Firing: Just fire the chefs who show up the least often.
- The REAP Way: This is the paper's innovation. It doesn't just count how often a chef works; it weighs how much each chef's cooking actually matters on the occasions they do show up.
- Analogy: Imagine a chef who only works on Tuesdays. If they show up, they are the only one who can make the perfect soufflé. Even though they work rarely, they are vital.
- REAP looks at the "Head Chef's notes" (the router's gate values) and the "quality of the dish" (the size of each expert's output, its activation norm) to decide whom to keep. It fires the chefs who are either rarely called or whose dishes aren't that special when they are called.
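The scoring idea can be sketched as follows. This is a simplified illustration of a REAP-style criterion — gate value times output norm, averaged over the tokens each expert actually serves — and the averaging and norm choices here are assumptions for the sketch, not the paper's exact formula:

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_EXPERTS, HIDDEN, TOKENS, KEEP = 8, 16, 200, 4  # illustrative sizes

experts = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN))
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))
tokens = rng.normal(size=(TOKENS, HIDDEN))  # a calibration set of "orders"

saliency = np.zeros(NUM_EXPERTS)
counts = np.zeros(NUM_EXPERTS)
for x in tokens:
    logits = x @ router
    top2 = np.argsort(logits)[-2:]                  # top-2 routing
    gates = np.exp(logits[top2] - logits[top2].max())
    gates /= gates.sum()
    for g, e in zip(gates, top2):
        # "Head Chef's notes" (gate) x "quality of the dish" (output norm)
        saliency[e] += g * np.linalg.norm(x @ experts[e])
        counts[e] += 1

# Average over the tokens each expert served: the Tuesday soufflé chef
# who is rarely called but vital still earns a high score.
saliency /= np.maximum(counts, 1)

keep = np.argsort(saliency)[-KEEP:]  # fire the lowest-saliency experts
print("experts kept:", sorted(keep))
```

Note the division by `counts`: a pure frequency count (the "old way of firing") would penalize the rarely-called specialist, while this per-call average protects them.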
Why Pruning Wins (The "Functional Collapse" Analogy)
The paper uses a cool visual analogy involving a dance floor.
- The Original Kitchen: Imagine 100 dancers (experts) moving in a large, complex pattern. Each dancer has a unique spot and style.
- Merging (The Smoothie): When you merge dancers, you force them to stand in the middle of the floor and hold hands, moving as a single, stiff blob. The unique, wild movements of the individual dancers disappear. The dance floor "collapses" into a small, boring circle. This works for a simple march, but it fails for a complex jazz routine.
- Pruning (REAP): When you prune, you ask 50 dancers to leave the room. The remaining 50 dancers stay exactly where they were, keeping their unique moves and spacing. The dance floor is smaller, but the shape of the dance remains the same. The Head Chef can still call out specific dancers to do specific moves.
The Results: "Near-Lossless" Compression
The researchers tested this on massive models (some with 1 trillion parameters!).
- The Test: They tried to shrink the models by 50% (cutting the number of experts in half).
- The Outcome:
- Merging (Smoothie): The models got terrible at generative tasks like coding and writing. They became repetitive and confused.
- Pruning (REAP): The models stayed almost as smart as the original. On coding tasks, they were "near-lossless," meaning they barely lost any ability to write code, even after firing half the staff.
The Secret Sauce: Why REAP Works
The paper's main discovery is that control is everything.
In a complex AI, the "Head Chef" (Router) needs to be able to switch between experts instantly and precisely.
- Merging ties the experts together. Once Chef A and Chef B are blended, the Head Chef can't choose one over the other anymore. They are stuck with the average.
- Pruning keeps the Head Chef in full control. They can still say, "I need the Math Expert, not the Writing Expert," because those experts are still distinct individuals.
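A tiny sketch of why control survives pruning: removing experts just deletes their columns from the router, leaving the scores of the survivors exactly as they were. The specific surviving-expert indices below are, of course, made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
HIDDEN, NUM_EXPERTS = 16, 8  # illustrative sizes
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))
token = rng.normal(size=HIDDEN)

keep = [0, 2, 5, 7]              # experts that survived pruning
pruned_router = router[:, keep]  # just drop the fired chefs' columns

full_scores = token @ router
pruned_scores = token @ pruned_router

# Surviving experts keep their exact original scores, so the Head Chef
# can still pick out "exactly the Griller" among those who remain.
print(np.allclose(pruned_scores, full_scores[keep]))  # True
```

With merging, by contrast, two columns of the router must be fused into one, so the two original scores no longer exist to choose between.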
Summary for the Everyday Person
Imagine you have a library with 1,000 books. You want to fit it into a small backpack.
- Merging is like photocopying pages from Book A and Book B, stapling them together, and calling it "Book AB." You save space, but the story gets messy and confusing.
- Pruning (REAP) is like reading the library's checkout logs. You realize that 500 books are never checked out, or when they are, they don't add much value. You throw those away. You keep the 500 most important, unique books. The backpack is half the size, but the stories inside are still perfect.
The Takeaway: If you want to shrink a super-smart AI without making it dumb, don't blend its parts together. Just get rid of the parts that aren't doing much work, and let the best parts keep doing their unique jobs.