EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

The paper introduces EvoESAP, an evolutionary framework that searches for non-uniform, layer-wise sparsity allocations in Sparse Mixture-of-Experts models. The search is guided by ESAP, a stable metric inspired by speculative decoding; the resulting pruned models perform significantly better on open-ended generation while staying competitive with traditional uniform pruning on standard accuracy benchmarks.

Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan

Published 2026-03-09

Imagine you have a massive, incredibly smart library (a Large Language Model) filled with thousands of specialized librarians (called "Experts"). When you ask a question, the library doesn't use all the librarians at once. Instead, a smart manager (the "Router") picks just the top few experts who are best suited to answer your specific question. This is called a Sparse Mixture-of-Experts (MoE) model. It's brilliant because it's fast and efficient, but there's a catch: even though you only use a few librarians at a time, you still have to keep all of them in the building. This takes up a huge amount of space (memory) and makes the building expensive to run.

The goal of this paper is to shrink the library by firing some of the librarians so the building fits in a smaller space, without making the library forget how to write good stories or solve math problems.

The Problem: The "One-Size-Fits-All" Mistake

Previous attempts to shrink these libraries tried to fire librarians in a very simple way: "Let's fire 25% of the librarians in every single room, no matter what."

Imagine a library with 16 rooms.

  • Room 1 might be the "Introduction" room. It needs very few experts.
  • Room 10 might be the "Complex Logic" room. It needs almost all its experts to function.
  • Room 16 might be the "Conclusion" room. It needs a different mix.

The old method treated every room exactly the same. It fired 25% of the staff in the "Complex Logic" room (disaster!) and 25% in the "Introduction" room (maybe fine, but maybe we fired the one genius who was needed). This "Uniform" approach often made the library worse at creative tasks like writing code or solving math puzzles, even if it was okay at simple multiple-choice questions.

The Solution: EvoESAP (The Smart Architect)

The authors propose a new method called EvoESAP. Think of it as hiring a Smart Architect who doesn't just fire people randomly, but carefully decides where to cut staff based on how important each room is.

Here is how it works, step-by-step:

1. The "Who to Fire" List (Fixed)

First, the architects look at every single room and rank the librarians from "Least Useful" to "Most Useful." Let's say they decide that in Room 10, Librarian A is the worst, and Librarian B is the best. This ranking stays the same for now.
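The per-room ranking step can be sketched in a few lines. This is a minimal illustration, assuming an expert's importance is measured by its average router-gate weight on a small calibration set; the paper's actual scoring rule may differ, and all names here are hypothetical.

```python
import numpy as np

def rank_experts(gate_scores: np.ndarray) -> np.ndarray:
    """Rank experts in one layer from least to most useful.

    gate_scores: (num_tokens, num_experts) router probabilities collected
    on a small calibration set. Experts the router rarely selects are
    deemed least useful. This scoring rule is an assumption for
    illustration, not necessarily the paper's exact criterion.
    """
    importance = gate_scores.mean(axis=0)  # average gate weight per expert
    return np.argsort(importance)          # indices: least -> most useful

# Toy example: 4 experts, 3 calibration tokens.
gates = np.array([[0.1, 0.5, 0.30, 0.10],
                  [0.0, 0.6, 0.30, 0.10],
                  [0.2, 0.4, 0.35, 0.05]])
order = rank_experts(gates)  # pruning removes experts from the front of this list
```

With this ranking fixed per layer, "prune k experts in this layer" simply means dropping the first k indices of `order`.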

2. The "How Many to Fire" Puzzle (The Search)

Now comes the tricky part. We have a budget: "We need to fire a total of 500 librarians across the whole building."

  • Old Way: Fire about 31 librarians from every single room (500 / 16 ≈ 31).
  • EvoESAP Way: The architect uses a Genetic Algorithm (like natural selection) to try thousands of different combinations.
    • Try 1: Fire 50 from Room 1, 0 from Room 10, 100 from Room 15.
    • Try 2: Fire 10 from Room 1, 80 from Room 10, 20 from Room 15.
    • Try 3: ...and so on.

The goal is to find the specific mix where you fire the least important people from the most critical rooms, and the most important people from the least critical rooms, all while hitting your total budget.
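The search described above can be sketched as a small genetic algorithm over per-layer prune counts. This is a hedged sketch, not the paper's implementation: the population size, mutation step, and repair strategy are made-up illustration choices, and the toy fitness function stands in for the ESAP score the authors actually use.

```python
import random

NUM_LAYERS, BUDGET, MAX_PER_LAYER = 16, 500, 64  # toy sizes for illustration

def repair(alloc):
    """Clip an allocation and nudge it until it sums exactly to BUDGET."""
    alloc = [min(max(a, 0), MAX_PER_LAYER) for a in alloc]
    diff = BUDGET - sum(alloc)
    while diff != 0:
        i = random.randrange(NUM_LAYERS)
        step = 1 if diff > 0 else -1
        if 0 <= alloc[i] + step <= MAX_PER_LAYER:
            alloc[i] += step
            diff -= step
    return alloc

def evolve(fitness, pop_size=20, generations=50):
    """Genetic search over per-layer prune counts under a total budget.
    `fitness` scores an allocation (higher is better); in the paper this
    role is played by the ESAP metric, here it is any callable."""
    pop = [repair([random.randrange(MAX_PER_LAYER + 1) for _ in range(NUM_LAYERS)])
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]               # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, NUM_LAYERS)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(NUM_LAYERS)        # point mutation
            child[i] += random.choice([-4, 4])
            children.append(repair(child))
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: reward pruning the first rooms, punish pruning the middle ones.
best = evolve(lambda a: sum(a[:4]) - sum(a[4:12]))
```

Every candidate always spends exactly the budget, so the algorithm only explores *where* to cut, never *how much* in total.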

3. The "Magic Test" (ESAP)

How does the architect know if a new layout works? Usually, you'd have to let the library run for hours to see if it still answers questions well. That's too slow.

Instead, they invented a Magic Test called ESAP (Expected Speculative Acceptance Proxy).

  • The Analogy: Imagine the original library is a "Master Chef." The new, shrunk library is a "Student Chef."
  • Normally, to test the student, you'd have them cook a whole meal and taste it (slow and expensive).
  • The ESAP Trick: The Master Chef looks at the Student's ingredients and plan before they even start cooking. The Master Chef asks, "If you were to cook this, how likely is it that I would agree with your choices?"
  • If the Student's plan matches the Master's plan perfectly, the test score is high. If the Student is going to mess up, the score is low.
  • This test is incredibly fast and tells the architect exactly which "firing plan" keeps the library smartest.
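An acceptance-style proxy in this spirit can be written down concretely. The sketch below uses the standard speculative-decoding acceptance rule, where the chance the master model accepts a student draft at one position is the overlap of their next-token distributions; this is an illustration of the idea, not ESAP's exact definition, and the array names are hypothetical.

```python
import numpy as np

def acceptance_proxy(p_full: np.ndarray, q_pruned: np.ndarray) -> float:
    """Expected speculative acceptance rate between two models.

    p_full, q_pruned: (num_positions, vocab_size) next-token distributions
    from the original ("master") and pruned ("student") models on the same
    prompts. Under the standard speculative-decoding acceptance rule, the
    probability the master accepts a student draft at one position is
    sum_x min(p(x), q(x)); we average that overlap over positions.
    """
    per_position = np.minimum(p_full, q_pruned).sum(axis=1)
    return float(per_position.mean())

# Toy vocab of 3 tokens, 2 positions.
p = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.4, 0.1]])
q = np.array([[0.6, 0.3, 0.1],   # close to the master -> high overlap (0.9)
              [0.1, 0.2, 0.7]])  # disagrees -> low overlap (0.4)
score = acceptance_proxy(p, q)   # mean of 0.9 and 0.4 -> 0.65
```

A score near 1.0 means the pruned model's plans almost always match the original's, so a candidate pruning layout can be scored with a single cheap forward pass instead of hours of benchmark runs.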

The Results: Why It Matters

The paper tested this on several huge AI models (like OLMoE, ERNIE, and Qwen). The results were impressive:

  • Better at Creative Stuff: When they used the "Smart Architect" (EvoESAP) instead of the "One-Size-Fits-All" method, the models got much better at writing code and solving math problems. In one case, the math score jumped by nearly 20%.
  • Same at Simple Stuff: They didn't lose any ability to answer simple multiple-choice questions.
  • The "Non-Uniform" Secret: The best layouts weren't uniform. Sometimes, the architect decided to fire almost everyone from the first few rooms and keep almost everyone in the middle rooms. This "Non-Uniform" distribution was the key to keeping the model smart.

Summary

Think of EvoESAP as a way to downsize a company without firing the wrong people. Instead of firing 10% of the staff from every department (which might kill the R&D team while barely touching the janitorial staff), it analyzes the whole company and says, "We can fire 50% of the janitors and 10% of the engineers, but we must keep 95% of the R&D team."

By using a fast "Magic Test" to find the perfect balance, they managed to shrink these giant AI models significantly while making them smarter at the things that matter most: writing, coding, and reasoning.