EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

The paper introduces EvoESAP, an evolutionary framework that searches for non-uniform, layer-wise sparsity allocations in Sparse Mixture-of-Experts models. The search is guided by ESAP, a stable metric inspired by speculative decoding; the resulting pruned models perform significantly better on open-ended generation while staying competitive with traditional uniform pruning on standard accuracy benchmarks.

Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan

Published 2026-03-09

Imagine you have a massive, incredibly smart library (a Large Language Model) filled with thousands of specialized librarians (called "Experts"). When you ask a question, the library doesn't use all the librarians at once. Instead, a smart manager (the "Router") picks just the top few experts who are best suited to answer your specific question. This is called a Sparse Mixture-of-Experts (MoE) model. It's brilliant because it's fast and efficient, but there's a catch: even though you only use a few librarians at a time, you still have to keep all of them in the building. This takes up a huge amount of space (memory) and makes the building expensive to run.

The goal of this paper is to shrink the library by firing some of the librarians so the building fits in a smaller space, without making the library forget how to write good stories or solve math problems.

The Problem: The "One-Size-Fits-All" Mistake

Previous attempts to shrink these libraries tried to fire librarians in a very simple way: "Let's fire 25% of the librarians in every single room, no matter what."

Imagine a library with 16 rooms.

  • Room 1 might be the "Introduction" room. It needs very few experts.
  • Room 10 might be the "Complex Logic" room. It needs almost all its experts to function.
  • Room 16 might be the "Conclusion" room. It needs a different mix.

The old method treated every room exactly the same. It fired 25% of the staff in the "Complex Logic" room (disaster!) and 25% in the "Introduction" room (maybe fine, but maybe we fired the one genius who was needed). This "Uniform" approach often made the library worse at creative tasks like writing code or solving math puzzles, even if it was okay at simple multiple-choice questions.

The Solution: EvoESAP (The Smart Architect)

The authors propose a new method called EvoESAP. Think of it as hiring a Smart Architect who doesn't just fire people randomly, but carefully decides where to cut staff based on how important each room is.

Here is how it works, step-by-step:

1. The "Who to Fire" List (Fixed)

First, the architects look at every single room and rank the librarians from "Least Useful" to "Most Useful." Let's say they decide that in Room 10, Librarian A is the worst, and Librarian B is the best. This ranking stays the same for now.
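The per-room ranking step can be sketched in a few lines. This is a minimal illustration, assuming an expert's importance is measured by its average router-gate weight on a small calibration set; the paper's actual scoring rule may differ, and all names here are hypothetical.

```python
import numpy as np

def rank_experts(gate_scores: np.ndarray) -> np.ndarray:
    """Rank experts in one layer from least to most useful.

    gate_scores: (num_tokens, num_experts) router probabilities collected
    on a small calibration set. Experts the router rarely selects are
    deemed least useful. This scoring rule is an assumption for
    illustration, not necessarily the paper's exact criterion.
    """
    importance = gate_scores.mean(axis=0)  # average gate weight per expert
    return np.argsort(importance)          # indices: least -> most useful

# Toy example: 4 experts, 3 calibration tokens.
gates = np.array([[0.1, 0.5, 0.30, 0.10],
                  [0.0, 0.6, 0.30, 0.10],
                  [0.2, 0.4, 0.35, 0.05]])
order = rank_experts(gates)  # pruning removes experts from the front of this list
```

With this ranking fixed per layer, "prune k experts in this layer" simply means dropping the first k indices of `order`.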

2. The "How Many to Fire" Puzzle (The Search)

Now comes the tricky part. We have a budget: "We need to fire a total of 500 librarians across the whole building."

  • Old Way: Fire about 31 librarians from every single room (500 / 16 ≈ 31).
  • EvoESAP Way: The architect uses a Genetic Algorithm (like natural selection) to try thousands of different combinations.
    • Try 1: Fire 50 from Room 1, 0 from Room 10, 100 from Room 15.
    • Try 2: Fire 10 from Room 1, 80 from Room 10, 20 from Room 15.
    • Try 3: ...and so on.

The goal is to find the specific mix where you fire the least important people from the most critical rooms, and the most important people from the least critical rooms, all while hitting your total budget.
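The search described above can be sketched as a small genetic algorithm over per-layer prune counts. This is a hedged sketch, not the paper's implementation: the population size, mutation step, and repair strategy are made-up illustration choices, and the toy fitness function stands in for the ESAP score the authors actually use.

```python
import random

NUM_LAYERS, BUDGET, MAX_PER_LAYER = 16, 500, 64  # toy sizes for illustration

def repair(alloc):
    """Clip an allocation and nudge it until it sums exactly to BUDGET."""
    alloc = [min(max(a, 0), MAX_PER_LAYER) for a in alloc]
    diff = BUDGET - sum(alloc)
    while diff != 0:
        i = random.randrange(NUM_LAYERS)
        step = 1 if diff > 0 else -1
        if 0 <= alloc[i] + step <= MAX_PER_LAYER:
            alloc[i] += step
            diff -= step
    return alloc

def evolve(fitness, pop_size=20, generations=50):
    """Genetic search over per-layer prune counts under a total budget.
    `fitness` scores an allocation (higher is better); in the paper this
    role is played by the ESAP metric, here it is any callable."""
    pop = [repair([random.randrange(MAX_PER_LAYER + 1) for _ in range(NUM_LAYERS)])
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]               # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, NUM_LAYERS)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(NUM_LAYERS)        # point mutation
            child[i] += random.choice([-4, 4])
            children.append(repair(child))
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: reward pruning the first rooms, punish pruning the middle ones.
best = evolve(lambda a: sum(a[:4]) - sum(a[4:12]))
```

Every candidate always spends exactly the budget, so the algorithm only explores *where* to cut, never *how much* in total.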

3. The "Magic Test" (ESAP)

How does the architect know if a new layout works? Usually, you'd have to let the library run for hours to see if it still answers questions well. That's too slow.

Instead, they invented a Magic Test called ESAP (Expected Speculative Acceptance Proxy).

  • The Analogy: Imagine the original library is a "Master Chef." The new, shrunk library is a "Student Chef."
  • Normally, to test the student, you'd have them cook a whole meal and taste it (slow and expensive).
  • The ESAP Trick: The Master Chef looks at the Student's ingredients and plan before they even start cooking. The Master Chef asks, "If you were to cook this, how likely is it that I would agree with your choices?"
  • If the Student's plan matches the Master's plan perfectly, the test score is high. If the Student is going to mess up, the score is low.
  • This test is incredibly fast and tells the architect exactly which "firing plan" keeps the library smartest.
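An acceptance-style proxy in this spirit can be written down concretely. The sketch below uses the standard speculative-decoding acceptance rule, where the chance the master model accepts a student draft at one position is the overlap of their next-token distributions; this is an illustration of the idea, not ESAP's exact definition, and the array names are hypothetical.

```python
import numpy as np

def acceptance_proxy(p_full: np.ndarray, q_pruned: np.ndarray) -> float:
    """Expected speculative acceptance rate between two models.

    p_full, q_pruned: (num_positions, vocab_size) next-token distributions
    from the original ("master") and pruned ("student") models on the same
    prompts. Under the standard speculative-decoding acceptance rule, the
    probability the master accepts a student draft at one position is
    sum_x min(p(x), q(x)); we average that overlap over positions.
    """
    per_position = np.minimum(p_full, q_pruned).sum(axis=1)
    return float(per_position.mean())

# Toy vocab of 3 tokens, 2 positions.
p = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.4, 0.1]])
q = np.array([[0.6, 0.3, 0.1],   # close to the master -> high overlap (0.9)
              [0.1, 0.2, 0.7]])  # disagrees -> low overlap (0.4)
score = acceptance_proxy(p, q)   # mean of 0.9 and 0.4 -> 0.65
```

A score near 1.0 means the pruned model's plans almost always match the original's, so a candidate pruning layout can be scored with a single cheap forward pass instead of hours of benchmark runs.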

The Results: Why It Matters

The paper tested this on several huge AI models (like OLMoE, ERNIE, and Qwen). The results were impressive:

  • Better at Creative Stuff: When they used the "Smart Architect" (EvoESAP) instead of the "One-Size-Fits-All" method, the models got much better at writing code and solving math problems. In one case, the math score jumped by nearly 20%.
  • Same at Simple Stuff: They didn't lose any ability to answer simple multiple-choice questions.
  • The "Non-Uniform" Secret: The best layouts weren't uniform. Sometimes, the architect decided to fire almost everyone from the first few rooms and keep almost everyone in the middle rooms. This "Non-Uniform" distribution was the key to keeping the model smart.

Summary

Think of EvoESAP as a way to downsize a company without firing the wrong people. Instead of firing 10% of the staff from every department (which might kill the R&D team while barely touching the janitorial staff), it analyzes the whole company and says, "We can fire 50% of the janitors and 10% of the engineers, but we must keep 95% of the R&D team."

By using a fast "Magic Test" to find the perfect balance, they managed to shrink these giant AI models significantly while making them smarter at the things that matter most: writing, coding, and reasoning.