REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
This paper introduces REAP, a one-shot expert pruning method for compressing Mixture-of-Experts models. REAP scores experts using router gate-values and activation norms to minimize reconstruction error, and it outperforms existing expert-merging techniques on generative tasks, achieving near-lossless compression even when half of the experts are removed.
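The saliency criterion described above can be sketched as a small score-and-prune routine. This is a minimal illustrative reading, not the paper's implementation: the function names, the dict-based interface, and the specific combination (mean over routed tokens of gate value times expert-output L2 norm) are assumptions made here to show the general shape of gate-and-norm based expert scoring.

```python
import numpy as np

def expert_saliency(gate_values, expert_outputs):
    """Hypothetical per-expert saliency: mean over routed tokens of
    the router gate value times the L2 norm of the expert's output.

    gate_values:    dict expert_id -> (n_tokens,) array of gate weights
    expert_outputs: dict expert_id -> (n_tokens, hidden) array of outputs
    """
    scores = {}
    for e in gate_values:
        g = np.asarray(gate_values[e], dtype=float)            # (n_tokens,)
        norms = np.linalg.norm(expert_outputs[e], axis=1)      # (n_tokens,)
        scores[e] = float((g * norms).mean()) if g.size else 0.0
    return scores

def prune_experts(scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of experts by saliency score."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])
```

At 50% expert reduction, `keep_ratio=0.5` retains the half of the experts that contribute most (by this proxy) to the layer's output, and the router is then renormalized over the surviving experts.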