Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scaling Law for Dynamic Model Design
This paper establishes a new scaling law for Mixture-of-Experts models by deriving an explicit power-law formula for the optimal compute allocation ratio between expert and attention layers, enabling more efficient model design under fixed computational budgets.
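As a sketch of the general shape such a law takes (the symbols $a$, $b$, $r^{*}$, and $C$ below are illustrative assumptions, not the paper's fitted quantities), a power law for the optimal expert-to-attention compute ratio $r^{*}$ as a function of the total compute budget $C$ could be written as:

\[
  r^{*}(C) = a\,C^{b},
  \qquad
  C_{\text{expert}} = \frac{r^{*}}{1+r^{*}}\,C,
  \qquad
  C_{\text{attn}} = \frac{1}{1+r^{*}}\,C,
\]

where $a$ and $b$ would be constants fitted to empirical training runs. Under this form, a designer with a fixed budget $C$ evaluates $r^{*}(C)$ once and splits compute between expert and attention layers accordingly; the exponent $b$ captures how the optimal ratio shifts as budgets grow.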