Imagine you are the head chef of a massive, high-tech kitchen trying to cook the perfect meal (a super-smart AI) for a billion people. You have a strict budget: you can only use a certain amount of electricity and time (compute) to cook.
In the past, chefs thought the best way to scale up was just to hire more cooks and buy more ingredients. But recently, a new kitchen style called Mixture-of-Experts (MoE) became popular. Instead of every cook working on every dish, you have a team of specialists. For a specific ingredient (like a word in a sentence), only a few "expert" cooks are called in to work, while the rest take a break. This saves a ton of energy.
However, this new kitchen style created a confusing problem for the head chef: How should we split the electricity bill?
Should we spend more money on the Head Chef (the "Attention" layer, who decides which ingredients go together and understands the context)? Or should we spend more on the Specialist Cooks (the "Expert" layers, who actually chop, fry, and season the ingredients)?
The Paper's Big Discovery
This paper is like a new rulebook for that Head Chef. The authors, researchers from HKUST and Ant Group, discovered that there is no single "perfect" way to split the budget. Instead, the perfect split changes depending on two things:
- How big your kitchen is getting (Total Compute).
- How many specialists you have (Sparsity).
They found a "secret formula" (a power law) that tells you exactly how to adjust your budget as you grow.
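To make the idea concrete, here is a toy sketch of what a rule of that *shape* looks like in code: a power law in total compute, where the attention share shrinks smoothly as the budget grows. To be clear, the function name and every coefficient below are made-up placeholders for illustration, not the paper's actual fitted formula.

```python
# Toy sketch of a power-law budget rule. The coefficients a and b are
# HYPOTHETICAL placeholders, not the paper's fitted values.

def attention_budget_fraction(total_compute, a=0.9, b=0.03):
    """Fraction of the compute budget to spend on the 'Conductor'
    (attention). A power law: as total compute grows, the attention
    share shrinks, and the 'Soloists' (experts) get the rest."""
    return a * total_compute ** (-b)

for compute in [1e18, 1e20, 1e22]:
    frac = attention_budget_fraction(compute)
    print(f"compute={compute:.0e}  attention={frac:.1%}  experts={1 - frac:.1%}")
```

The only point of the sketch is that the split is a smooth function of scale, not a fixed ratio you set once and forget.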
The Analogy: The Orchestra vs. The Soloists
To understand their findings, imagine your AI is an orchestra.
- The Attention Layer is the Conductor. The conductor makes sure all the instruments play together, understands the tempo, and knows when the violin should talk to the drum.
- The Expert Layers are the Soloists. These are the musicians who actually play the complex notes. In an MoE model, only a few soloists play at any given moment.
The Old Way (The Mistake):
For a long time, developers thought, "I'll just keep the ratio of Conductor to Soloists the same, no matter how big the orchestra gets." They might say, "Always spend 30% of the budget on the Conductor and 70% on the Soloists."
The New Discovery (The "Law"):
The paper says: "No! That's wrong!"
- When the orchestra is small: You need a strong Conductor to keep everyone together. The Soloists don't need much help yet.
- When the orchestra gets HUGE: The Conductor can only do so much. If you keep giving the Conductor the same amount of money, they get overwhelmed. But if you give more money to the Soloists, they can learn incredibly complex, specialized skills that make the music sound amazing.
The Rule: As your AI gets bigger, you should shift more of your budget toward the Expert Soloists and less toward the Conductor.
The "Sparsity" Twist
There's a second part to the rule: how many soloists are playing at any one time?
- Low Sparsity (Many active experts): If you have a huge team of experts working at once, you can afford to throw a lot of money at them. They will use it well.
- High Sparsity (Few active experts): If you only activate a tiny handful of experts, giving them too much money is wasteful. They get "full" and can't use the extra power. In this case, keep a larger share of the budget on the Conductor to make sure those few experts are used perfectly.
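The two-part rule (scale plus sparsity) can be combined into a single toy function. Again, this is a hedged sketch: the functional form, the definition of `sparsity` used here (total experts divided by active experts, so higher means sparser), and all coefficients are illustrative assumptions, not the paper's fit.

```python
# Toy sketch combining both effects. Coefficients a, b, c are
# HYPOTHETICAL, chosen only to show the direction of each trend.

def attention_fraction(total_compute, sparsity, a=0.9, b=0.03, c=0.05):
    """Attention ('Conductor') share of the compute budget.
    - Shrinks as total_compute grows (power law in scale).
    - Grows with sparsity (fewer active experts, so the Conductor
      matters relatively more)."""
    return a * total_compute ** (-b) * sparsity ** c

# Same compute budget, two sparsity levels:
print(attention_fraction(1e20, sparsity=4))   # denser: smaller attention share
print(attention_fraction(1e20, sparsity=64))  # sparser: larger attention share
```

The sign of each exponent is what matters: negative in compute (shift toward the Soloists as you scale up), positive in sparsity (shift back toward the Conductor as fewer experts are active).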
Why Does This Matter?
Imagine you have a fixed amount of money to build a new AI.
- Without this paper: You might build a giant model that is "unbalanced." You might have a massive team of experts but a weak conductor, or vice versa. The result? The AI is expensive but not very smart.
- With this paper: You can use their formula to build a model that is perfectly balanced for your specific budget. You get the maximum amount of "smartness" for every dollar you spend.
The Bottom Line
The researchers didn't just guess; they cooked thousands of different "meals" (trained thousands of models) with different budgets and different team sizes. They measured the taste (performance) and found a mathematical pattern.
In simple terms:
"Don't treat your AI like a static machine. As it grows, you must change how you feed it. Give more power to the 'specialists' as the model gets bigger, but adjust this based on how many specialists you have active at once. If you follow this new rule, you can build smarter, cheaper, and more efficient AI."
This is a "scaling law" for the internal wiring of AI, helping engineers stop guessing and start designing with precision.