The Big Picture: The "Expert Kitchen" Problem
Imagine you are running a massive, high-end restaurant (a Large Language Model) with 100 different specialized chefs (Experts).
- Some chefs are amazing at baking.
- Some are masters of grilling.
- Some are experts at making sushi.
In a standard setup, you keep all 100 chefs in the kitchen at all times. This is great for quality, but it's impossible if you only have a tiny kitchen (like a mobile phone or a laptop with limited memory). You can't fit 100 chefs in a studio apartment!
The Solution (Expert Offloading):
To solve this, you decide to keep only 5 chefs in the small kitchen (Fast Memory/GPU) and store the other 95 in a huge warehouse across town (Slow Memory/CPU). When a customer orders a dish, you check: "Do we have the right chef in the kitchen?"
- Yes: Great! Cook the dish instantly.
- No: You have to run across town to fetch the chef, or cook it slowly in the warehouse. This takes time and slows everything down.
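The hit-or-fetch logic above can be sketched in a few lines. This is a toy illustration only (the slot count, timings, and FIFO eviction are made-up assumptions, not the paper's or any real inference framework's):

```python
# Toy sketch of expert offloading (illustrative only; names and timings
# are hypothetical, not taken from the paper or a real framework).

FAST_MEMORY_SLOTS = 5      # experts that fit in GPU/fast memory
HIT_COST_MS = 1            # compute with an expert already resident
MISS_COST_MS = 50          # fetch an expert from CPU/slow memory first

def run_tokens(expert_stream, cache_size=FAST_MEMORY_SLOTS):
    """Process a stream of required expert IDs, loading on demand (FIFO eviction)."""
    resident = []          # experts currently in fast memory, oldest first
    total_ms = 0
    for expert in expert_stream:
        if expert in resident:
            total_ms += HIT_COST_MS       # chef already in the kitchen
        else:
            total_ms += MISS_COST_MS      # run to the warehouse
            if len(resident) == cache_size:
                resident.pop(0)           # evict the oldest expert
            resident.append(expert)
    return total_ms

# A "sticky" sequence is cheap; a "jumpy" one of the same length is not.
sticky = [1, 1, 2, 1, 2, 2, 1, 2]
jumpy  = [1, 7, 3, 9, 5, 8, 2, 6]
print(run_tokens(sticky), run_tokens(jumpy))  # → 106 400
```

Eight tokens either way, but the jumpy stream misses on every token, so it spends almost four times as long just hauling experts in and out of fast memory.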
The Core Question: Do Customers Stick to One Chef?
The researchers asked a simple question: Do customers tend to order from the same few chefs for a long time, or do they jump around randomly?
- Scenario A (High Consistency): A customer orders a steak, then a steak sauce, then a steak dessert. They stay in the "Grill" zone. You only need the Grill Chef in the kitchen. This is easy to manage.
- Scenario B (Low Consistency): A customer orders a sushi roll, then a pizza, then a salad, then a cake. They jump between zones constantly. You have to keep running back and forth to the warehouse to swap chefs. This is chaotic and slow.
The paper calls this "Local Routing Consistency." It measures how likely a model is to stick with the same group of experts for a sequence of words (tokens).
The Two New Rulers (Metrics)
To measure this "stickiness," the authors invented two tools:
SRP (The "Best Guess" Score):
Imagine you are a manager trying to predict which chefs will be needed for the next 10 orders. If you could pick one group of chefs to stay in the kitchen for those 10 orders and be right most of the time, your model has High Consistency. If you have to swap chefs every single order, your model has Low Consistency.
- Analogy: If you can pack a suitcase with just "Beach clothes" and be happy for the whole trip, you have high consistency. If you need "Winter gear," "Swimwear," and "Formal wear" all in the same week, you have low consistency.
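A rough sketch of this idea in code, assuming a simplified setup (this is an illustration of "pick one fixed group and see how well it covers the segment", not the paper's exact SRP formula):

```python
# Hypothetical sketch of the SRP idea: for a segment of tokens, choose the
# single best fixed set of experts and measure what fraction of the
# segment's expert activations it covers.

from collections import Counter

def segment_coverage(activations, cache_size):
    """activations: one set of expert IDs per token in the segment.
    Returns the fraction covered by the best fixed cache of `cache_size`."""
    counts = Counter(e for token in activations for e in token)
    best_set = {e for e, _ in counts.most_common(cache_size)}
    total = sum(len(token) for token in activations)
    covered = sum(len(token & best_set) for token in activations)
    return covered / total

# High-consistency segment: every token reuses experts {0, 1}.
consistent = [{0, 1}, {0, 1}, {1, 0}, {0, 1}]
# Low-consistency segment: every token wants a fresh pair of experts.
scattered  = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
print(segment_coverage(consistent, 2))  # → 1.0
print(segment_coverage(scattered, 2))   # → 0.25
```

A score near 1.0 means one "packed suitcase" of experts serves the whole segment; a low score means no fixed group of chefs can keep up.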
SCH (The "Cache Hit" Score):
This simulates a real-world scenario where your kitchen has a strict limit (e.g., only 5 chefs allowed). It asks: "If we use a smart system to swap chefs based on what's coming up next, how often do we get a 'Hit' (the chef is already there)?"
- Analogy: How often does your smart fridge know you're about to make toast, so it keeps the toaster ready?
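To make the measurement concrete, here is a minimal hit-rate simulation. Note the hedge: the paper's metric assumes a smarter, lookahead-aware cache, while this sketch uses plain least-recently-used (LRU) eviction for brevity:

```python
# Illustrative SCH-style measurement (hypothetical code): run a fixed-size
# expert cache over a token stream and report the hit rate. Plain LRU is
# used here; the paper's metric assumes a smarter replacement policy.

from collections import OrderedDict

def cache_hit_rate(expert_stream, cache_size):
    cache = OrderedDict()              # expert ID -> None, ordered by recency
    hits = 0
    for expert in expert_stream:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as recently used
        else:
            if len(cache) == cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[expert] = None
    return hits / len(expert_stream)

sticky = [1, 2, 1, 2, 1, 2, 1, 2]
jumpy  = [1, 2, 3, 4, 5, 6, 7, 8]
print(cache_hit_rate(sticky, 2))  # → 0.75
print(cache_hit_rate(jumpy, 2))   # → 0.0
```

The sticky stream pays for its two experts once and then hits forever; the jumpy stream never hits, which is exactly the "running back and forth to the warehouse" failure mode.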
What They Discovered
The team tested 20 different "restaurants" (AI models) and found some surprising things:
1. Not All Models Are Created Equal
Some models (like LLaMA-MoE-v2 and OLMoE) are very "sticky." Once they start talking about a topic, they keep using the same experts. These are perfect for small devices because you can cache a few experts and rarely need to run to the warehouse.
Other models (like SwitchTransformers or Jamba) are "jumpy." They switch experts constantly. Trying to run these on a phone with offloading would be a disaster because the system would be stuck fetching chefs from the warehouse all day.
2. The Trade-off: Balance vs. Consistency
Usually, you want all chefs to be equally busy (Load Balance) so no one gets tired. But the paper found that high consistency often means low balance.
- Analogy: If the "Grill Chef" is the only one working for the next hour, they are super busy (unbalanced), but you don't need to swap chefs (high consistency).
- Good News: Some models manage to be both consistent and globally balanced. They have "Specialized Chefs" who handle specific topics (like Math or Coding). When the topic changes, a different specialist takes over, but they stay consistent within that topic.
3. The "Shared Chef" Trap
Some models have "Shared Experts" (Chefs who do everything). The paper found that having these generalists actually hurts consistency.
- Analogy: If you have a "Jack of All Trades" chef who does everything, the manager might send random orders to them, making it hard to predict who is needed next. It's better to have distinct specialists.
4. The Magic Number: 2x
The researchers found the "Sweet Spot" for memory. If your model activates 2 experts at a time, you should keep 4 experts in your cache (Fast Memory).
- Rule of Thumb: Use a cache roughly twice the number of experts activated per token. This gives you the best trade-off between speed and memory usage; larger caches yield diminishing returns.
Why This Matters
This paper is a guidebook for engineers building AI for phones, tablets, and edge devices.
- If you are building an AI for a phone: Don't just pick the smartest model. Pick the one with High Local Routing Consistency. It will run faster and use less battery because it won't be constantly swapping data.
- If you are designing a new AI: Avoid "Shared Experts" and try to encourage "Specialized Experts" (like Math experts or Coding experts). This makes the model friendlier to small devices.
Summary in One Sentence
Some AI models are like loyal customers who stick to one menu, making them easy to run on small phones; others are like indecisive diners who change orders every second, making them a nightmare to run without a massive computer. This paper teaches us how to tell the difference and how to design better models for the future.