Imagine you are running a massive, high-end library (the AI model) that has thousands of specialized experts (the "Mixture of Experts" or MoE) sitting in a giant warehouse across town (the CPU memory). You are sitting in your office (the GPU) trying to answer questions.
Normally, here is how it works:
- You ask a question.
- A librarian (the "Router") looks at your question and decides, "Okay, for this specific math problem, we need the Math Expert and the Logic Expert."
- The librarian runs across town to the warehouse, grabs those two specific experts, brings them back to your office, and then you do the work.
- Then you ask the next question. The librarian runs back, grabs a different pair of experts (maybe a History Expert and a Poetry Expert), brings them back, and you work again.
The Problem: The trip across town (CPU to GPU transfer) is slow. The actual work you do with the experts is fast. So, you spend most of your time waiting for the librarian to run back and forth. This is called the "I/O bottleneck."
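The stop-and-go pattern above can be sketched in a few lines. This is a toy simulation, not the paper's code: the timings, function names, and expert IDs are all made up purely to show why the transfer dominates.

```python
import time

# Toy model of the "old way": for every step, fetch the chosen expert's
# weights from slow CPU-side storage, THEN run the fast GPU computation.
TRANSFER_S = 0.02   # pretend cost of one CPU->GPU expert transfer
COMPUTE_S = 0.002   # pretend cost of the actual expert computation

def fetch_expert(expert_id):
    """Stand-in for copying one expert's weights across the bus."""
    time.sleep(TRANSFER_S)
    return f"weights[{expert_id}]"

def compute(token, weights):
    """Stand-in for the (fast) expert computation on the GPU."""
    time.sleep(COMPUTE_S)

def generate_sequential(tokens, routed_experts):
    start = time.perf_counter()
    for token, expert_id in zip(tokens, routed_experts):
        weights = fetch_expert(expert_id)   # wait for the librarian...
        compute(token, weights)             # ...then do the quick work
    return time.perf_counter() - start

elapsed = generate_sequential(range(10), [3, 7, 1, 3, 9, 2, 5, 7, 0, 4])
# Nearly all of `elapsed` is transfer time: the I/O bottleneck.
print(f"{elapsed:.3f}s total, ~{10 * TRANSFER_S:.3f}s of it spent waiting on transfers")
```

With these toy numbers, roughly 90% of the wall-clock time is the librarian's round trips, which is exactly the imbalance the paper attacks.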
The Paper's Solution: "Speculating Experts"
The authors of this paper came up with a clever trick to stop you from waiting. They call it Expert Prefetching.
Instead of waiting for the librarian to finish the current task before running to get the next set of experts, they use a "crystal ball" (internal model signals) to guess who you will need next while you are still working on the current task.
Here is the analogy:
1. The Crystal Ball (The Quasi-Hidden State)
The paper suggests that the way you are currently thinking (your "internal state") actually contains a strong hint about what you will need next.
- Old Way: Wait until you finish the math problem, look at the result, then decide to call the History Expert.
- New Way: While you are solving the math problem, the system looks at your current thought process and says, "Hey, based on how you're thinking right now, you're almost certainly going to need the History Expert next."
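Here is a minimal sketch of that "crystal ball" idea: score the next layer's experts against the hidden state you already have, instead of waiting for the next layer's actual input. The dot-product-plus-softmax router below is a common MoE routing scheme used here for illustration; the paper's exact formulation may differ, and all names are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_experts(hidden_state, next_layer_router, top_k=2):
    """Score each next-layer expert against the CURRENT hidden state
    ("how you're thinking right now") and return the top-k to prefetch."""
    scores = [sum(h * w for h, w in zip(hidden_state, row))
              for row in next_layer_router]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k]

# Toy 4-expert router for the next layer (one weight row per expert).
router = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]
hidden = [0.9, 0.1]   # the current layer's activation
print(predict_next_experts(hidden, router))  # -> [0, 2]
```

The point is that this guess costs almost nothing; it reuses an activation the model has already computed.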
2. The Overlap (The Magic Trick)
This is where the speedup happens.
- The Old Way: You finish math → the librarian runs to get History → you wait → you start History.
- The New Way: You are solving math. At the same time, the librarian is already running to the warehouse to grab the History Expert. By the time you finish the math problem, the History Expert is already sitting at your desk, ready to go.
The paper proves that you can do the "running" (data transfer) and the "thinking" (computation) at the exact same time. This turns a slow, stop-and-go process into a smooth, continuous flow.
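A sketch of that overlap, assuming the prediction is correct: while the main thread "computes" step t, a background worker is already "transferring" the expert for step t+1. This uses a Python thread as a stand-in for an asynchronous CPU-to-GPU copy; the timings and names are illustrative, not the paper's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TRANSFER_S = 0.02   # pretend expert transfer time
COMPUTE_S = 0.02    # pretend compute time (comparable, so overlap is visible)

def fetch_expert(expert_id):
    time.sleep(TRANSFER_S)          # "librarian running to the warehouse"
    return f"weights[{expert_id}]"

def compute(token, weights):
    time.sleep(COMPUTE_S)           # "you, working at your desk"

def generate_overlapped(tokens, routed_experts):
    """Prefetch the expert for step t+1 WHILE computing step t."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(fetch_expert, routed_experts[0])  # first fetch can't overlap
    for t, token in enumerate(tokens):
        weights = future.result()   # usually already finished by now
        if t + 1 < len(tokens):
            # Start the NEXT transfer before doing the CURRENT compute.
            future = pool.submit(fetch_expert, routed_experts[t + 1])
        compute(token, weights)
    pool.shutdown()
    return time.perf_counter() - start

elapsed = generate_overlapped(list(range(10)), list(range(10)))
print(f"overlapped: {elapsed:.3f}s vs sequential: {10 * (TRANSFER_S + COMPUTE_S):.3f}s")
```

Because transfer and compute now run at the same time, the total is roughly max(transfer, compute) per step instead of their sum, which is the "smooth, continuous flow" the paper describes.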
3. What if the Crystal Ball is Wrong? (Speculative Execution)
Sometimes, the crystal ball might guess wrong. Maybe you needed the History Expert, but the system guessed the Science Expert.
- The Old Approach: If the guess is wrong, you have to stop, send the librarian back to the warehouse to get the real expert, and wait again. This defeats the purpose.
- The Paper's Approach: They found that even when the guess is slightly off, you can often just use the guessed expert anyway without ruining the answer. It's as if you needed a History book but grabbed a Science book: the Science book turns out to be close enough to the History book that the final answer is still accurate.
If the guess is really bad (which happens in the very first few layers of the AI), they use a tiny, lightweight "predictor" (a small neural network) to make a better guess, ensuring the librarian grabs the right person.
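Putting the speculative step together, the control flow looks roughly like this. This is a simplified sketch: treating a miss as "use the prefetched expert anyway" stands in for the paper's observation that mispredictions often barely hurt accuracy, and the strict-fallback path stands in for a corrective fetch. All names are hypothetical.

```python
def run_moe_layer(token, actual_expert, prefetched_expert, experts,
                  tolerate_misses=True):
    """Use the prefetched expert if the guess was right (or close enough);
    otherwise pay for a fresh, slow fetch of the correct expert."""
    if prefetched_expert == actual_expert:
        return experts[prefetched_expert](token)   # hit: zero waiting
    if tolerate_misses:
        return experts[prefetched_expert](token)   # miss, but use it anyway
    return experts[actual_expert](token)           # strict fallback: slow path

# Two toy "experts" that just transform a number differently.
experts = {0: lambda x: x + 100, 1: lambda x: x + 200}
print(run_moe_layer(5, actual_expert=1, prefetched_expert=1, experts=experts))  # -> 205
print(run_moe_layer(5, actual_expert=1, prefetched_expert=0, experts=experts))  # -> 105 (tolerated miss)
```

In the early layers, where guesses are unreliable, the paper swaps the cheap hidden-state guess for a small learned predictor, which in this sketch would simply mean computing `prefetched_expert` differently before calling `run_moe_layer`.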
The Results
By using this "guess while you work" strategy:
- Speed: They reduced the time it takes to generate each word by 5% to 14%. That might not sound like much, but in the world of AI, that's a huge win.
- Accuracy: The answers the AI gives are just as good as before. The "guessing" didn't make the AI dumber.
- Accessibility: This makes it possible to run these massive, super-smart AI models on regular computers (like your laptop or a single graphics card) without needing a supercomputer.
In Summary
Think of this paper as teaching a busy chef (the AI) how to prep the ingredients for the next dish while they are still cooking the current one. Instead of stopping to chop onions after the soup is done, they chop the onions while the soup simmers. The result? Dinner is served much faster, and the food tastes just as good.