MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

The Big Problem: The Over-Staffed Kitchen

Imagine a massive, high-end restaurant (the Multimodal Large Language Model or MLLM) that can cook anything: write poems, analyze X-rays, or describe a video. To handle this, the kitchen uses a "Mixture of Experts" (MoE) system.

Instead of one giant chef doing everything, the kitchen has hundreds of specialized chefs (called Experts).

One chef is great at chopping vegetables (Text).
One is great at plating desserts (Images).
One is great at reading recipes (Videos).

The Catch: In the current setup, for every single order (every word or pixel), the kitchen manager forces all the chefs to show up and taste the dish, even if only two of them actually know what to do. This is incredibly slow, expensive, and wasteful. It's like calling in the entire orchestra to play a single note.

The Failed Fix: The "One-Size-Fits-All" Rule

Scientists tried to fix this by creating a rule: "If a chef doesn't seem very interested in the order, send them home." This is called Expert Skipping.

However, when they applied this rule to the Multimodal kitchen, it backfired. The performance dropped because the rule was too dumb. It treated a "Text Order" and a "Video Order" exactly the same, and it treated the "Junior Chefs" (early layers) the same as the "Head Chefs" (deep layers).

The Solution: MoDES (The Smart Kitchen Manager)

The authors created MoDES (Multimodal Dynamic Expert Skipping). Think of MoDES as a super-smart, adaptive kitchen manager who makes two crucial observations:

1. Not All Chefs Are Created Equal (The "Layer" Insight)

The Insight: In the early stages of cooking (shallow layers), the chefs are doing the heavy lifting of prep work. If you skip them, the whole dish fails. In the later stages (deep layers), the chefs are just adding final garnishes. Skipping a few of them doesn't hurt much.
The Analogy: Imagine building a house. If you skip the foundation crew (shallow layers), the house collapses. If you skip a few people painting the trim (deep layers), the house still stands.
MoDES' Fix: It uses a Global-Modulated Local Gating system. It knows that "Foundation Chefs" are more important than "Trim Chefs" and protects the important ones while letting the less critical ones go.

2. Text and Images Need Different Rules (The "Modality" Insight)

The Insight: Text tokens and Image tokens behave differently. Text is like a delicate soufflé; it needs precise, careful handling by many experts. Images are like a big pot of soup; they are more robust, and many experts are actually just "redundant" (doing the same thing).
The Analogy: If you are editing a legal contract (Text), you need a strict, detailed review. If you are organizing a pile of rocks (Images), you can be a bit more casual.
MoDES' Fix: It uses Dual-Modality Thresholding. It has two different rules:
- Rule A (Text): "Be strict. Only send home the chefs who are 90% sure they aren't needed."
- Rule B (Images): "Be loose. Send home anyone who isn't 50% sure they are needed."
- This allows the system to skip way more experts for images without ruining the result.

The Secret Weapon: The "Frontier Search"

To figure out exactly how strict or loose these rules should be, the team had to find the "Goldilocks" zone: skip enough to be fast, but not so much that the food tastes bad.

Usually, finding this perfect balance takes days of trial and error. MoDES uses a Frontier Search Algorithm.

The Analogy: Imagine you are looking for the perfect temperature for a shower. A normal person turns the hot and cold knobs randomly until they find it (takes forever). MoDES is like a robot that knows the water gets hotter as you turn the knob one way and colder the other. It slides along the "edge" of the perfect zone instantly.
The Result: What used to take 2 days to calculate now takes 2 hours.

The Results: Fast, Cheap, and Delicious

When they tested MoDES on real-world models (like Qwen3-VL and Kimi-VL):

Speed: It made the models 2x faster at reading inputs and 1.2x faster at writing answers.
Efficiency: It successfully skipped 88% of the experts (sending 88% of the chefs home!) while actually improving the accuracy in some cases.
Why? Because by removing the "distracted" or "redundant" chefs, the remaining experts could focus better, and the computer didn't waste energy on unnecessary work.

Summary

MoDES is a smart system that stops Multimodal AI models from wasting energy. Instead of treating every task the same, it realizes that:

Early steps matter more than late steps.
Images are easier to skip than text.
Finding the right balance can be done instantly.

It's like upgrading from a chaotic kitchen where everyone shouts at once, to a streamlined, high-speed kitchen where only the right chefs show up for the right job, making the food faster and tastier.

1. Problem Statement

Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance in vision-language tasks by decoupling model size from computational cost through sparse activation. However, they still suffer from significant inference inefficiencies because a fixed number of experts are activated for every token, regardless of necessity.

Existing expert skipping methods (designed primarily for unimodal LLMs) attempt to deactivate redundant experts based on local routing probabilities. When applied directly to MLLMs, these methods cause severe performance degradation (e.g., >10% accuracy drop at high skipping ratios). The paper identifies two critical reasons for this failure:

Global Contribution Disregard: Existing methods rely solely on local routing probabilities within a single layer. They ignore the fact that experts in shallow layers have a disproportionately larger impact on the final output compared to those in deeper layers. Skipping shallow experts aggressively leads to error explosion.
Modality Gap: Text and vision tokens exhibit distinct behaviors when passing through Feed-Forward Networks (FFNs). Vision tokens induce smaller updates (higher cosine similarity pre/post-FFN) and are more orthogonal to weights, implying higher redundancy. Text tokens undergo larger updates. Treating both modalities with a uniform skipping strategy fails to account for these differences.

2. Methodology: MoDES

The authors propose MoDES (Multimodal Dynamic Expert Skipping), a training-free framework that adaptively skips experts while preserving accuracy. It consists of three core components:

A. Globally-Modulated Local Gating (GMLG)

To address the layer-wise contribution imbalance, MoDES introduces a mechanism that combines global layer importance with local routing probabilities.

Offline Calibration: The system calculates a global importance weight, $\alpha^{(l)}$ , for each layer $l$ by measuring the Kullback-Leibler (KL) divergence between the original model's output and a version where experts in that specific layer are skipped.
Importance Scoring: The final importance score $s^{(l)}_i$ for an expert $i$ in layer $l$ is computed as:
$s^{(l)}_i = \alpha^{(l)} \cdot \pi^{(l)}_i$
where $\pi^{(l)}_i$ is the local routing probability. This ensures that experts in critical shallow layers are preserved even if their local probability is moderate, while deeper layers can be skipped more aggressively.

B. Dual-Modality Thresholding (DMT)

To address the modality gap, MoDES employs separate thresholds for text and vision tokens.

Mechanism: Instead of a single global threshold, the method defines two thresholds: $\tau_t$ for text tokens and $\tau_v$ for vision tokens.
Decision Rule: An expert is skipped if its importance score falls below the threshold corresponding to the token's modality:
$\text{Skip if } s^{(l)}_i < (\tau_t \cdot I_t + \tau_v \cdot I_v)$
where $I_t$ and $I_v$ are indicator functions. This allows the system to skip more vision experts (which are more redundant) while retaining more text experts.

C. Frontier Search Algorithm

Determining the optimal pair of thresholds $(\tau_t, \tau_v)$ is a constrained optimization problem.

Objective: Minimize performance loss (KL divergence) while satisfying a target expert skipping ratio $\rho$ .
Efficiency: The authors exploit the monotonicity of performance loss and efficiency with respect to thresholds. Instead of an exhaustive search ( $O(N D^2)$ ), they use a Frontier Search algorithm ( $O(N D)$ ).
Result: This reduces the threshold search time from several days to a few hours (approx. 45x speedup in search time) without compromising the optimality of the solution.

3. Key Contributions

First Training-Free Framework for MoE MLLMs: MoDES is the first method to enable efficient expert skipping specifically for multimodal models without requiring retraining.
Novel Insights: The paper provides empirical evidence that expert contributions are layer-dependent (shallow > deep) and modality-dependent (vision > text redundancy), insights previously overlooked in unimodal skipping literature.
Algorithmic Innovation: The introduction of the Frontier Search algorithm significantly accelerates the hyperparameter tuning process for threshold selection.
Implementation Efficiency: The method is integrated directly into the router kernel using branch-free masked comparisons and Grouped GEMM, ensuring minimal overhead during inference.

4. Experimental Results

The authors evaluated MoDES on 3 model series (Kimi-VL, Qwen3-VL-MoE, InternVL-3.5) across 13 benchmarks (including image and video understanding tasks like MMBench, VideoMME, and ChartQA).

Accuracy Preservation:
- At an 88% expert skipping ratio on Qwen3-VL-MoE-30B-A3B-Instruct, MoDES retained 97.33% of the original model's performance, whereas previous methods (NAEE, MC-MoE, DiEP) dropped to ~86.66%.
- MoDES achieved a performance boost of up to 10.67% over baselines at high skipping ratios.
Inference Speed:
- Prefill Phase: Achieved a 2.16× speedup.
- Decoding Phase: Achieved a 1.26× speedup.
Quantization Compatibility: MoDES combines effectively with quantization (e.g., 1.5-bit and 2.5-bit), outperforming baselines like MC-MoE in maintaining accuracy under extreme compression.
Search Efficiency: The frontier search reduced the threshold optimization time from ~2 days to <2 hours for 30B parameter models.

5. Significance

MoDES addresses a critical bottleneck in the deployment of large-scale MoE MLLMs. By moving beyond "one-size-fits-all" skipping strategies, it demonstrates that context-aware (layer-specific) and modality-aware pruning is essential for maintaining high performance in multimodal settings. The framework offers a practical, training-free path to significantly reduce inference costs (both latency and compute) for next-generation vision-language models, making them more viable for real-time applications and resource-constrained environments.