MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

This paper introduces MoDES, a training-free framework that accelerates Mixture-of-Experts Multimodal Large Language Models. It combines a globally-modulated local gating mechanism with dual-modality thresholding to adaptively skip redundant experts, significantly improving both inference speed and accuracy over existing methods.

Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang

Published 2026-02-24

The Big Problem: The Over-Staffed Kitchen

Imagine a massive, high-end restaurant (the Multimodal Large Language Model or MLLM) that can cook anything: write poems, analyze X-rays, or describe a video. To handle this, the kitchen uses a "Mixture of Experts" (MoE) system.

Instead of one giant chef doing everything, the kitchen has hundreds of specialized chefs (called Experts).

  • One chef is great at chopping vegetables (Text).
  • One is great at plating desserts (Images).
  • One is great at reading recipes (Videos).

The Catch: In the current setup, for every single order (every word or pixel), the kitchen manager always calls in a fixed-size team of chefs, even if only two of them actually know what to do. This is incredibly slow, expensive, and wasteful. It's like calling in the entire string section to play a single note.
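In more concrete terms, a standard MoE router activates a fixed top-k of experts for every token, whether or not that many are actually useful. A minimal sketch (the function name and scores are illustrative, not from the paper):

```python
def route_top_k(router_scores, k=2):
    """Standard MoE routing: always activate exactly k experts per token,
    ranked by router score, even if one expert clearly dominates."""
    ranked = sorted(range(len(router_scores)),
                    key=router_scores.__getitem__, reverse=True)
    return ranked[:k]

# One expert scores far above the rest, yet two are still activated:
active = route_top_k([0.90, 0.05, 0.03, 0.02], k=2)
```

Expert skipping asks the obvious question: when the second expert's score is this marginal, why pay for it at all?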

The Failed Fix: The "One-Size-Fits-All" Rule

Scientists tried to fix this by creating a rule: "If a chef doesn't seem very interested in the order, send them home." This is called Expert Skipping.

However, when they applied this rule to the Multimodal kitchen, it backfired. Performance dropped because the rule was too blunt: it treated a "Text Order" and a "Video Order" exactly the same, and it treated the "Junior Chefs" (early layers) the same as the "Head Chefs" (deep layers).

The Solution: MoDES (The Smart Kitchen Manager)

The authors created MoDES (Multimodal Dynamic Expert Skipping). Think of MoDES as a super-smart, adaptive kitchen manager who makes two crucial observations:

1. Not All Chefs Are Created Equal (The "Layer" Insight)

  • The Insight: In the early stages of cooking (shallow layers), the chefs are doing the heavy lifting of prep work. If you skip them, the whole dish fails. In the later stages (deep layers), the chefs are just adding final garnishes. Skipping a few of them doesn't hurt much.
  • The Analogy: Imagine building a house. If you skip the foundation crew (shallow layers), the house collapses. If you skip a few people painting the trim (deep layers), the house still stands.
  • MoDES' Fix: It uses globally-modulated local gating. It knows that "Foundation Chefs" are more important than "Trim Chefs" and protects the important ones while letting the less critical ones go.
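A rough Python sketch of the idea: each expert's local routing score is modulated by a global, per-layer importance factor before being compared to a skipping threshold, so shallow ("foundation") layers retain more experts than deep ("trim") layers. Names and numbers here are illustrative, not the paper's actual formulation:

```python
def modulated_keep_mask(router_scores, layer_factor, threshold=0.25):
    """Keep an expert only if its globally-modulated score clears the threshold.

    router_scores: local routing weights for this token's activated experts
    layer_factor:  global importance of the current layer (shallow = large)
    """
    modulated = [s * layer_factor for s in router_scores]
    keep = [m >= threshold for m in modulated]
    # Safeguard: never skip the single highest-scoring expert.
    keep[max(range(len(router_scores)), key=router_scores.__getitem__)] = True
    return keep

shallow = modulated_keep_mask([0.5, 0.3, 0.2], layer_factor=1.0)  # keeps more
deep    = modulated_keep_mask([0.5, 0.3, 0.2], layer_factor=0.4)  # skips more
```

The same local scores thus survive in a shallow layer but get pruned in a deep one, which is exactly the "protect the foundation crew" behavior described above.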

2. Text and Images Need Different Rules (The "Modality" Insight)

  • The Insight: Text tokens and Image tokens behave differently. Text is like a delicate soufflé; it needs precise, careful handling by many experts. Images are like a big pot of soup; they are more robust, and many experts are actually just "redundant" (doing the same thing).
  • The Analogy: If you are editing a legal contract (Text), you need a strict, detailed review. If you are organizing a pile of rocks (Images), you can be a bit more casual.
  • MoDES' Fix: It uses Dual-Modality Thresholding. It has two different rules:
    • Rule A (Text): "Be strict. Only send home the chefs who are 90% sure they aren't needed."
    • Rule B (Images): "Be loose. Send home anyone who isn't 50% sure they are needed."
    • This allows the system to skip way more experts for images without ruining the result.
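Sketched in Python, the two rules are just two different skip thresholds keyed on token modality (the threshold values are made up for illustration, not taken from the paper):

```python
def keep_mask(router_scores, modality, tau_text=0.10, tau_image=0.30):
    """Dual-modality thresholding: a strict (low) skip threshold for text
    tokens, a loose (high) one for image tokens."""
    tau = tau_text if modality == "text" else tau_image
    return [s >= tau for s in router_scores]

scores = [0.35, 0.15, 0.05]
text_kept  = keep_mask(scores, "text")   # strict: most experts stay
image_kept = keep_mask(scores, "image")  # loose: more experts go home
```

Because image tokens tolerate the higher threshold, far more image-side experts can be sent home without ruining the result.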

The Secret Weapon: The "Frontier Search"

To figure out exactly how strict or loose these rules should be, the team had to find the "Goldilocks" zone: skip enough to be fast, but not so much that the food tastes bad.

Usually, finding this perfect balance takes days of trial and error. MoDES uses a Frontier Search Algorithm.

  • The Analogy: Imagine you are looking for the perfect temperature for a shower. A normal person turns the hot and cold knobs randomly until they find it (takes forever). MoDES is like a robot that knows the water gets hotter as you turn the knob one way and colder the other. It slides along the "edge" of the perfect zone instantly.
  • The Result: What used to take 2 days to calculate now takes 2 hours.
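Under the hood, the trick is monotonicity: raising the skip threshold only ever skips more experts, so quality can only go down, never up. That turns the tuning problem into a slide along a frontier instead of a blind grid scan. A simplified one-dimensional sketch (the paper searches jointly over the text and image thresholds; `quality_fn` here is a hypothetical stand-in for an evaluation run):

```python
def frontier_search(quality_fn, floor, lo=0.0, hi=1.0, iters=20):
    """Find the largest skip threshold whose quality stays above `floor`,
    exploiting the fact that quality decreases monotonically in the threshold."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if quality_fn(mid) >= floor:
            lo = mid  # still accurate enough: try skipping even more
        else:
            hi = mid  # too aggressive: back off
    return lo

# Toy quality curve: each unit of threshold costs half a unit of quality.
best = frontier_search(lambda t: 1.0 - 0.5 * t, floor=0.8)
```

With monotonicity, every evaluation run halves the search interval, which is how a days-long sweep can shrink to hours.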

The Results: Fast, Cheap, and Delicious

When they tested MoDES on real-world models (like Qwen3-VL and Kimi-VL):

  • Speed: It made the models 2x faster at reading inputs (prefilling) and 1.2x faster at writing answers (decoding).
  • Efficiency: It successfully skipped 88% of the experts (sending 88% of the chefs home!) while actually improving the accuracy in some cases.
  • Why? Because by removing the "distracted" or "redundant" chefs, the remaining experts could focus better, and the computer didn't waste energy on unnecessary work.

Summary

MoDES is a smart system that stops Multimodal AI models from wasting energy. Instead of treating every task the same, it realizes that:

  1. Early steps matter more than late steps.
  2. Images are easier to skip than text.
  3. Finding the right balance can be done instantly.

It's like upgrading from a chaotic kitchen where everyone shouts at once, to a streamlined, high-speed kitchen where only the right chefs show up for the right job, making the food faster and tastier.
