Imagine you have a massive, super-smart team of specialists (like a giant library of experts) working together to answer questions about pictures and videos. This team is called a Vision-Language Model (VLM).
In the past, to keep this team fast and efficient, the manager (the computer) used a very strict rule: "For every single question, pick the 2 experts with the highest scores and ignore everyone else." This is called Top-K Routing.
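In code, the Top-K rule is just "sort the experts by score and keep the best k." Here is a minimal sketch (the scores and the number of experts are illustrative, not from the paper):

```python
import math

def top_k_routing(router_scores, k=2):
    """Pick the k experts with the highest router scores."""
    # indices of the experts, sorted by score, highest first
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    # softmax over the chosen scores gives each expert's mixing weight
    exps = [math.exp(router_scores[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top, weights

scores = [0.1, 2.3, 0.4, 1.8]  # one router score per expert
experts, weights = top_k_routing(scores)
```

Note that this rule is deterministic: given the same scores, the same two experts win every single time, which is exactly the "rut" described below.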
The Problem: The "Boring Manager"
The problem with this strict rule is that the manager is a bit lazy and predictable.
- Overfitting: The manager always picks the same two "favorite" experts for every type of question. If the question is about a dog, the manager always picks "Expert Dog" and "Expert Animal," even when "Expert Art" or "Expert Logic" would give a better answer.
- Missed Opportunities: Because the manager never tries different combinations, the team never learns that sometimes a mix of experts works better. The team gets stuck in a rut, over-relying on a few people while the rest of the team sits idle.
The Solution: MoE-GRPO (The "Smart Coach")
The authors of this paper introduced a new way to manage the team called MoE-GRPO. Instead of a strict rule, they use a Reinforcement Learning coach (like a sports coach who learns by trial and error).
Here is how it works, using a simple analogy:
1. The "Rollout" Game (Trial and Error)
Imagine the manager has to answer a question: "Why is the sky blue?"
- Old Way (Top-K): The manager immediately picks the two experts they think are best and writes the answer. If it's wrong, they just try again with the same two experts.
- New Way (MoE-GRPO): The manager runs 8 different simulations (called "rollouts") at once.
- Simulation 1: Picks "Expert Physics" and "Expert Art."
- Simulation 2: Picks "Expert Atmosphere" and "Expert History."
- Simulation 3: Picks "Expert Light" and "Expert Math."
- ...and so on.
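The sampling step above can be sketched in a few lines: instead of always taking the top two, each rollout draws experts at random, weighted by the router's scores (a toy sketch; the scores are made up, and real systems sample from the router's probability distribution):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_rollouts(router_scores, n_rollouts=8, k=2):
    """Draw n_rollouts different expert combinations, weighted by score."""
    combos = []
    for _ in range(n_rollouts):
        # pick k distinct experts; higher-scoring experts win more often
        combo = set()
        while len(combo) < k:
            combo.update(random.choices(range(len(router_scores)),
                                        weights=router_scores, k=1))
        combos.append(sorted(combo))
    return combos

scores = [2.0, 1.5, 0.3, 0.1]  # router scores (non-negative here)
combos = sample_rollouts(scores)
```

Because the draw is stochastic, the 8 rollouts usually contain several different expert pairs, which is what gives the coach something to compare.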
2. The Reward System (The Scoreboard)
After all 8 simulations, the coach checks the answers.
- If Simulation 1 gets the answer right, the coach gives it a high score (Reward).
- If Simulation 2 gets it wrong, it gets a low score.
3. Learning the Best Team (Policy Optimization)
The coach doesn't just pick the winner; they analyze why the winner won.
- "Hey, Simulation 1 worked because we picked Physics and Art together!"
- "Simulation 2 failed because History isn't useful for this question."
- The coach then updates the manager's brain: "Next time, prioritize Physics and Art for sky questions."
Over time, the manager stops guessing and learns the perfect combination of experts for every single type of question.
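The scoreboard-plus-learning step is the "group-relative" part of GRPO: each rollout's reward is compared to the average of its own group, so combinations that beat their siblings get a positive learning signal. A simplified sketch of the advantage computation (not the full training loop):

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: score each rollout against its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # above-average rollouts get positive advantage, below-average negative
    return [(r - mean) / (std + 1e-8) for r in rewards]

# e.g. rollouts 1 and 5 answered correctly (reward 1), the rest did not
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
adv = group_relative_advantages(rewards)
```

The positive-advantage rollouts then nudge the router toward the expert combinations they used; the negative ones nudge it away.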
The Secret Sauce: "Modality-Aware Guidance"
There was one risk: What if the manager tries to pick "Expert Cooking" to answer a question about a video of a car crash? That's a waste of time.
To fix this, the authors added a Modality-Aware Guide.
- Think of this as a traffic cop at the entrance of the expert room.
- If the input is a picture, the traffic cop blocks the "Text-Only" experts from entering the room, telling them, "Not your job today!"
- If the input is text, the traffic cop blocks the "Visual" experts.
- This stops the manager from wasting time trying random combinations that make no sense, making the learning process much faster and more stable.
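The traffic cop maps to a simple mask applied to the router's scores before any expert is chosen. A hypothetical sketch (the expert-to-modality assignment here is invented purely for illustration):

```python
# hypothetical tags: which modality each expert specializes in
EXPERT_MODALITY = ["vision", "vision", "text", "text"]

def mask_router_scores(router_scores, input_modality):
    """Block experts whose specialty doesn't match the input modality."""
    # -inf score means the expert can never be selected
    return [s if tag == input_modality else float("-inf")
            for s, tag in zip(router_scores, EXPERT_MODALITY)]

scores = [1.2, 0.7, 2.5, 0.3]
masked = mask_router_scores(scores, "vision")
```

After masking, both the deterministic Top-K rule and the stochastic rollout sampling can only ever land on experts that make sense for the input, which shrinks the search space the coach has to explore.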
The Results: Why It Matters
When they tested this new system:
- Better Answers: The model got more questions right than the old "Top-K" system.
- Less Boredom: The experts were used more evenly. Instead of the same two experts doing all the work, different experts got to shine depending on the task.
- Generalization: The model became smarter at handling new types of questions it had never seen before, because it learned how to mix and match experts flexibly rather than just memorizing a fixed list.
In a Nutshell
MoE-GRPO is like upgrading a team manager from a rigid rule-follower to a strategic coach. Instead of blindly picking the same two people for every job, the coach tries many different team combinations, sees which ones win, and learns the perfect lineup for every specific challenge. This makes the AI smarter, more efficient, and less likely to get stuck in a rut.