Original authors: Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan Minh Nguyen, Toan Tran

Published 2026-06-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you run a massive, high-tech call center. You have thousands of incoming calls (tokens) and a team of 100 specialized agents (experts). Your goal is to route every call to the best agent to solve the problem quickly.

In a standard setup, you use a simple rule: "Send the call to the agent who seems most qualified right now." This is like a Softmax router. The problem? The same few "super-agents" get all the calls, while the other 90 agents sit idle. This is called routing collapse. The call center becomes inefficient because you aren't using your full team.

To fix this, previous methods tried to force balance. They added a "manager" who constantly nagged the system, saying, "Hey, Agent 5 hasn't had a call in an hour, send one to them!" or "Agent 1 is too busy, stop sending calls!" These are the auxiliary losses mentioned in the paper. While they help, they are clunky, add extra work for the computer, and sometimes confuse the system about what it's actually trying to learn.

The New Idea: The "Perfectly Balanced" Assignment

The authors of this paper propose a smarter way to assign calls, using a mathematical concept called Optimal Transport (specifically the Sinkhorn algorithm).

Think of this not as a manager nagging agents, but as a perfectly choreographed dance.

The Goal: Every agent must get exactly the same number of calls over time, and every caller must be matched to an agent who is good at their job.
The Method: Instead of just picking the "best" agent for each call, the system calculates a global map. It looks at all calls and all agents at once and figures out the most efficient way to distribute the work so that no one is overloaded and no one is bored.

The Problem with the "Perfect Dance"

There's a catch. If you force this perfect balance on every single call as it comes in, the system gets confused. It might send a call about "coding" to an agent who is great at "cooking" just to keep the numbers even. This hurts performance.

The paper's breakthrough is Selective Sinkhorn Routing (SSR).

How SSR Works: The "Hybrid" Strategy

Instead of using the complex "perfect dance" for every single call, SSR uses a clever mix:

Most of the time (99%+): It uses the standard, fast method (Softmax) to route calls. This lets the system learn what agents are actually good at.
Rarely (around 0.1% of the time or less in the settings the paper highlights): It pauses and runs the "perfect dance" (Sinkhorn algorithm).
- Why? This tiny bit of "perfect balancing" acts like a gentle nudge. It reminds the system, "Don't forget the other agents!" without forcing a bad match on every single call.
- The Result: The system learns to balance itself naturally, without needing a nagging manager (auxiliary loss) or a massive amount of extra computing power.

The Secret Sauce: Adding a Little Noise

The paper also suggests adding a tiny bit of random noise (like static on a radio) to the decision-making process during training.

Analogy: Imagine the agents are slightly drunk or the phone lines are a bit fuzzy. The system can't be 100% sure who is the "best" agent, so it tries a few different people.
Benefit: This prevents the system from getting stuck in a rut where it always picks the same top 3 agents. It forces the system to explore and discover that other agents are actually quite good too.
Important Note: The paper says you turn this noise off when the system is actually working (inference). You don't want your call center to be random when a customer is waiting; you want it to be fast and deterministic.

What They Found

The authors tested this on two main tasks:

Language Modeling (Writing): They tested it on datasets like WikiText-103.
- Result: Their method (SSR) wrote better text (lower "perplexity," which is a score for how confused the AI is) than previous methods.
- Speed: Because the complex math runs only a tiny fraction of the time, SSR added well under 1% to training time on WikiText-103, whereas running Sinkhorn on every step added about 72%. It also avoids the "nagging" losses entirely.
Image Classification (Vision): They tested it on ImageNet (recognizing pictures).
- Result: It recognized images more accurately and was better at handling weird, corrupted, or "adversarial" images (images designed to trick AI).

The Bottom Line

The paper claims that Selective Sinkhorn Routing is a lightweight, efficient way to fix the "routing collapse" problem in Sparse Mixture of Experts (SMoE) models.

Old Way: Use heavy, complex math or nagging penalties to force balance. (Slow, sometimes unstable).
New Way (SSR): Use the complex math only rarely to guide the system, and add a little randomness to keep things interesting during training.
Outcome: You get a smarter, more balanced AI that trains faster and works better, without the extra baggage.

Crucially, the paper emphasizes that the "perfect balancing" and "noise" are only for training. When the model is actually being used in the real world, it switches back to a standard, fast, deterministic mode. This ensures the final product is both high-quality and efficient.

Technical Summary: Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

Problem Statement

Sparse Mixture-of-Experts (SMoE) models offer a scalable approach to increasing model capacity while maintaining computational efficiency by activating only a subset of experts per token. However, conventional SMoE models relying on Softmax-based routing often suffer from "routing collapse," where a small subset of experts is over-utilized while others remain underused.

Existing mitigation strategies typically rely on:

Auxiliary Objectives: Load-balancing losses or z-losses to penalize uneven expert usage. These can introduce objective misalignment, training instability, and increased complexity.
Noisy Gating: Injecting noise to encourage exploration, which adds trainable parameters or hyperparameters.
Sinkhorn-based Routing: Prior methods using the Sinkhorn algorithm for optimal transport (OT) often use the transport map solely for expert selection (top-k) rather than weight assignment. Furthermore, they may reduce routing flexibility because the gating matrix is not directly optimized through gradient-based learning in the same way as Softmax methods, or they incur substantial training overhead due to the iterative nature of Sinkhorn updates.
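To make the "auxiliary objective" idea concrete, here is a minimal sketch of one widely used load-balancing loss for top-1 routing, in the form popularized by the Switch Transformer. It is an illustration only: the baselines in the paper may use a different form, and the function name and the `alpha` coefficient are assumptions introduced here.

```python
import numpy as np

def load_balancing_loss(probs, top1_idx, alpha=0.01):
    """alpha * n * sum_i f_i * P_i for top-1 routing (Switch-Transformer-style form).

    probs:    (m, n) router probabilities for m tokens and n experts
    top1_idx: (m,) index of the expert each token was dispatched to
    alpha:    loss coefficient (hypothetical default, not from the paper)
    """
    m, n = probs.shape
    f = np.bincount(top1_idx, minlength=n) / m  # fraction of tokens per expert
    P = probs.mean(axis=0)                      # mean router probability per expert
    return alpha * n * float(f @ P)

n = 4
# Perfectly balanced: uniform probabilities, one token per expert.
balanced = load_balancing_loss(np.full((4, n), 1 / n), np.array([0, 1, 2, 3]), alpha=1.0)

# Fully collapsed: every token goes to expert 0 with probability 1.
collapsed_probs = np.zeros((4, n))
collapsed_probs[:, 0] = 1.0
collapsed = load_balancing_loss(collapsed_probs, np.zeros(4, dtype=int), alpha=1.0)

print(balanced, collapsed)  # 1.0 when balanced, 4.0 (= n) when collapsed
```

The penalty is added to the task loss, which is the source of the objective misalignment described above: the router is optimized for balance as well as for prediction quality.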

Methodology

The paper proposes Selective Sinkhorn Routing (SSR), a lightweight mechanism that reformulates token-to-expert assignment as an entropy-regularized optimal transport problem.

Core Formulation

The authors model the assignment of $m$ tokens to $n$ experts as a maximum-cost optimal transport problem with entropic regularization. The goal is to find a transport plan $\hat{\Pi}$ that maximizes compatibility while satisfying specific constraints:

Positivity: $\Pi > 0$ (elementwise).
Row Sum: Each token routes its entire mass to experts ($\Pi \mathbf{1}_n = \mathbf{1}_m$).
Column Sum: Each expert receives an equal expected total load ($\Pi^\top \mathbf{1}_m = \frac{m}{n} \mathbf{1}_n$).

The cost matrix $C$ is derived from the gating scores $S$. The authors propose two variants:

Linear Cost: $C = S$.
Softmax Cost: $C_{i,:} = \text{softmax}(S_{i,:})$, used to prevent numerical overflow during Sinkhorn iterations.
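Written out, the problem described above takes the standard entropy-regularized form below. The regularization strength $\lambda > 0$ and the entropy $H$ are notation introduced here for illustration; the paper may use different symbols or scaling.

$$
\hat{\Pi} = \arg\max_{\Pi \in \mathcal{U}} \; \langle \Pi, C \rangle + \lambda H(\Pi),
\qquad
\mathcal{U} = \left\{ \Pi > 0 \;:\; \Pi \mathbf{1}_n = \mathbf{1}_m,\; \Pi^\top \mathbf{1}_m = \tfrac{m}{n} \mathbf{1}_n \right\},
$$

with $H(\Pi) = -\sum_{i,j} \Pi_{ij} \log \Pi_{ij}$. Setting the gradient of the Lagrangian to zero gives a solution of the form

$$
\hat{\Pi} = \mathrm{diag}(u) \, \exp(C / \lambda) \, \mathrm{diag}(v),
$$

where the positive vectors $u$ and $v$ are found by alternately rescaling rows and columns (the Sinkhorn iterations). Because the softmax cost keeps every $C_{ij}$ in $[0, 1]$, the entries of $\exp(C / \lambda)$ are bounded by $\exp(1 / \lambda)$, which is consistent with the overflow motivation given for that variant.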

Unlike prior OT-based methods that use the transport plan only for selection, SSR directly uses the values in $\hat{\Pi}$ to compute routing weights. For each token, the top-$k$ experts are selected based on the highest entries in the corresponding row of $\hat{\Pi}$, and weights are normalized from these entries.
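The formulation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the regularization strength `lam`, the iteration count, and the choice to normalize the top-$k$ entries by their sum are all assumptions made here.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def sinkhorn_plan(C, lam=1.0, n_iters=50):
    """Rescale exp(C / lam) so rows sum to 1 and columns sum to m / n.

    lam and n_iters are hypothetical defaults, not values from the paper.
    """
    m, n = C.shape
    K = np.exp((C - C.max()) / lam)   # a constant shift is absorbed by u and v
    row_target = np.ones(m)           # each token routes a total mass of 1
    col_target = np.full(n, m / n)    # each expert receives an equal load
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):
        v = col_target / (K.T @ u)    # match column sums
        u = row_target / (K @ v)      # match row sums
    return u[:, None] * K * v[None, :]

def topk_weights_from_plan(plan, k):
    """Pick the k largest plan entries per token and renormalize them to sum to 1."""
    idx = np.argsort(-plan, axis=1)[:, :k]
    top = np.take_along_axis(plan, idx, axis=1)
    return idx, top / top.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 4))             # gating scores: 8 tokens, 4 experts
plan = sinkhorn_plan(softmax_rows(S))   # softmax-cost variant
idx, w = topk_weights_from_plan(plan, k=2)

print(plan.sum(axis=1))  # row sums: one per token
print(plan.sum(axis=0))  # column sums: close to m / n = 2 per expert
```

Passing `S` directly instead of `softmax_rows(S)` gives the linear-cost variant; the only difference is the matrix that enters the exponential.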

Selective Strategy and Noise Injection

A key challenge identified is that Sinkhorn routing does not update the gating weight matrix $W_g$ because $W_g$ is decoupled from the OT objective. To address this, SSR employs a hybrid training strategy:

Selective Application: During training, each MoE block uses Sinkhorn routing with a small probability $p$ (e.g., 0.001) and standard Softmax gating otherwise. This allows Softmax gating to update $W_g$ effectively while Sinkhorn routing provides balanced expert utilization signals.
Noise Injection: To prevent expert collapse and encourage exploration, Gaussian noise is added to the cost matrix ($\tilde{C} = C + \alpha_{\text{noise}} \cdot \epsilon$, where $\epsilon$ is the Gaussian noise and $\alpha_{\text{noise}}$ scales it). Theoretical analysis (Proposition 4.3) shows this ensures every expert has a non-zero probability of selection.
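A hedged sketch of how the two pieces might fit together in one training-time routing call is shown below. Several details are assumptions made for illustration rather than facts from the paper: the function names, the softmax-cost choice, the parameters `lam` and `n_iters`, the use of standard normal noise, and in particular the decision to apply noise only on the Sinkhorn branch.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def sinkhorn_plan(C, lam=1.0, n_iters=20):
    """Rescale exp(C / lam) so rows sum to 1 and columns sum to m / n."""
    m, n = C.shape
    K = np.exp((C - C.max()) / lam)
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):
        v = (m / n) / (K.T @ u)
        u = 1.0 / (K @ v)
    return u[:, None] * K * v[None, :]

def route_training_step(S, k, p, alpha_noise, rng):
    """Return (expert indices, weights, used_sinkhorn) for one MoE block.

    With probability p, route with a Sinkhorn plan built from a noisy cost;
    otherwise fall back to standard softmax gating.
    """
    if rng.random() < p:
        C = softmax_rows(S)                                    # softmax-cost variant
        C_noisy = C + alpha_noise * rng.standard_normal(C.shape)
        scores = sinkhorn_plan(C_noisy)
        used_sinkhorn = True
    else:
        scores = softmax_rows(S)
        used_sinkhorn = False
    idx = np.argsort(-scores, axis=1)[:, :k]                   # top-k experts per token
    top = np.take_along_axis(scores, idx, axis=1)
    return idx, top / top.sum(axis=1, keepdims=True), used_sinkhorn

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 3))  # 6 tokens, 3 experts
idx, w, used = route_training_step(S, k=2, p=0.001, alpha_noise=0.1, rng=rng)
```

In a real model the softmax branch is the one through which gradients reach $W_g$; this NumPy sketch only shows the control flow, not the backward pass.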

Inference Behavior

The paper argues that enforcing balanced constraints (like equal column sums) during inference on a single input is counterproductive, as it forces uniform routing regardless of the input's specific compatibility scores. Therefore, both Sinkhorn routing and noise injection are disabled during inference, reverting to deterministic Softmax routing to ensure consistent, input-dependent predictions.
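The extreme case makes the argument easy to see: with a single token ($m = 1$), a row sum of 1 and column sums of $m/n = 1/n$ leave exactly one feasible plan, the uniform one, whatever the scores are. The sketch below contrasts that with deterministic softmax top-$k$ routing; the function names are hypothetical and the code is an illustration, not the paper's implementation.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def route_inference(S, k):
    """Deterministic softmax top-k routing: no Sinkhorn step, no noise."""
    probs = softmax_rows(S)
    idx = np.argsort(-probs, axis=1)[:, :k]
    top = np.take_along_axis(probs, idx, axis=1)
    return idx, top / top.sum(axis=1, keepdims=True)

def only_balanced_plan_for_one_token(n):
    """For m = 1, row sum 1 and column sums 1/n force every entry to 1/n."""
    return np.full((1, n), 1.0 / n)

S = np.array([[0.1, 2.0, -1.0, 0.5]])      # one token, four experts
idx, w = route_inference(S, k=2)
print(idx)                                  # experts 1 and 3, the two highest scores
print(only_balanced_plan_for_one_token(4))  # uniform, independent of S
```

Softmax routing follows the scores, while the balanced plan ignores them, which is why the balancing machinery is kept for training only.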

Key Contributions

Novel Routing Framework: A method integrating entropy-regularized optimal transport with stochastic noise injection to promote balanced expert utilization without auxiliary balancing losses.
Theoretical Insights: Proofs demonstrating that Sinkhorn-based routing and noise injection aid training by encouraging exploration and balancing, but should be disabled at inference to avoid distorted assignments.
Efficiency and Performance: Extensive evaluations showing that SSR improves training efficiency, accuracy, and robustness to input corruption compared to state-of-the-art baselines.

Experimental Results

The authors evaluated SSR on language modeling (WikiText-103, Enwik-8) and image classification (ImageNet-1K, ImageNet-A, ImageNet-O, ImageNet-R).

Language Modeling:
- On WikiText-103, SSR variants consistently outperformed Vanilla SMoE and baselines using load-balancing loss, z-loss, or noise.
- SSR-L w/ noise achieved a test perplexity (PPL) of 34.367, a reduction of 1.183 compared to Vanilla SMoE (35.550), outperforming the next best baseline by a significant margin.
- Crucially, SSR incurred minimal training overhead (0.33% – 0.65%), whereas full Sinkhorn-based SMoE incurred 72.47% overhead.
- Under Momentum settings, SSR preserved the characteristic norm evolution of Vanilla SMoE while improving performance, whereas other balancing methods degraded performance.
Byte-level Modeling (Enwik-8):
- SSR-S w/ noise achieved the best test Bits-per-Character (BPC) of 1.128, improving Vanilla SMoE by 0.010 BPC.
- It was 2.28x faster than full Sinkhorn-based SMoE.
Vision Tasks:
- On ImageNet-1K, SSR-L w/ noise achieved a Top-1 accuracy of 77.420%, a gain of 2.368 percentage points over Vanilla SMoE.
- It also showed strong robustness on ImageNet-O and ImageNet-R.
Ablation Studies:
- The probability $p$ of applying Sinkhorn routing is critical; very small values (e.g., $10^{-3}$ or $10^{-4}$) were sufficient to provide balancing signals without over-constraining the router.
- Noise injection consistently improved performance across settings.
- Disabling balancing mechanisms at inference yielded the best results, confirming the theoretical claims.

Significance and Claims

The paper claims that Selective Sinkhorn Routing offers a practical and effective solution for SMoE design. By replacing complex auxiliary losses with efficient, intermittent Sinkhorn-based optimization, SSR achieves:

Improved Training Stability: Through built-in expert balancing without objective misalignment.
Higher Accuracy: Demonstrated across text and vision benchmarks.
Robustness: Enhanced performance against input corruption and adversarial examples.
Efficiency: Minimal computational overhead compared to both auxiliary-loss methods and full Sinkhorn routing.

The authors emphasize that the method is lightweight, requires no additional trainable parameters (in the noise variant), and maintains the flexibility of expert selection while ensuring balanced utilization during the training phase.

Selective Sinkhorn Routing for Improved Sparse Mixture of Experts