🎨 The Big Picture: The "Art Studio" Problem
Imagine you are running a massive, high-tech Art Studio (this is your AI model, specifically a Diffusion Transformer). Your goal is to paint millions of beautiful pictures.
In the past, every single artist in the studio had to look at every single brushstroke on every single painting. Even if a painting was just a simple blue sky, the artist who specializes in complex faces still had to stare at the sky. This is called a "Dense" model. It works, but it's incredibly slow and wasteful because everyone is doing everything.
To fix this, researchers tried using a "Mixture of Experts" (MoE) system. Think of this as hiring a team of specialists:
- Expert A only paints clouds.
- Expert B only paints faces.
- Expert C only paints trees.
A Router (a manager) stands at the door and decides which artist works on which part of the painting. This is great! It makes the studio faster and allows you to hire way more artists without slowing down.
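The router-and-specialists idea above can be sketched in a few lines. This is a minimal top-1 MoE layer, not the paper's actual code: the gate weights, toy "experts," and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_moe_layer(tokens, gate_w, experts):
    """Minimal top-1 MoE layer: a linear gate scores each token,
    and each token is processed only by its highest-scoring expert."""
    logits = tokens @ gate_w                     # (n_tokens, n_experts)
    choice = logits.argmax(axis=1)               # winning expert per token
    out = np.empty_like(tokens)
    for e, expert_fn in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert_fn(tokens[mask])  # only the chosen expert runs
    return out, choice

# Toy setup: 6 tokens of dim 4, two "experts" (simple element-wise maps).
tokens = rng.normal(size=(6, 4))
gate_w = rng.normal(size=(4, 2))
experts = [lambda x: x * 2.0, lambda x: x + 1.0]
out, choice = top1_moe_layer(tokens, gate_w, experts)
```

Note how each token's compute cost is one expert, no matter how many experts exist — that is why MoE lets you "hire more artists" without slowing the studio down.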
The Problem: When researchers tried this "Specialist Studio" for images (like the Diffusion Transformers in this paper), it didn't work as well as it did for text (like Chatbots). The images still looked blurry, and the specialists weren't actually specializing. They were all painting the same generic stuff.
Why? The paper argues that images are different from words.
- Words are unique: The word "Dog" is very different from "Cat." They are distinct.
- Image patches are repetitive: A patch of blue sky looks a lot like the patch next to it. They are redundant.
- Confusing Roles: In image generation, the AI routinely paints both without a prompt (unconditional) and with a prompt (conditional) — modern diffusion models run both passes as part of classifier-free guidance. The old managers didn't know the difference and treated these two very different jobs the same.
🚀 The Solution: ProMoE (The Smart Manager)
The authors introduce ProMoE, a new way to manage the artist studio. Instead of letting the router work out its assignments implicitly (which, as we saw, collapses into everyone painting the same generic stuff), they give it Explicit Guidance (a clear rulebook).
The router now works in Two Steps:
Step 1: The "Job Type" Check (Conditional Routing)
Before looking at the content of the painting, the manager first asks: "Is this a 'No-Instruction' task or a 'Specific-Instruction' task?"
- Unconditional Tokens: If the AI is painting a generic background (no specific prompt), these tokens go to a Special Unconditional Team.
- Conditional Tokens: If the AI is painting a specific request (e.g., "A red cat"), these tokens go to the Main Specialist Team.
Analogy: Imagine a hospital. The manager first separates patients into "Emergency Room" (urgent, specific) and "General Check-up" (routine). You don't send a routine check-up patient to the trauma surgeon. This prevents the experts from getting confused.
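Step 1 can be sketched as a simple pre-routing split. This is a hedged illustration, not the paper's implementation: the flag name, the dedicated unconditional expert, and the placeholder functions are all assumptions.

```python
import numpy as np

def route_by_condition(tokens, is_conditional, uncond_expert, cond_router):
    """Step 1 sketch: unconditional tokens bypass the main router and go to
    a dedicated expert team; conditional tokens continue to content routing."""
    out = np.empty_like(tokens)
    out[~is_conditional] = uncond_expert(tokens[~is_conditional])
    out[is_conditional] = cond_router(tokens[is_conditional])
    return out

# Toy data: 4 tokens of dim 2, alternating conditional / unconditional.
tokens = np.arange(8.0).reshape(4, 2)
is_conditional = np.array([True, False, True, False])
out = route_by_condition(
    tokens, is_conditional,
    uncond_expert=lambda x: x * 0.0,  # placeholder "Unconditional Team"
    cond_router=lambda x: x + 1.0,    # placeholder "Main Specialist Team"
)
```

The key design point is that the split happens *before* any content is inspected — the "job type" check is cheap and deterministic.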
Step 2: The "Content Match" Check (Prototypical Routing)
Now, for the specific requests (the "Conditional" tokens), the manager needs to decide which specialist handles the "Red Cat."
Instead of guessing, the manager uses Prototypes (like "Sample Cards").
- There is a card for "Animals," a card for "Architecture," a card for "Food."
- The manager compares the incoming token (the "Red Cat") to these cards.
- If the token looks like the "Animals" card, it gets sent to the Animal Expert.
Analogy: Think of a library. Instead of asking a librarian to guess which shelf a book belongs to, you have a set of "Sample Covers" on the desk. You match the book's cover to the sample cover, and boom—it goes to the right section.
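The "Sample Cards" idea maps directly to prototype vectors: each expert owns one, and a token goes to the expert whose prototype it most resembles. A minimal sketch, assuming cosine similarity as the matching rule (the paper's exact normalization may differ):

```python
import numpy as np

def prototype_route(tokens, prototypes):
    """Step 2 sketch: each expert owns a prototype vector; each token is
    routed to the expert whose prototype it is most similar to (cosine)."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = t @ p.T              # cosine similarity, (n_tokens, n_experts)
    return sims.argmax(axis=1)  # index of the best-matching "sample card"

prototypes = np.array([[1.0, 0.0],   # expert 0's "card"
                       [0.0, 1.0]])  # expert 1's "card"
tokens = np.array([[0.9, 0.1],       # looks like card 0
                   [0.2, 0.8]])      # looks like card 1
assignments = prototype_route(tokens, prototypes)
```

Because the prototypes are explicit vectors rather than an opaque learned gate, the routing decision is interpretable: you can inspect each card and see what kind of content its expert attracts.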
🏆 The Secret Sauce: The "Team Building" Workout (Contrastive Loss)
Even with the two steps above, the experts might still get lazy and start painting the same things. To fix this, the authors add a special training exercise called Routing Contrastive Loss.
Analogy: Imagine the manager forces the experts to play a game:
- "You, Animal Expert, you must paint only animals. If you accidentally paint a car, you get a penalty."
- "You, Architecture Expert, you must paint only buildings. If you paint a dog, you get a penalty."
- "Also, make sure your style is totally different from the Animal Expert's style."
This forces the experts to become truly distinct. They stop overlapping and start mastering their specific niches. This creates a studio where every artist is a true master of their craft.
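The penalty game above is a contrastive objective: pull each token toward its assigned expert's prototype, push it away from everyone else's. Here is an InfoNCE-style sketch of that idea — the paper's exact loss formulation and temperature are assumptions for illustration.

```python
import numpy as np

def routing_contrastive_loss(tokens, prototypes, assignments, temp=0.1):
    """Sketch of a routing contrastive loss: each token is attracted to its
    assigned prototype (the positive) and repelled from all other prototypes
    (the negatives), via a softmax over scaled cosine similarities."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = (t @ p.T) / temp                          # scaled similarities
    logsumexp = np.log(np.exp(sims).sum(axis=1))     # over all prototypes
    pos = sims[np.arange(len(tokens)), assignments]  # own prototype only
    return float(np.mean(logsumexp - pos))           # -log softmax(positive)

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
tokens = np.array([[0.9, 0.1], [0.1, 0.9]])          # well-separated tokens
loss = routing_contrastive_loss(tokens, prototypes, np.array([0, 1]))
```

When tokens already sit near their own prototype and far from the others, as in this toy example, the loss is near zero; overlapping experts drive it up, which is exactly the "penalty" in the analogy.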
📈 The Results: Why It Matters
The paper tested this new "Smart Studio" on a huge dataset of images (ImageNet).
- Better Quality: The pictures generated were sharper and more accurate (a lower FID — Fréchet Inception Distance — score).
- Faster & Cheaper: They achieved these results using fewer active parameters than the old "Dense" models. It's like getting a Ferrari's speed with a Toyota's fuel consumption.
- Scalable: As they made the studio bigger (more experts), the quality kept getting better, proving this method works for massive AI models.
🧠 Summary in One Sentence
ProMoE fixes the "confused specialist" problem in AI image generators by giving the router a clear two-step rulebook (separating routine tasks from specific ones) and a strict training regimen to ensure every expert truly specializes in their own unique style.