Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

This paper introduces Semantic Parallelism, a novel model-data co-scheduling paradigm implemented in the Sem-MoE framework that significantly enhances MoE inference efficiency by proactively collocating experts and their activating tokens on the same devices to minimize costly all-to-all communication.

Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng

Published 2026-03-02

Imagine you run a massive, high-end restaurant called The MoE Kitchen. This kitchen is famous for its "Mixture of Experts" cooking style. Instead of one giant chef trying to cook every dish, you have hundreds of specialized chefs (the Experts). One chef is a master of spicy curries, another is a genius at baking, and a third is perfect for sushi.

When a customer orders a dish (a Token), a head waiter (the Router) quickly decides which specialized chef is best suited for that order.

The Problem: The "All-to-All" Chaos

In the old way of running this kitchen (called Expert Parallelism), the chefs are scattered across different islands in the ocean (different GPUs).

  • If the sushi chef is on Island A, but the customer's order for sushi arrives at the kitchen on Island B, the waiter has to shout the order across the ocean to Island A.
  • Then, Island A cooks the sushi and shouts the finished dish back to Island B to serve the customer.

If you have 8 islands and 1,000 orders coming in, the waiters spend more time shouting back and forth across the ocean (Communication) than actually cooking (Computation). This is the "All-to-All" bottleneck. The paper notes that in some cases, nearly 60% of the time is wasted just shouting orders across the water!
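To see why so much time goes to "shouting," here is a toy simulation (not from the paper) of the islands-and-orders setup: experts are spread round-robin across GPUs, routing is uniform, and we count how often a token's chosen expert lives on a different GPU than the token itself. All numbers are illustrative.

```python
import random

# Toy model of the all-to-all bottleneck. Placement and routing here are
# made up for illustration; only the shape of the problem matches the paper.
random.seed(0)
NUM_GPUS, EXPERTS_PER_GPU, NUM_TOKENS = 8, 16, 1000

def expert_home(expert_id):
    # Round-robin placement: expert e lives on GPU e % NUM_GPUS.
    return expert_id % NUM_GPUS

remote = 0
for _ in range(NUM_TOKENS):
    src_gpu = random.randrange(NUM_GPUS)                    # GPU the token arrives on
    expert = random.randrange(NUM_GPUS * EXPERTS_PER_GPU)   # router's pick
    if expert_home(expert) != src_gpu:
        remote += 1                                         # cross-GPU hop needed

print(f"{remote / NUM_TOKENS:.0%} of tokens need a cross-GPU hop")
```

With 8 GPUs and uniform routing, roughly 7 out of 8 tokens land on a remote expert, which is exactly why the naive layout spends so much time on communication.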

The Solution: Semantic Parallelism (The "Smart Seating" System)

The authors of this paper propose a new system called Semantic Parallelism, implemented in a framework they built called Sem-MoE.

Instead of treating the chefs and the orders as separate problems, they treat them as a team that needs to be co-scheduled. They realized something amazing: Certain words (tokens) always want to talk to certain chefs, no matter the context.

  • If the word "math" appears, it almost always wants the "Math Chef."
  • If the word "law" appears, it almost always wants the "Law Chef."

They call this Token-Expert Affinity. It's like knowing that "Pizza" always goes to the Italian chef, regardless of whether the customer is ordering it for lunch or dinner.
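One simple way to picture measuring this affinity (a sketch, not the paper's actual method; the routing log below is invented) is to replay past routing decisions and ask, for each token, what fraction of its activations went to its single favorite expert:

```python
from collections import Counter

# Hypothetical routing log of (token, expert-chosen-by-router) pairs.
# In the paper's observation, the same token tends to pick the same expert
# across many different contexts.
routing_log = [
    ("math", 3), ("math", 3), ("math", 3), ("math", 7),
    ("law", 5), ("law", 5), ("law", 5),
    ("the", 1), ("the", 4), ("the", 2),
]

def affinity(token):
    # Count which experts this token activated, then return its favorite
    # expert and the share of activations that expert received.
    experts = Counter(e for t, e in routing_log if t == token)
    top_expert, top_count = experts.most_common(1)[0]
    return top_expert, top_count / sum(experts.values())

print(affinity("math"))  # "math" goes to expert 3 in 3 of its 4 activations
```

A high-affinity token like "law" (100% to one expert) is predictable enough to schedule around; a filler word like "the" is not.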

How Sem-MoE Works (The Three Magic Tricks)

1. The Offline "Seating Chart" (Model Scheduling)
Before the restaurant even opens for the day, the managers look at a huge list of past orders. They notice that the "Math Chef," "Science Chef," and "History Chef" are often needed together.

  • The Move: They move these three chefs to sit at the same table (the same GPU).
  • The Result: Now, if an order comes in that needs all three, they can just whisper to each other across the table. No shouting across the ocean is needed!
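A minimal sketch of how such a seating chart could be built offline, assuming we have co-activation counts from past traffic (the counts, capacity, and greedy merging strategy below are all illustrative stand-ins, not the paper's actual placement algorithm):

```python
from itertools import combinations

# Hypothetical co-activation counts: how often two experts were needed by
# the same request in the historical log. Numbers are made up.
co_activation = {
    frozenset({"math", "science"}): 90, frozenset({"sushi", "baking"}): 80,
    frozenset({"math", "history"}): 70, frozenset({"science", "history"}): 60,
    frozenset({"math", "sushi"}): 5,   frozenset({"science", "baking"}): 3,
}
experts = ["math", "science", "history", "sushi", "baking"]
GPU_CAPACITY = 3  # experts per GPU (assumed)

def link(a, b):
    # Total co-activation between two groups of experts.
    return sum(co_activation.get(frozenset({x, y}), 0) for x in a for y in b)

# Greedy clustering: repeatedly merge the most strongly linked pair of
# groups that still fits on one GPU.
groups = [[e] for e in experts]
while True:
    candidates = [p for p in combinations(groups, 2)
                  if len(p[0]) + len(p[1]) <= GPU_CAPACITY and link(*p) > 0]
    if not candidates:
        break
    a, b = max(candidates, key=lambda p: link(*p))
    groups.remove(a); groups.remove(b)
    groups.append(a + b)

print(groups)  # math/science/history share one GPU; sushi/baking share another
```

The point of the sketch: once the frequently co-activated chefs sit at the same table, requests that need all of them never have to cross the ocean.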

2. The "Smart Grouping" for Groups (Inter-Request Scheduling)
When customers arrive in groups (Data Parallelism), the old system just sent them to random tables.

  • The Move: Sem-MoE looks at the group's order list before they sit down. If a group is ordering mostly "Math" and "Science" dishes, the system guides them to the island where the Math and Science chefs are sitting.
  • The Result: The whole group stays local. The waiters don't have to run back and forth.
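The idea above can be sketched as a tiny dispatcher (an illustration under assumed names, not Sem-MoE's actual scheduler): given the experts a batch is predicted to need, send the batch to whichever GPU already hosts the most of them.

```python
# Hypothetical expert placement, e.g. produced by the offline seating chart.
expert_to_gpu = {"math": 0, "science": 0, "history": 0, "sushi": 1, "baking": 1}

def pick_gpu(predicted_experts, num_gpus=2):
    # Count how many of the needed experts each GPU hosts locally,
    # then route the whole batch to the GPU with the best local coverage.
    local = [0] * num_gpus
    for e in predicted_experts:
        local[expert_to_gpu[e]] += 1
    return max(range(num_gpus), key=lambda g: local[g])

batch = ["math", "science", "math", "baking"]  # predicted via token-expert affinity
print(pick_gpu(batch))  # GPU 0 hosts 3 of the 4 predicted experts
```

Only the one "baking" order in this batch still needs a remote hop; the rest of the group stays local.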

3. The "Pre-emptive Shuffle" for Individuals (Intra-Request Scheduling)
Sometimes, a single customer has a long, complex order with many different dishes (Tensor Parallelism).

  • The Move: As the waiter is walking the order to the kitchen, they don't just hand it to the first chef they see. They use a "magic shuffle" to rearrange the dishes while they are walking, so that the "Sushi" part of the order is handed directly to the Sushi chef's station, and the "Baking" part goes to the Baker's station, all in one smooth motion.
  • The Result: The order is split up and sent to the right places before it even arrives, eliminating the need for a second trip back and forth.
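The shuffle can be sketched as a permutation computed before dispatch (again a toy illustration with invented tables, not the paper's kernel): sort the request's tokens by the GPU of their predicted expert, so each GPU receives one contiguous slice, and remember the permutation so the outputs can be unscrambled afterwards.

```python
# Hypothetical affinity and placement tables for one request.
token_expert = {"pizza": "italian", "roll": "sushi", "cake": "baking"}
expert_gpu = {"italian": 0, "sushi": 1, "baking": 1}

def shuffle_for_dispatch(tokens):
    # Tag each token position with its destination GPU, sort by GPU, and
    # bucket tokens into per-GPU slices. `order` records the original
    # positions so results can be un-shuffled after the experts run.
    tagged = sorted((expert_gpu[token_expert[t]], i, t)
                    for i, t in enumerate(tokens))
    order = [i for _, i, _ in tagged]
    slices = {}
    for gpu, _, tok in tagged:
        slices.setdefault(gpu, []).append(tok)
    return slices, order

slices, order = shuffle_for_dispatch(["roll", "pizza", "cake"])
print(slices)  # {0: ['pizza'], 1: ['roll', 'cake']}
```

Because the split happens "while the waiter is walking," each GPU gets exactly its own slice in one motion, rather than scattering tokens out and gathering them back.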

The Results: A Faster, Quieter Kitchen

By using this "Semantic Parallelism," the kitchen becomes incredibly efficient:

  • Less Shouting: The amount of time spent shouting orders across the ocean (communication) drops dramatically.
  • More Cooking: The chefs spend more time actually cooking.
  • Faster Service: Customers get their food much faster.

In their tests, this new system made the restaurant up to 2.78 times faster in some scenarios and reduced delays by nearly 25% in others. It proved that by simply understanding who likes to talk to whom (the data) and arranging the kitchen accordingly (the model), you can solve the biggest bottleneck in AI speed.

In short: Instead of forcing every order to travel to a random chef and back, Sem-MoE moves the chefs to the right neighborhoods and guides the orders to the right doors, turning a chaotic shouting match into a smooth, local conversation.
