Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

This paper introduces Semantic Parallelism, a novel model-data co-scheduling paradigm implemented in the Sem-MoE framework that significantly enhances MoE inference efficiency by proactively collocating experts and their activating tokens on the same devices to minimize costly all-to-all communication.

Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng

Published 2026-03-02

Imagine you run a massive, high-end restaurant called The MoE Kitchen. This kitchen is famous for its "Mixture of Experts" cooking style. Instead of one giant chef trying to cook every dish, you have hundreds of specialized chefs (the Experts). One chef is a master of spicy curries, another is a genius at baking, and a third is perfect for sushi.

When a customer orders a dish (a Token), a head waiter (the Router) quickly decides which specialized chef is best suited for that order.

The Problem: The "All-to-All" Chaos

In the old way of running this kitchen (called Expert Parallelism), the chefs are scattered across different islands in the ocean (different GPUs).

  • If the sushi chef is on Island A, but the customer's order for sushi arrives at the kitchen on Island B, the waiter has to shout the order across the ocean to Island A.
  • Then, Island A cooks the sushi and shouts the finished dish back to Island B to serve the customer.

If you have 8 islands and 1,000 orders coming in, the waiters spend more time shouting back and forth across the ocean (Communication) than actually cooking (Computation). This is the "All-to-All" bottleneck. The paper notes that in some cases, nearly 60% of the time is wasted just shouting orders across the water!
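To see why so much time goes to "shouting," here is a toy simulation (not from the paper) of the islands-and-orders setup: experts are spread round-robin across GPUs, routing is uniform, and we count how often a token's chosen expert lives on a different GPU than the token itself. All numbers are illustrative.

```python
import random

# Toy model of the all-to-all bottleneck. Placement and routing here are
# made up for illustration; only the shape of the problem matches the paper.
random.seed(0)
NUM_GPUS, EXPERTS_PER_GPU, NUM_TOKENS = 8, 16, 1000

def expert_home(expert_id):
    # Round-robin placement: expert e lives on GPU e % NUM_GPUS.
    return expert_id % NUM_GPUS

remote = 0
for _ in range(NUM_TOKENS):
    src_gpu = random.randrange(NUM_GPUS)                    # GPU the token arrives on
    expert = random.randrange(NUM_GPUS * EXPERTS_PER_GPU)   # router's pick
    if expert_home(expert) != src_gpu:
        remote += 1                                         # cross-GPU hop needed

print(f"{remote / NUM_TOKENS:.0%} of tokens need a cross-GPU hop")
```

With 8 GPUs and uniform routing, roughly 7 out of 8 tokens land on a remote expert, which is exactly why the naive layout spends so much time on communication.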

The Solution: Semantic Parallelism (The "Smart Seating" System)

The authors of this paper propose a new system called Semantic Parallelism, implemented in a framework they built called Sem-MoE.

Instead of treating the chefs and the orders as separate problems, they treat them as a team that needs to be co-scheduled. They realized something amazing: Certain words (tokens) always want to talk to certain chefs, no matter the context.

  • If the word "math" appears, it almost always wants the "Math Chef."
  • If the word "law" appears, it almost always wants the "Law Chef."

They call this Token-Expert Affinity. It's like knowing that "Pizza" always goes to the Italian chef, regardless of whether the customer is ordering it for lunch or dinner.
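One simple way to picture measuring this affinity (a sketch, not the paper's actual method; the routing log below is invented) is to replay past routing decisions and ask, for each token, what fraction of its activations went to its single favorite expert:

```python
from collections import Counter

# Hypothetical routing log of (token, expert-chosen-by-router) pairs.
# In the paper's observation, the same token tends to pick the same expert
# across many different contexts.
routing_log = [
    ("math", 3), ("math", 3), ("math", 3), ("math", 7),
    ("law", 5), ("law", 5), ("law", 5),
    ("the", 1), ("the", 4), ("the", 2),
]

def affinity(token):
    # Count which experts this token activated, then return its favorite
    # expert and the share of activations that expert received.
    experts = Counter(e for t, e in routing_log if t == token)
    top_expert, top_count = experts.most_common(1)[0]
    return top_expert, top_count / sum(experts.values())

print(affinity("math"))  # "math" goes to expert 3 in 3 of its 4 activations
```

A high-affinity token like "law" (100% to one expert) is predictable enough to schedule around; a filler word like "the" is not.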

How Sem-MoE Works (The Three Magic Tricks)

1. The Offline "Seating Chart" (Model Scheduling)
Before the restaurant even opens for the day, the managers look at a huge list of past orders. They notice that the "Math Chef," "Science Chef," and "History Chef" are often needed together.

  • The Move: They move these three chefs to sit at the same table (the same GPU).
  • The Result: Now, if an order comes in that needs all three, they can just whisper to each other across the table. No shouting across the ocean is needed!
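A minimal sketch of how such a seating chart could be built offline, assuming we have co-activation counts from past traffic (the counts, capacity, and greedy merging strategy below are all illustrative stand-ins, not the paper's actual placement algorithm):

```python
from itertools import combinations

# Hypothetical co-activation counts: how often two experts were needed by
# the same request in the historical log. Numbers are made up.
co_activation = {
    frozenset({"math", "science"}): 90, frozenset({"sushi", "baking"}): 80,
    frozenset({"math", "history"}): 70, frozenset({"science", "history"}): 60,
    frozenset({"math", "sushi"}): 5,   frozenset({"science", "baking"}): 3,
}
experts = ["math", "science", "history", "sushi", "baking"]
GPU_CAPACITY = 3  # experts per GPU (assumed)

def link(a, b):
    # Total co-activation between two groups of experts.
    return sum(co_activation.get(frozenset({x, y}), 0) for x in a for y in b)

# Greedy clustering: repeatedly merge the most strongly linked pair of
# groups that still fits on one GPU.
groups = [[e] for e in experts]
while True:
    candidates = [p for p in combinations(groups, 2)
                  if len(p[0]) + len(p[1]) <= GPU_CAPACITY and link(*p) > 0]
    if not candidates:
        break
    a, b = max(candidates, key=lambda p: link(*p))
    groups.remove(a); groups.remove(b)
    groups.append(a + b)

print(groups)  # math/science/history share one GPU; sushi/baking share another
```

The point of the sketch: once the frequently co-activated chefs sit at the same table, requests that need all of them never have to cross the ocean.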

2. The "Smart Grouping" for Groups (Inter-Request Scheduling)
When customers arrive in groups (Data Parallelism), the old system just sent them to random tables.

  • The Move: Sem-MoE looks at the group's order list before they sit down. If a group is ordering mostly "Math" and "Science" dishes, the system guides them to the island where the Math and Science chefs are sitting.
  • The Result: The whole group stays local. The waiters don't have to run back and forth.
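The idea above can be sketched as a tiny dispatcher (an illustration under assumed names, not Sem-MoE's actual scheduler): given the experts a batch is predicted to need, send the batch to whichever GPU already hosts the most of them.

```python
# Hypothetical expert placement, e.g. produced by the offline seating chart.
expert_to_gpu = {"math": 0, "science": 0, "history": 0, "sushi": 1, "baking": 1}

def pick_gpu(predicted_experts, num_gpus=2):
    # Count how many of the needed experts each GPU hosts locally,
    # then route the whole batch to the GPU with the best local coverage.
    local = [0] * num_gpus
    for e in predicted_experts:
        local[expert_to_gpu[e]] += 1
    return max(range(num_gpus), key=lambda g: local[g])

batch = ["math", "science", "math", "baking"]  # predicted via token-expert affinity
print(pick_gpu(batch))  # GPU 0 hosts 3 of the 4 predicted experts
```

Only the one "baking" order in this batch still needs a remote hop; the rest of the group stays local.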

3. The "Pre-emptive Shuffle" for Individuals (Intra-Request Scheduling)
Sometimes, a single customer has a long, complex order with many different dishes (Tensor Parallelism).

  • The Move: As the waiter is walking the order to the kitchen, they don't just hand it to the first chef they see. They use a "magic shuffle" to rearrange the dishes while they are walking, so that the "Sushi" part of the order is handed directly to the Sushi chef's station, and the "Baking" part goes to the Baker's station, all in one smooth motion.
  • The Result: The order is split up and sent to the right places before it even arrives, eliminating the need for a second trip back and forth.
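The shuffle can be sketched as a permutation computed before dispatch (again a toy illustration with invented tables, not the paper's kernel): sort the request's tokens by the GPU of their predicted expert, so each GPU receives one contiguous slice, and remember the permutation so the outputs can be unscrambled afterwards.

```python
# Hypothetical affinity and placement tables for one request.
token_expert = {"pizza": "italian", "roll": "sushi", "cake": "baking"}
expert_gpu = {"italian": 0, "sushi": 1, "baking": 1}

def shuffle_for_dispatch(tokens):
    # Tag each token position with its destination GPU, sort by GPU, and
    # bucket tokens into per-GPU slices. `order` records the original
    # positions so results can be un-shuffled after the experts run.
    tagged = sorted((expert_gpu[token_expert[t]], i, t)
                    for i, t in enumerate(tokens))
    order = [i for _, i, _ in tagged]
    slices = {}
    for gpu, _, tok in tagged:
        slices.setdefault(gpu, []).append(tok)
    return slices, order

slices, order = shuffle_for_dispatch(["roll", "pizza", "cake"])
print(slices)  # {0: ['pizza'], 1: ['roll', 'cake']}
```

Because the split happens "while the waiter is walking," each GPU gets exactly its own slice in one motion, rather than scattering tokens out and gathering them back.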

The Results: A Faster, Quieter Kitchen

By using this "Semantic Parallelism," the kitchen becomes incredibly efficient:

  • Less Shouting: The amount of time spent shouting orders across the ocean (communication) drops dramatically.
  • More Cooking: The chefs spend more time actually cooking.
  • Faster Service: Customers get their food much faster.

In their tests, this new system made the restaurant up to 2.78 times faster in some scenarios and reduced delays by nearly 25% in others. It proved that by simply understanding who likes to talk to whom (the data) and arranging the kitchen accordingly (the model), you can solve the biggest bottleneck in AI speed.

In short: Instead of forcing every order to travel to a random chef and back, Sem-MoE moves the chefs to the right neighborhoods and guides the orders to the right doors, turning a chaotic shouting match into a smooth, local conversation.
