Optimal Transport Aggregation for Distributed Mixture-of-Experts

This paper proposes an optimal transport-based aggregation framework that efficiently combines locally trained Mixture-of-Experts models into a global estimator with a single communication step, achieving performance comparable to centralized training while significantly reducing computational and communication costs.

Faïcel Chamroukhi, Nhat Thien Pham

Published Thu, 12 Ma

Imagine you are the CEO of a massive company with offices all over the world. You have a huge problem: you need to build a single, perfect "Expert System" to predict customer behavior, but your data is scattered across these different offices.

Here's the catch:

  1. Privacy & Size: You can't move all the data to one central server; it's too big, and some offices can't share their raw data due to privacy laws.
  2. The "Expert" Problem: You don't just want a simple average opinion. You want a Mixture-of-Experts (MoE) model. Think of this not as one generalist, but as a team of specialists.
    • Specialist A is great at predicting behavior for young people.
    • Specialist B is great for seniors.
    • Specialist C handles high-income clients.
    • The system has a "Gatekeeper" (a smart switch) that decides which specialist to listen to based on the customer's profile.

The Old Way (The Bottleneck)

Usually, to solve this, everyone would send their data to the center, or they would constantly chat back and forth, sending tiny updates to a central brain. This is slow, expensive, and clogs the internet (the "communication bottleneck").

The New Problem: The "Smoothie" Mistake

The authors realized that if you just take the local models from each office and average them (like blending a smoothie), you lose the structure.

  • If Office A has 4 specialists and Office B has 4 specialists, a simple average might give you a messy model with 8 confused specialists, or a model where the "Gatekeeper" doesn't know who to listen to anymore.
  • It's like taking four different orchestras, mixing their instruments into a giant pile, and expecting a new symphony to magically play itself. It just sounds like noise.
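The "smoothie" failure is easy to see in a toy example. Below is an illustrative sketch (not the authors' code): two offices learned the same two specialists but stored them in opposite order, and blending slot-by-slot destroys both of them.

```python
import numpy as np

# Toy setup: two offices learned the SAME two specialists,
# but stored them in a different order (a common symmetry in MoE models).
# The parameter vectors here are made up purely for illustration.
young_expert = np.array([1.0, 0.0])   # hypothetical "young customers" specialist
senior_expert = np.array([0.0, 1.0])  # hypothetical "seniors" specialist

office_a = np.stack([young_expert, senior_expert])  # order: young, senior
office_b = np.stack([senior_expert, young_expert])  # order: senior, young

# Naive "smoothie" average: slot 1 of A is blended with slot 1 of B,
# even though those slots hold DIFFERENT specialists.
naive = (office_a + office_b) / 2
print(naive)  # every row collapses to [0.5, 0.5] -- two identical, confused experts
```

Both specialists melt into the same bland generalist, which is exactly the structural loss the paper sets out to avoid.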

The Solution: Optimal Transport (The "Moving Company")

The authors propose a clever new method called Optimal Transport Aggregation.

Imagine you have a fleet of trucks (the local models) and you need to move cargo (the knowledge) to build a new, perfect warehouse (the global model).

  • The Goal: You want to build a new warehouse with exactly 4 perfect specialists (just like the original plan), but you only have the blueprints from 10 different local warehouses.
  • The Method: Instead of smashing the blueprints together, you use a "Moving Company" algorithm.
    • The algorithm looks at the local specialists and asks: "Which local 'Young Person' expert looks most like the 'Young Person' expert we need in our new global team?"
    • It calculates the "distance" or "cost" to move the knowledge from the local expert to the global expert.
    • It creates a map (a transportation plan) that pairs up the local experts with the global slots in the most efficient way possible.
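The matching steps above can be sketched in a few lines. Note the hedge: the paper computes a full optimal-transport plan, while this sketch uses SciPy's `linear_sum_assignment` as a simplified, hard one-to-one stand-in for it, and all function and variable names here are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_experts(local_experts, reference_experts):
    """Pair each local expert with its closest global slot.

    Cost = squared Euclidean distance between parameter vectors;
    the assignment minimizes the total 'moving cost'.
    """
    # cost[i, j] = cost of moving local expert i into reference slot j
    cost = np.linalg.norm(
        local_experts[:, None, :] - reference_experts[None, :, :], axis=-1
    ) ** 2
    local_idx, slot_idx = linear_sum_assignment(cost)  # the "transportation plan"
    # Reorder the local experts so that row j now matches reference slot j
    aligned = np.empty_like(local_experts)
    aligned[slot_idx] = local_experts[local_idx]
    return aligned

# Office B stored its two specialists in the opposite order to the reference.
reference = np.array([[1.0, 0.0], [0.0, 1.0]])
office_b  = np.array([[0.1, 0.9], [0.9, 0.1]])
print(align_experts(office_b, reference))  # rows swap back: [[0.9, 0.1], [0.1, 0.9]]
```

Once every office's experts are aligned to the same slots, averaging them slot-by-slot is safe, because "young-person experts" are only ever blended with other "young-person experts".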

The "Frugal" Approach

This is the magic part:

  1. One-Way Trip: The local offices do their work independently. They send their final blueprints (parameters) to the CEO once.
  2. No Chatting: They don't need to keep talking back and forth. The CEO takes all the blueprints, runs the "Moving Company" algorithm to align them perfectly, and builds the new global model.
  3. Speed: Because there's no constant chatting, this is incredibly fast and cheap. It's "frugal" (thrifty) with communication.
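The one-way, no-chatting protocol can be sketched as below. Everything here is illustrative (the function names, the 4-expert-by-8-parameter shape, and the random stand-in for local training are assumptions), and the server-side alignment step is elided to keep the focus on the single round of communication.

```python
import numpy as np

def local_training(seed):
    """Stand-in for fitting a local MoE on one office's data.
    Returns a hypothetical parameter matrix: 4 experts x 8 parameters."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(4, 8))

# Step 1 (one-way trip): each office trains independently and uploads ONCE.
uploads = [local_training(seed) for seed in range(3)]  # one message per office

# Step 2 (no chatting): the server does all remaining work locally.
# In the paper this is where the optimal-transport alignment runs;
# here we simply average slot-by-slot to show the single-round structure.
global_model = np.mean(uploads, axis=0)

print(len(uploads), "uploads total; global model shape:", global_model.shape)
```

Contrast this with classic federated learning, where the same three offices would exchange updates with the server over many rounds instead of exactly once.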

The Result

The paper shows that this new method:

  • Preserves the Team Structure: You still get exactly 4 clear specialists, not a messy 8-person blob.
  • Works as Well as the Center: The final model is almost as good as if you had gathered all the data in one room and trained it there.
  • Saves Time: It's 3 to 10 times faster than the old ways because it avoids the communication traffic jam.

In a Nutshell

Think of it like a Global Talent Show.

  • Old Way: Everyone sends their raw video footage to a central studio to edit. (Too much data, too slow).
  • Naive Way: The studio just averages all the videos together. (The result is a blurry mess).
  • This Paper's Way: Each local studio sends a "highlight reel" of its best acts. The central director uses a smart matching system to pair the best local acts with the slots in the final show, so the final lineup is perfect, structured, and ready to go, all with just one quick email exchange.

The authors even proved mathematically that this method is reliable and tested it on real data (like tracking sleep and activity), showing it works just as well as the heavy, slow methods but much faster.