Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts

This paper empirically evaluates the trade-offs between ensembling, merging, and routing strategies for fusing parameter-efficient LLM experts, finding that while non-uniform ensembling and merging improve performance, input-dependent routing offers the greatest gains and can be made computationally efficient through expert selection techniques.

Sanae Lotfi, Lucas Caccia, Alessandro Sordoni, Jordan T. Ash, Miroslav Dudik

Published 2026-03-05

Imagine you have a giant library of 256 different expert chefs. Each chef was hired to master one specific type of cuisine (Italian, Sushi, BBQ, etc.). They all started with the same basic training (the same "pre-trained" brain), but then they each took a specialized, lightweight course to become the best at their specific dish.

Now, you walk into this kitchen with a random order. You don't know what the customer wants (maybe they want Sushi, maybe they want BBQ), and you can't ask the customer what they want. You just have to serve a delicious meal.

The paper asks: What is the best way to use these 256 chefs to cook a single, perfect meal for a random order?

The researchers tested three main strategies, which they call Ensembling, Merging, and Routing. Here is how they work, explained simply:

1. Ensembling: The "Committee Vote"

The Idea: You ask all 256 chefs to cook their version of the dish. Then, you take a spoonful from every single pot, mix them together in a giant bowl, and serve that.

  • Pros: It's very smart. If the customer wants Sushi, the Sushi chef's contribution will be strong. If they want BBQ, the BBQ chef's flavor will dominate. The final mix is usually delicious.
  • Cons: It's expensive. You have to wake up 256 chefs, let them cook, and then mix everything. In computer terms, this means running the model 256 times for every single sentence, which takes forever and uses a lot of electricity.
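To make the "spoonful from every pot" idea concrete, here is a minimal sketch of output ensembling. It is an illustration under assumptions, not the paper's code: the function names, the toy logits, and the example weights are all made up for this summary. Each expert produces its own prediction, and the final answer is a weighted average of those predictions (equal weights for plain ensembling, unequal weights for the non-uniform variant).

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(expert_logits, weights=None):
    """Average the output distributions of several experts.

    expert_logits: array of shape (num_experts, num_classes), one row per expert
                   (hypothetical stand-in for each expert's forward pass).
    weights:       optional array of shape (num_experts,) that sums to 1.
                   If omitted, every expert gets the same weight.
    """
    probs = softmax(expert_logits, axis=-1)
    k = probs.shape[0]
    if weights is None:
        weights = np.full(k, 1.0 / k)
    # Weighted sum over the expert axis gives one distribution.
    return np.asarray(weights, dtype=float) @ probs

# Toy example: 4 experts, 3 possible outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
uniform = ensemble_predict(logits)                                  # equal vote
skewed = ensemble_predict(logits, np.array([0.7, 0.1, 0.1, 0.1]))  # trust expert 0 more
```

Note that every expert's logits must be computed before anything can be averaged, which is exactly where the cost comes from.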

2. Merging: The "Smoothie"

The Idea: Instead of cooking separately, you take the recipes (the weights) of all 256 chefs and blend them into one giant "Super Chef" recipe before you even start cooking.

  • The Catch: The researchers found that if you just mix the recipes equally (like a smoothie where every ingredient is 1/256th), the result is often bland. Why? Because the "Sushi" recipe and the "BBQ" recipe might actually fight each other when mixed. The "Sushi" part of the recipe might ruin the "BBQ" part.
  • The Fix: You can try to mix the recipes unevenly (giving more weight to the chefs you think are better), but even then, it's hard to get it right without knowing exactly what the customer wants.
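The recipe blending described above can be sketched in a few lines. This is a simplified illustration, assuming each expert is represented as an update (a "delta") added on top of shared base weights; the function name `merge_experts` and the tiny matrices are invented for the example and do not come from the paper.

```python
import numpy as np

def merge_experts(base, deltas, weights=None):
    """Blend expert weight updates into a single set of weights.

    base:    shared pre-trained weights, shape (d_out, d_in).
    deltas:  per-expert updates, shape (num_experts, d_out, d_in).
    weights: optional mixing coefficients, shape (num_experts,).
             If omitted, every expert gets 1/num_experts (uniform merging).
    """
    deltas = np.asarray(deltas, dtype=float)
    k = deltas.shape[0]
    if weights is None:
        weights = np.full(k, 1.0 / k)
    weights = np.asarray(weights, dtype=float)
    # Contract the expert axis: sum_k weights[k] * deltas[k].
    return base + np.tensordot(weights, deltas, axes=1)

base = np.zeros((2, 2))
deltas = np.array([[[1.0, 0.0], [0.0, 1.0]],     # expert 0
                   [[3.0, 0.0], [0.0, -1.0]]])   # expert 1
uniform_merge = merge_experts(base, deltas)              # equal blend
skewed_merge = merge_experts(base, deltas, [0.9, 0.1])   # favor expert 0
```

In the equal blend, the bottom-right entries (+1 and -1) cancel to zero, a toy version of two recipes fighting each other. The uneven blend keeps most of expert 0's update instead.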

3. Routing: The "Smart Sous-Chef"

The Idea: This is the winner. Instead of mixing everyone's recipes or asking everyone to cook, you hire a Smart Sous-Chef (a router).

  • How it works: When an order comes in, the Sous-Chef looks at the ingredients (the input text) and instantly decides: "Ah, this looks like a Sushi order. I'll let the Sushi chef do 90% of the work and the others do 10%." For a BBQ order, the Sous-Chef switches the mix instantly.
  • Pros: It gets the best of both worlds. It's as smart as the "Committee" (Ensembling) because it picks the right experts for the job, but it's as fast as the "Smoothie" (Merging) because it only runs one model.
  • The Result: The paper found that this Routing strategy was the most accurate, beating both the Committee and the Smoothie.
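Here is a rough sketch of what the Sous-Chef does, under some loud assumptions: the router below scores each expert by a dot product between the input and a per-expert "prototype" vector, and the optional `top_k` keeps only the highest-scoring experts. Those choices, and all the names, are hypothetical illustrations rather than the paper's actual router. The key point it shows is that the mixing weights depend on the input, and only one blended model is run.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def route(x, prototypes, top_k=None):
    """Return input-dependent mixing weights over experts.

    prototypes: (num_experts, d) hypothetical per-expert vectors.
    top_k:      if set, only the top_k best-scoring experts get nonzero weight.
    """
    scores = prototypes @ x
    if top_k is not None:
        keep = np.argsort(scores)[-top_k:]
        masked = np.full_like(scores, -np.inf)
        masked[keep] = scores[keep]
        scores = masked  # dropped experts get exp(-inf) = 0 weight
    return softmax(scores)

def routed_forward(x, base, deltas, prototypes, top_k=None):
    # Blend expert updates with weights chosen for THIS input, then run once.
    w = route(x, prototypes, top_k)
    merged = base + np.tensordot(w, deltas, axes=1)
    return merged @ x, w

rng = np.random.default_rng(0)
num_experts, d = 4, 3
base = rng.normal(size=(d, d))
deltas = rng.normal(size=(num_experts, d, d))
prototypes = rng.normal(size=(num_experts, d))
x = rng.normal(size=d)

y_all, w_all = routed_forward(x, base, deltas, prototypes)           # soft mix of all experts
y_top, w_top = routed_forward(x, base, deltas, prototypes, top_k=1)  # pick a single expert
```

A different `x` produces different weights, which is what separates routing from the fixed blend used in merging.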

The Big Trade-Offs (The "So What?")

The paper dives deep into the pros and cons of these methods:

  • Is the "Committee" (Ensembling) worth the cost?

    • Verdict: It's great, but too expensive for everyday use. It's like hiring 256 people to write a single email. It works, but it's overkill.
    • Bonus: You can train a single "Smart Chef" to mimic the Committee (this is called Distillation). This gives you 90% of the Committee's taste with only 10% of the cost.
  • Is the "Smoothie" (Merging) dead?

    • Verdict: Not dead, but it's limited. If you just mix the recipes equally, the result is bland. However, if you use a smart algorithm to figure out how much weight each recipe should get, it gets better. But it still can't match the flexibility of the Sous-Chef (Routing).
  • Can we reduce the number of chefs?

    • Verdict: Yes! The paper found that you don't actually need all 256 chefs. Many of them are redundant (e.g., two chefs who both make "Spicy Tacos" are basically the same). By grouping similar chefs into "clusters" (like a "Mexican Food Team"), you can reduce the library from 256 chefs down to just 10 teams without losing much quality. This makes the whole system much faster and cheaper.
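The "teams of chefs" idea can be sketched as clustering. The summary does not say which clustering method the paper uses, so the example below assumes plain k-means over flattened expert parameters, with each cluster's center (the average of its members) acting as the team's single merged expert. Treat the function name and the toy data as illustrative only.

```python
import numpy as np

def cluster_experts(expert_vecs, n_clusters, n_iters=20, seed=0):
    """Group similar experts with a small k-means (an assumed choice).

    expert_vecs: (num_experts, num_params) flattened expert parameters.
    Returns (labels, centers); centers[c] is the mean of cluster c's members.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(expert_vecs, dtype=float)
    # Start from n_clusters distinct experts picked at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Squared distance from every expert to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[c] = members.mean(axis=0)
    return labels, centers

# Toy library: two near-duplicate "taco" experts and two near-duplicate "sushi" experts.
experts = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
labels, centers = cluster_experts(experts, n_clusters=2)
```

After clustering, the router or merger only has to deal with `n_clusters` team experts instead of the full library.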

The Final Takeaway

If you want the best accuracy, use Routing (the Smart Sous-Chef). It dynamically picks the right experts for every single input, and because it still runs only one model, it avoids the Committee's heavy cost.

If you need something fast and efficient, you can use Merging (blending the recipes), but you have to be careful about how you blend them.

And if you want to save money on storage, group your experts into clusters first. You don't need 256 unique chefs; you just need a few well-trained teams that can handle a wide variety of orders.

In short: Don't just mix everyone together (Merging), and don't ask everyone to cook at once (Ensembling). Instead, hire a smart manager (Routing) who knows exactly which expert to call for the job at hand.
