Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts

Imagine you have a giant library of 256 different expert chefs. Each chef was hired to master one specific type of cuisine (Italian, Sushi, BBQ, etc.). They all started with the same basic training (the same "pre-trained" brain), but then they each took a specialized, lightweight course to become the best at their specific dish.

Now, you walk into this kitchen with a random order. You don't know what the customer wants (maybe they want Sushi, maybe they want BBQ), and you can't ask the customer what they want. You just have to serve a delicious meal.

The paper asks: What is the best way to use these 256 chefs to cook a single, perfect meal for a random order?

The researchers tested three main strategies, which they call Ensembling, Merging, and Routing. Here is how they work, explained simply:

1. Ensembling: The "Committee Vote"

The Idea: You ask all 256 chefs to cook their version of the dish. Then, you take a spoonful from every single pot, mix them together in a giant bowl, and serve that.

Pros: It's very smart. If the customer wants Sushi, the Sushi chef's contribution will be strong. If they want BBQ, the BBQ chef's flavor will dominate. The final mix is usually delicious.
Cons: It's expensive. You have to wake up 256 chefs, let them cook, and then mix everything. In computer terms, this means running the model 256 times for every single sentence, which takes forever and uses a lot of electricity.

2. Merging: The "Smoothie"

The Idea: Instead of cooking separately, you take the recipes (the weights) of all 256 chefs and blend them into one giant "Super Chef" recipe before you even start cooking.

The Catch: The researchers found that if you just mix the recipes equally (like a smoothie where every ingredient is 1/256th), the result is often bland. Why? Because the "Sushi" recipe and the "BBQ" recipe might actually fight each other when mixed. The "Sushi" part of the recipe might ruin the "BBQ" part.
The Fix: You can try to mix the recipes unevenly (giving more weight to the chefs you think are better), but even then, it's hard to get it right without knowing exactly what the customer wants.

3. Routing: The "Smart Sous-Chef"

The Idea: This is the winner. Instead of mixing everyone's recipes or asking everyone to cook, you hire a Smart Sous-Chef (a router).

How it works: When an order comes in, the Sous-Chef looks at the ingredients (the input text) and instantly decides: "Ah, this looks like a Sushi order. I'll let the Sushi chef do 90% of the work and the others do 10%." For a BBQ order, the Sous-Chef switches the mix instantly.
Pros: It gets the best of both worlds. It's as smart as the "Committee" (Ensembling) because it picks the right experts for the job, but it's as fast as the "Smoothie" (Merging) because it only runs one model.
The Result: The paper found that this Routing strategy was the most accurate, beating both the Committee and the Smoothie.

The Big Trade-Offs (The "So What?")

The paper dives deep into the pros and cons of these methods:

Is the "Committee" (Ensembling) worth the cost?
- Verdict: It's great, but too expensive for everyday use. It's like hiring 256 people to write a single email. It works, but it's overkill.
- Bonus: You can train a single "Smart Chef" to mimic the Committee (this is called Distillation). This gives you most of the Committee's taste with only one chef cooking per order, at the price of extra training up front.
Is the "Smoothie" (Merging) dead?
- Verdict: Not dead, but it's limited. If you just mix the recipes equally, it fails. However, if you use a smart algorithm to figure out how to mix them, it gets better. But it still can't match the flexibility of the Sous-Chef (Routing).
Can we reduce the number of chefs?
- Verdict: Yes! The paper found that you don't actually need all 256 chefs. Many of them are redundant (e.g., two chefs who both make "Spicy Tacos" are basically the same). By grouping similar chefs into "clusters" (like a "Mexican Food Team"), you can reduce the library from 256 chefs down to just 10 teams without losing much quality. This makes the whole system much faster and cheaper.

The Final Takeaway

If you want the best performance and can afford to train a router, use Routing (the Smart Sous-Chef). It dynamically picks the right experts for every single input.

If you need something fast and efficient, you can use Merging (blending the recipes), but you have to be careful about how you blend them.

And if you want to save money on storage, group your experts into clusters first. You don't need 256 unique chefs; you just need a few well-trained teams that can handle a wide variety of orders.

In short: Don't just mix everyone together (Merging), and don't ask everyone to cook at once (Ensembling). Instead, hire a smart manager (Routing) who knows exactly which expert to call for the job at hand.

1. Problem Statement

The paper addresses the challenge of integrating multiple Large Language Models (LLMs) that have been independently fine-tuned on diverse tasks using parameter-efficient methods (specifically LoRA).

Context: With millions of publicly available LoRA experts (e.g., on HuggingFace), there is a need to combine them for multi-task learning without access to the original training data or task identifiers at inference time.
The Core Question: How can we optimally fuse $N$ independent experts to achieve strong, task-agnostic performance while minimizing computational costs?
The Trade-off: Existing strategies involve a trade-off between performance and efficiency:
- Ensembling: High performance but requires $N$ forward passes (high inference cost).
- Merging: Low inference cost (single forward pass) but relies on the "mode connectivity" hypothesis, which may fail in multi-task settings.
- Routing: Input-dependent fusion that offers flexibility but introduces complexity in learning routing coefficients.

2. Methodology

The authors evaluate three primary model fusion strategies using a library of 256 LoRA experts fine-tuned on Flan v2 tasks from a Phi-2 (2.8B) base model. They also utilize a reduced set of 10 Model-Based Clustering (MBC) experts to test scalability.

A. Fusion Strategies

Ensembling: Aggregates outputs (probabilities) from independent experts.
- Uniform: Equal weights ( $\lambda_i = 1/N$ ).
- Learned: Weights $\lambda_i$ are optimized via SGD or Knowledge Distillation.
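As a minimal sketch of output-space ensembling, the snippet below mixes per-expert next-token distributions with weights $\lambda_i$. The function name `ensemble_probs` and the toy probabilities are illustrative assumptions, not taken from the paper; in practice each row would come from a separate forward pass of one expert.

```python
import numpy as np

def ensemble_probs(expert_probs, weights=None):
    """Mix per-expert next-token distributions into one distribution.

    expert_probs: array of shape (N, V); each row is one expert's distribution.
    weights: optional array of shape (N,); defaults to uniform 1/N.
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    n_experts = expert_probs.shape[0]
    if weights is None:
        weights = np.full(n_experts, 1.0 / n_experts)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # keep the mixture a valid distribution
    return weights @ expert_probs

# Toy example (hypothetical numbers): 3 experts, vocabulary of 4 tokens.
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
uniform = ensemble_probs(probs)                          # lambda_i = 1/N
weighted = ensemble_probs(probs, weights=[0.8, 0.1, 0.1])  # non-uniform lambda_i
```

With learned ensembling, the fixed `weights` vector would instead be a trainable parameter updated by SGD on the multi-task loss.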
Merging: Fuses model weights in parameter space ( $w^{*} = \sum_{i} \lambda_i w_i$ ).
- Uniform: Simple averaging of LoRA matrices ( $A$ and $B$ ).
- Learned: Weights $\lambda_i$ are optimized via SGD (Global or Layer-dependent).
- Note: The paper investigates whether merging in the low-rank subspace ( $A, B$ ) vs. full-rank space yields different results.
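To see why the two merging spaces can differ, here is a small sketch assuming each expert's update is the product $B_i A_i$ of its LoRA factors. The shapes and random factors are hypothetical. Averaging $A$ and $B$ separately introduces cross terms $B_i A_j$ between different experts, which averaging the full products does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_out, d_in, rank = 4, 6, 5, 2

# Hypothetical LoRA factors: expert i contributes the update B[i] @ A[i].
A = rng.normal(size=(n_experts, rank, d_in))
B = rng.normal(size=(n_experts, d_out, rank))
lam = np.full(n_experts, 1.0 / n_experts)  # uniform merging weights

# Merge in the low-rank subspace: average A and B separately, then multiply.
A_merged = np.einsum("i,ird->rd", lam, A)
B_merged = np.einsum("i,ior->or", lam, B)
delta_low_rank = B_merged @ A_merged

# Merge in full-rank space: average the products B_i A_i directly.
delta_full_rank = np.einsum("i,ior,ird->od", lam, B, A)

# The two merged updates generally disagree because of the cross terms.
gap = np.linalg.norm(delta_low_rank - delta_full_rank)
print(f"difference between the two merges: {gap:.3f}")
```

The low-rank merge keeps the result at rank `rank`, while the full-rank merge can have rank up to `n_experts * rank`, so the second form is more expressive but no longer stored as a single small LoRA.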
Routing: An input-dependent form of merging where coefficients $\lambda_i(x)$ vary based on the input token.
- Optimization: The routing coefficients are learned via SGD to minimize multi-task loss.
- Baselines: Compared against Arrow, a zero-shot routing mechanism that uses Singular Value Decomposition (SVD) of LoRA matrices to estimate routing weights without joint training.
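The input-dependent merge can be sketched for a single linear layer as follows. This is an illustration under assumptions: the router is taken to be a linear map followed by a softmax (`W_router` is a hypothetical name), and the experts' full products $B_i A_i$ are mixed per token. The summary does not state the exact parameterization the paper uses.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def routed_lora_forward(x, W0, A, B, W_router):
    """Apply a frozen linear layer plus a per-token mixture of LoRA experts.

    x: (T, d_in) token states        W0: (d_out, d_in) frozen base weight
    A: (N, r, d_in), B: (N, d_out, r) LoRA factors of N experts
    W_router: (d_in, N) hypothetical linear router (an assumption of this sketch)
    """
    lam = softmax(x @ W_router)                       # (T, N): lambda_i(x) per token
    low = np.einsum("nrd,td->tnr", A, x)              # A_i x for every expert
    expert_out = np.einsum("nor,tnr->tno", B, low)    # B_i A_i x for every expert
    delta = np.einsum("tn,tno->to", lam, expert_out)  # weight experts by lambda_i(x)
    return x @ W0.T + delta, lam

rng = np.random.default_rng(0)
T, d_in, d_out, r, N = 3, 5, 4, 2, 6
x = rng.normal(size=(T, d_in))
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(N, r, d_in))
B = rng.normal(size=(N, d_out, r))
W_router = rng.normal(size=(d_in, N))

y, lam = routed_lora_forward(x, W0, A, B, W_router)
print(y.shape, lam.sum(axis=1))  # each token's routing weights sum to 1
```

In an SGD-optimized router, `W_router` would be the only trained parameter, with the base weight and all expert factors kept frozen.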

B. Experimental Setup

Task-Agnostic Setting: The models do not know the task ID during inference, forcing the fusion method to generalize across all 256 tasks.
Baselines:
- Oracle: Perfect knowledge of task ID (selects the best expert).
- Shared Expert: A single LoRA model trained on all 256 tasks.
Evaluation Metric: Average multi-task test loss (Negative Log-Likelihood) across all 256 tasks.
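A minimal sketch of the metric, assuming every task counts equally (a macro-average). Whether the paper weights tasks by token count is not stated here. The task names and probabilities are made up for illustration.

```python
import numpy as np

def mean_nll(token_probs):
    """Mean negative log-likelihood of the reference tokens for one task."""
    return float(-np.mean(np.log(token_probs)))

def multitask_loss(per_task_token_probs):
    """Average the per-task NLL so that every task counts equally."""
    return float(np.mean([mean_nll(p) for p in per_task_token_probs.values()]))

# Hypothetical probabilities a model assigned to the correct tokens of two tasks.
probs_by_task = {"task_a": [0.5, 0.25], "task_b": [1.0, 1.0, 0.5]}
loss = multitask_loss(probs_by_task)
print(f"average multi-task NLL: {loss:.4f}")
```

Lower is better: a model that puts probability 1 on every reference token would score 0.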

3. Key Contributions & Findings

A. Ensembling vs. Merging

Uniform Ensembling is a surprisingly strong baseline, outperforming all merging strategies and the shared expert baseline, though it requires $N$ forward passes.
Learned Ensembling: Optimizing weights via SGD improves performance further, narrowing the gap to the Oracle.
Distillation: Compressing an ensemble into a single model via knowledge distillation recovers most of the ensemble's performance with only one forward pass, though it doubles the training cost.
Merging Limitations: Uniform merging performs poorly, suggesting that mode connectivity (the assumption that models lie on a low-loss path) breaks down when models are fine-tuned on disjoint, diverse tasks. Even SGD-optimized merging underperforms ensembling.
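The mode-connectivity failure can be illustrated with a one-dimensional toy loss. This is not an experiment from the paper. Two solutions are each optimal, yet their average sits on a ridge between them; the height of that ridge is the loss barrier that uniform merging runs into.

```python
import numpy as np

def loss(w):
    # Toy non-convex loss with two equally good minima at w = -1 and w = +1.
    return (w**2 - 1.0) ** 2

w_a, w_b = -1.0, 1.0                    # two "experts", each at a minimum
alphas = np.linspace(0.0, 1.0, 11)      # interpolation coefficients
path = (1 - alphas) * w_a + alphas * w_b
path_loss = loss(path)

# Barrier: worst loss on the straight line minus the worse of the two endpoints.
barrier = path_loss.max() - max(loss(w_a), loss(w_b))
midpoint_loss = loss(0.5 * (w_a + w_b))  # what uniform merging would return
print(f"barrier = {barrier:.2f}, loss of the averaged weights = {midpoint_loss:.2f}")
```

When the barrier is zero the two solutions are linearly mode-connected and averaging is safe; a large barrier means the merged weights can be worse than either expert alone.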

B. The Superiority of Routing

Routing is Necessary: Input-dependent routing (SGD-optimized) consistently outperforms both merging and ensembling. It nearly closes the gap with the Oracle baseline.
Flexibility: Routing allows the model to dynamically select or weight experts based on the specific input, effectively handling the lack of mode connectivity in multi-task settings.
Robustness: Unlike the Arrow baseline, which relies heavily on selecting a specific top-$k$ subset of experts, SGD-optimized routing is robust to the number of selected experts and assigns more stable, well-calibrated weights.
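For contrast with learned routing, here is a rough sketch of a zero-shot, SVD-based top-$k$ router in the spirit of Arrow. The specifics are assumptions of this sketch rather than details confirmed by this summary: each expert's prototype is taken to be the top right singular vector of $B_i A_i$, and experts are scored by the absolute dot product with the hidden state. The function names are hypothetical.

```python
import numpy as np

def arrow_style_prototypes(A, B):
    """One prototype per expert: the top right singular vector of B_i @ A_i."""
    protos = []
    for A_i, B_i in zip(A, B):
        _, _, vt = np.linalg.svd(B_i @ A_i, full_matrices=False)
        protos.append(vt[0])           # input direction the expert amplifies most
    return np.stack(protos)            # (N, d_in)

def top_k_route(h, prototypes, k):
    """Score experts by |h . v_i|, keep the top k, softmax over the survivors."""
    scores = np.abs(prototypes @ h)    # the sign of a singular vector is arbitrary
    keep = np.argsort(scores)[-k:]
    weights = np.zeros_like(scores)
    z = scores[keep] - scores[keep].max()
    weights[keep] = np.exp(z) / np.exp(z).sum()
    return weights

# Toy library: rank-1 experts where expert i reads only input dimension i.
d_in, d_out, N = 4, 3, 4
rng = np.random.default_rng(0)
A = [np.eye(d_in)[i:i + 1] for i in range(N)]
B = [rng.normal(size=(d_out, 1)) for _ in range(N)]
prototypes = arrow_style_prototypes(A, B)

h = np.array([0.1, 0.0, 3.0, 0.0])     # hidden state dominated by dimension 2
weights = top_k_route(h, prototypes, k=2)
print(weights)
```

Because the weights are renormalized over whichever experts survive the cut, changing `k` changes every weight, which is one way to picture the sensitivity to top-$k$ described above.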

C. Expert Selection and Efficiency

Clustering (MBC): Reducing 256 experts to 10 via clustering improves performance in ensembling and merging (by reducing noise) but slightly hurts Arrow-based routing (which benefits from fine-grained selection).
Hierarchical Clustering: A method that merges experts within clusters without retraining (using SGD) offers a practical alternative. While it doesn't match the performance of retrained MBC experts, it significantly outperforms standard merging.
Diminishing Returns: Experiments show that using only ~60% of the experts (150/256) is sufficient to recover near-optimal performance, suggesting that expert refactoring can drastically reduce computational overhead without sacrificing accuracy.
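The idea of grouping similar experts and merging within each group can be sketched as below. It assumes experts are compared by cosine similarity of their flattened parameters and uses a simple spherical k-means with uniform within-cluster averaging. This is a stand-in, not the paper's MBC or hierarchical procedure, and the function names and toy data are hypothetical.

```python
import numpy as np

def cluster_experts(flat_params, n_clusters, n_iters=10):
    """Spherical k-means over flattened expert parameters (one row per expert)."""
    X = flat_params / np.linalg.norm(flat_params, axis=1, keepdims=True)
    # Deterministic farthest-point initialization, starting from expert 0.
    centers = [X[0]]
    for _ in range(n_clusters - 1):
        sim = np.max(np.stack([X @ c for c in centers]), axis=0)
        centers.append(X[np.argmin(sim)])
    centers = np.stack(centers)
    for _ in range(n_iters):
        labels = np.argmax(X @ centers.T, axis=1)  # nearest center by cosine similarity
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):
                m = members.mean(axis=0)
                centers[c] = m / np.linalg.norm(m)
    return labels

def merge_within_clusters(flat_params, labels):
    """Uniformly average the experts assigned to each cluster."""
    return np.stack([flat_params[labels == c].mean(axis=0) for c in np.unique(labels)])

# Toy library: 6 experts forming two obvious groups in a 4-dimensional parameter space.
rng = np.random.default_rng(0)
base = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
experts = np.concatenate([
    base[0] + 0.05 * rng.normal(size=(3, 4)),
    base[1] + 0.05 * rng.normal(size=(3, 4)),
])
labels = cluster_experts(experts, n_clusters=2)
merged = merge_within_clusters(experts, labels)
print(labels, merged.shape)  # 6 experts reduced to 2 merged experts
```

Any of the fusion strategies can then be run on the smaller `merged` library instead of the full one.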

4. Results Summary

Based on Figure 2 and the text:

Best Performance: SGD-optimized Routing achieves the lowest loss among all non-Oracle methods.
Runner Up: Uniform Ensembling is competitive but computationally expensive.
Merging: Both Uniform and SGD-optimized merging underperform ensembling, confirming that simple parameter averaging is insufficient for diverse multi-task LoRA experts.
Oracle: Remains the best-case reference (lowest loss), but the gap between Routing and Oracle is small.

5. Significance and Implications

Theoretical Insight: The paper challenges the universality of the mode connectivity hypothesis in multi-task learning, demonstrating that while models may be connected in single-task settings, diverse task fine-tuning creates barriers that simple merging cannot cross.
Practical Guidance:
- For maximum performance with moderate compute: Use Routing (specifically SGD-optimized).
- For inference efficiency where training cost is less of a concern: Use Distilled Ensembles.
- For low-resource inference: Use Uniform Merging only if the tasks are highly similar; otherwise, it is inferior to ensembling.
Scalability: The findings suggest that large libraries of experts can be pruned or clustered (e.g., from 256 to 10-15) without significant performance loss, making multi-task LLM deployment more feasible.

In conclusion, the paper establishes that input-dependent routing is the most effective strategy for fusing parameter-efficient experts in a task-agnostic setting, offering a superior balance of performance and flexibility compared to static merging or expensive ensembling.