Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

This paper investigates model merging as a scalable alternative to full fine-tuning for multi-domain ASR, benchmarking 11 algorithms across 10 European Portuguese domains and introducing a novel "BoostedTSV-M" method that outperforms full fine-tuning while preserving out-of-distribution generalization.

Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad

Published 2026-03-06
📖 4 min read · ☕ Coffee break read

Imagine you have a brilliant, multi-talented chef (the Large Speech Foundation Model, like Whisper). This chef is amazing at cooking general meals and understands many languages. However, they aren't perfect at every specific dish. If you want them to master Portuguese cooking, you usually have to send them to a specialized culinary school for a long time.

Here's the problem: If you want them to master European Portuguese, Brazilian Portuguese, African Portuguese, and children's speech all at once, you can't just send them to four different schools at the same time.

  • The Old Way (Full Fine-Tuning): You try to train one chef to do everything at once. They get good at European Portuguese, but they might forget how to cook Brazilian dishes or English ones. Plus, if you want to add a new specialty later, you have to retrain the whole chef from scratch, which is expensive and slow.
  • The Messy Alternative: You hire four different chefs, each trained for one specific style. Now you have a team, but managing four different people for every single order is a logistical nightmare. You have to figure out which chef to call for every request.

The Solution: Model Merging (The "Recipe Fusion")
This paper explores a clever trick called Model Merging. Instead of retraining or hiring new chefs, we take the four specialized chefs (who are already experts in their specific domains) and merge their brains into one single, super-chef.

The goal is to create one model that is as good as the specialists at their specific jobs but still remembers how to cook general meals (multilingual capabilities).

The Experiment: The "European Portuguese" Kitchen

The researchers tested this idea on 10 different European Portuguese domains and scenarios (such as broadcast news, elderly speech, and children's speech). They took a base model, fine-tuned it separately on each of the 10 domains, and then tried to mix the resulting specialists back together using 11 different "mixing algorithms."

Think of these algorithms as different ways to blend smoothies:

  1. Simple Blending (Averaging): Just throw all the ingredients in and blend. (Good, but maybe not perfect).
  2. Smart Blending (Task Vectors): Instead of mixing the whole brain, they look at the changes made during training and try to combine just those changes carefully.
  3. The New Secret Sauce (BoostedTSV-M): This is the paper's big invention.
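Under simplified assumptions, the first two blending styles can be sketched in plain NumPy. The matrices below are toy stand-ins for real model weights (real models have many such matrices per layer), and `alpha` is an illustrative scaling coefficient, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: one "base" weight matrix and four domain-specialized
# fine-tunes of it.
base = rng.normal(size=(8, 8))
specialists = [base + 0.1 * rng.normal(size=(8, 8)) for _ in range(4)]

# 1. Simple blending: average the full weights of all specialists.
avg_merge = np.mean(specialists, axis=0)

# 2. Smart blending: average only the *changes* (task vectors) each
#    specialist made relative to the base, then add them back,
#    scaled by a coefficient.
task_vectors = [w - base for w in specialists]
alpha = 0.5  # merge strength; alpha = 1 recovers plain averaging here
tv_merge = base + alpha * np.mean(task_vectors, axis=0)
```

With `alpha = 1` the two recipes coincide; task-vector methods become interesting precisely when the changes are scaled, trimmed, or re-weighted before being mixed back in.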

The Problem with Blending: "Rank Collapse"

When you mix these specialized brains, something weird happens. The "specialized signals" (the unique details that make a chef good at elderly speech) get drowned out by the "general noise." It's like trying to hear a whisper in a loud room; the whisper gets lost. In technical terms, this is called Rank Collapse. The model becomes too simple and forgets the specific details.
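A toy NumPy experiment makes the drowning-out effect concrete. The low-rank matrices here are made up, standing in for real task vectors, which tend to concentrate their useful change in a few strong directions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

def low_rank_update(rank=2):
    """A made-up 'specialist change': a few strong, domain-specific
    directions, mimicking the low-rank structure of real task vectors."""
    return rng.normal(size=(d, rank)) @ rng.normal(size=(rank, d))

task_vectors = [low_rank_update() for _ in range(8)]
merged = np.mean(task_vectors, axis=0)

# Compare the singular-value spectrum of one specialist with that of
# the naive average of all eight.
sv_single = np.linalg.svd(task_vectors[0], compute_uv=False)
sv_merged = np.linalg.svd(merged, compute_uv=False)

# The specialists' leading directions rarely align, so averaging
# shrinks them: the merged spectrum is weaker and flatter, the
# "whisper in a loud room" from the analogy above.
print("top singular value, specialist:", sv_single[0])
print("top singular value, merged:   ", sv_merged[0])
```

The specialized signal that dominated each individual task vector is a fraction of its former strength after averaging; that shrinking spectrum is what "rank collapse" looks like numerically.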

The Innovation: BoostedTSV-M

The authors created a new method called BoostedTSV-M.

  • The Analogy: Imagine you are mixing a cocktail. Usually, the strong flavors (like the main spirit) overpower the subtle notes (like a hint of vanilla).
  • The Fix: BoostedTSV-M acts like a "flavor booster." It identifies those subtle, weak notes (small singular values) that are about to get lost and amplifies them before mixing.
  • The Result: The final cocktail retains the complexity of all the ingredients. The merged model doesn't just know "Portuguese"; it knows exactly how to handle the specific quirks of European Portuguese without losing its ability to speak English or other languages.
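As a rough illustration only, here is one way "amplify the subtle notes before mixing" could look in code. The boosting rule below (raising normalized singular values to the power 1 − β) is a hypothetical stand-in for the flavor of the idea, not the paper's actual BoostedTSV-M formula:

```python
import numpy as np

rng = np.random.default_rng(2)

def boost_small_singular_values(tv, beta=0.5):
    """Lift a task vector's weak 'subtle notes' before merging.

    Illustrative rule only (not the paper's exact method): raise each
    normalized singular value to the power (1 - beta). beta = 0 leaves
    the spectrum untouched; beta -> 1 flattens it completely, which
    mirrors the target-vs-general trade-off the authors tune.
    """
    u, s, vt = np.linalg.svd(tv, full_matrices=False)
    boosted = s.max() * (s / s.max()) ** (1.0 - beta)
    return u @ np.diag(boosted) @ vt

base = rng.normal(size=(16, 16))
specialists = [base + 0.1 * rng.normal(size=(16, 16)) for _ in range(4)]
task_vectors = [w - base for w in specialists]

# Boost each specialist's spectrum first, then merge as usual.
merged = base + np.mean(
    [boost_small_singular_values(tv, beta=0.5) for tv in task_vectors],
    axis=0,
)
```

Note how β acts as the volume knob on the boost: turning it up preserves more of each specialist's fine detail, at the risk of over-amplifying noise, which is exactly the trade-off discussed in the results below.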

The Results: What Did They Find?

  1. Better than Retraining: The merged model performed better than training a single model on all the data at once (Full Fine-Tuning) for European Portuguese.
  2. No "Catastrophic Forgetting": Unlike the old method, which made the model forget English and other languages, the merged model kept its multilingual skills intact.
  3. The Trade-off: There is a balancing act. If you push the model too hard to be perfect at European Portuguese (by boosting the signals too much), it might get slightly worse at other languages. But the authors found a "sweet spot" (using a parameter called β) where the model is excellent at the target language while staying robust elsewhere.

The Takeaway

This paper proves that you don't need to retrain massive AI models every time you want to add a new language or dialect. Instead, you can train small, specialized versions and merge them like a high-tech recipe book.

They even built a new tool called MergeWhisper (like a blender specifically for these speech models) to help other researchers do this easily.

In short: They found a way to combine the best of many specialized experts into one "Super-Expert" who is smarter, more versatile, and easier to manage than the old way of doing things.