Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

Imagine you have a brilliant, multi-talented chef (the Large Speech Foundation Model, like Whisper). This chef is amazing at cooking general meals and understands many languages. However, they aren't perfect at every specific dish. If you want them to master Portuguese cooking, you usually have to send them to a specialized culinary school for a long time.

Here's the problem: If you want them to master European Portuguese, Brazilian Portuguese, African Portuguese, and children's speech all at once, you can't just send them to four different schools at the same time.

The Old Way (Full Fine-Tuning): You try to train one chef to do everything at once. They get good at European Portuguese, but they might forget how to cook Brazilian dishes or English ones. Plus, if you want to add a new specialty later, you have to retrain the whole chef from scratch, which is expensive and slow.
The Messy Alternative: You hire four different chefs, each trained for one specific style. Now you have a team, but managing four different people for every single order is a logistical nightmare. You have to figure out which chef to call for every request.

The Solution: Model Merging (The "Recipe Fusion")
This paper explores a clever trick called Model Merging. Instead of retraining or hiring new chefs, we take the four specialized chefs (who are already experts in their specific domains) and merge their brains into one single, super-chef.

The goal is to create one model that is as good as the specialists at their specific jobs but still remembers how to cook general meals (multilingual capabilities).

The Experiment: The "European Portuguese" Kitchen

The researchers tested this idea on 10 different European Portuguese dialects and scenarios (like news, elderly speech, children, and broadcast news). They took a base model, trained it separately on each of these 10 scenarios, and then tried to mix them back together using 11 different "mixing algorithms."

Think of these algorithms as different ways to blend smoothies:

Simple Blending (Averaging): Just throw all the ingredients in and blend. (Good, but maybe not perfect).
Smart Blending (Task Vectors): Instead of mixing the whole brain, they look at the changes made during training and try to combine just those changes carefully.
The New Secret Sauce (BoostedTSV-M): This is the paper's big invention.

The Problem with Blending: "Rank Collapse"

When you mix these specialized brains, something weird happens. The "specialized signals" (the unique details that make a chef good at elderly speech) get drowned out by the "general noise." It's like trying to hear a whisper in a loud room; the whisper gets lost. In technical terms, this is called Rank Collapse. The model becomes too simple and forgets the specific details.

The Innovation: BoostedTSV-M

The authors created a new method called BoostedTSV-M.

The Analogy: Imagine you are mixing a cocktail. Usually, the strong flavors (like the main spirit) overpower the subtle notes (like a hint of vanilla).
The Fix: BoostedTSV-M acts like a "flavor booster." It identifies those subtle, weak notes (small singular values) that are about to get lost and amplifies them before mixing.
The Result: The final cocktail retains the complexity of all the ingredients. The merged model doesn't just know "Portuguese"; it knows exactly how to handle the specific quirks of European Portuguese without losing its ability to speak English or other languages.

The Results: What Did They Find?

Better than Retraining: The merged model performed better than training a single model on all the data at once (Full Fine-Tuning) for European Portuguese.
No "Catastrophic Forgetting": Unlike the old method, which made the model forget English and other languages, the merged model kept its multilingual skills intact.
The Trade-off: There is a balancing act. If you push the model too hard to be perfect at European Portuguese (by boosting the signals too much), it might get slightly worse at other languages. But the authors found a "sweet spot" (using a parameter called $\beta$ ) where the model is excellent at the target language while staying robust elsewhere.

The Takeaway

This paper proves that you don't need to retrain massive AI models every time you want to add a new language or dialect. Instead, you can train small, specialized versions and merge them like a high-tech recipe book.

They even built a new tool called MergeWhisper (like a blender specifically for these speech models) to help other researchers do this easily.

In short: They found a way to combine the best of many specialized experts into one "Super-Expert" who is smarter, more versatile, and easier to manage than the old way of doing things.

Here is a detailed technical summary of the paper "Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR."

1. Problem Statement

Large Speech Foundation Models (LSFMs), such as Whisper, have become the standard for Automatic Speech Recognition (ASR). However, adapting these models to specific domains (e.g., medical, legal, specific dialects) typically requires fine-tuning. This creates two major challenges:

Fragmentation: Fine-tuning separately for each domain results in multiple specialized checkpoints, complicating deployment and maintenance (requiring domain detection and model switching at inference).
Computational & Data Constraints: Jointly fine-tuning a single model on all available datasets is often impractical due to privacy restrictions (data unavailability), storage limits, and the high computational cost of re-running full fine-tuning every time new data arrives.
Catastrophic Forgetting: Continual learning approaches often suffer from forgetting previous tasks or require complex mechanisms like replay buffers.

The paper investigates Model Merging as a scalable alternative. The goal is to combine multiple independently fine-tuned domain-specific models into a single unified model without retraining, while preserving both in-domain accuracy and out-of-distribution (OOD) generalization.

2. Methodology

Experimental Setup

Base Model: Whisper Large-v3 (WhisperLv3).
Datasets:
- In-Domain (ID): 10 European Portuguese (EP) corpora (~350 hours of speech) covering diverse speakers and conditions.
- Out-of-Distribution (OOD):
  - Other Portuguese varieties: African/Asian Portuguese (AAP) and Brazilian Portuguese (BP).
  - Cross-lingual: English (OpenASR-HF) and 21 languages (FLEURS).
Baselines:
- Zero-shot: The original foundation model.
- Full Fine-Tuning (Full-FT): Jointly trained on all 10 EP datasets.
- Individual Fine-Tuning (ID-FT): Separate models for each domain (serves as an upper bound for ID performance).

Merging Algorithms Evaluated

The authors benchmarked 11 merging algorithms categorized into three paradigms:

Parameter-Space (PS): Direct averaging of weights (e.g., Model Soups, Karcher Mean, Model Stock).
$\tau$ -Space ( $\tau$ Spa): Merging task vectors (difference between fine-tuned and base weights) (e.g., Task Arithmetic, TIES, PCB, SCE).
$\tau$ -Subspace ( $\tau$ Sub): Merging in low-rank subspaces using Singular Value Decomposition (SVD) (e.g., TSV-M, ISO-C, ISO-CTS).

Proposed Method: BoostedTSV-M

The authors identified that standard TSV-M (Task Singular Vectors Merging) suffers from rank collapse, where small singular values (carrying specific task signals) are suppressed during truncation and concatenation. They proposed BoostedTSV-M to address this:

Singular-Value Boosting: Before concatenation, the algorithm calculates the cumulative energy of singular values. It identifies a threshold index $s^*$ and "boosts" all singular values below this threshold to the value of $\sigma(s^*)$ . This prevents the suppression of task-specific information carried by smaller singular values.
Numerical Stability: The authors replaced the standard Orthogonal Procrustes step (used in TSV-M and ISO-CTS) with Newton–Schulz orthogonalization to prevent numerical instability when retaining high-rank percentages.

Tooling

The authors introduced MergeWhisper, an extension of the mergekit toolkit, adding native support for Whisper models and implementing all evaluated merging strategies.

3. Key Contributions

Comprehensive Benchmark: The first systematic evaluation of 11 model merging algorithms for multi-domain ASR adaptation across 10 European Portuguese domains, covering ID accuracy, OOD robustness, and cross-lingual performance.
BoostedTSV-M: A novel algorithm that mitigates rank collapse via singular-value boosting, achieving state-of-the-art performance among merging methods for the target domain.
MergeWhisper Toolkit: A new open-source toolkit enabling Whisper-compatible model merging, facilitating future research in ASR.
Empirical Analysis of Trade-offs: A detailed investigation into the trade-off between target-domain specialization and cross-lingual robustness.

4. Results

Performance on European Portuguese (EP)

BoostedTSV-M achieved the best overall EP performance (11.55% WER), slightly outperforming Full Fine-Tuning (11.58% WER) with statistical significance ( $p < 0.001$ ).
$\tau$ -Subspace methods (specifically TSV-M variants) generally outperformed $\tau$ -Space and Parameter-Space methods on EP ID and EP OOD data.
Full Fine-Tuning significantly improved EP performance over zero-shot but degraded performance on non-EP OOD data (except AAP), indicating catastrophic forgetting of multilingual capabilities.

Out-of-Distribution (OOD) & Cross-Lingual Robustness

Preservation of Generalization: Unlike Full Fine-Tuning, most merging methods preserved (and in some cases improved) performance on non-EP data (AAP, BP, English, FLEURS).
The Trade-off:
- Parameter-Space (PS) methods (e.g., Karcher Mean, Model Stock) yielded the best results for non-EP OOD (e.g., Model Stock achieved the lowest FLEURS error rate).
- BoostedTSV-M achieved the best EP performance but showed a slight degradation in non-EP OOD (BP, OpenASR-HF, FLEURS) compared to standard TSV-M. This confirms a trade-off: boosting task-specific singular values improves target-domain accuracy at the cost of shared structure needed for OOD transfer.
Surprising Finding: Several merged models outperformed the WhisperLv3-X baseline on FLEURS (21 languages), suggesting that merging models trained on diverse acoustic conditions can improve robustness even for unrelated languages.

Specific Findings on BoostedTSV-M

Hyperparameter $\beta$ : An ablation study showed that lower $\beta$ values (more aggressive boosting) improved ID performance but degraded OOD performance, validating the design choice of trading shared structure for task specificity.
Stability: The substitution of Orthogonal Procrustes with Newton–Schulz orthogonalization allowed for higher rank retention and stable convergence, which was previously unattainable with the original TSV-M implementation in this setting.

5. Significance and Conclusion

This work demonstrates that Model Merging is a viable, computationally efficient alternative to Full Fine-Tuning for multi-domain ASR.

Practical Impact: It enables the deployment of a single unified model that achieves state-of-the-art accuracy on specific target domains (European Portuguese) while maintaining robust generalization across different languages and dialects, avoiding the "model sprawl" of maintaining separate checkpoints.
Scientific Contribution: It highlights the critical trade-off between specialization and generalization in merging. While Full Fine-Tuning maximizes domain accuracy, it risks catastrophic forgetting; merging offers a "sweet spot" where domain performance is near-optimal without sacrificing multilingual robustness.
Future Directions: The proposed BoostedTSV-M and the MergeWhisper toolkit provide a foundation for scalable, continual adaptation of speech foundation models in resource-constrained or privacy-sensitive environments.