AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

AdaRank is a novel model merging framework that adaptively prunes detrimental singular components of task vectors at test time via entropy minimization. By doing so, it mitigates cross-task interference and achieves state-of-the-art performance across various backbones and task configurations.

Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong

Published 2026-03-03

Imagine you have a team of specialist chefs. One is a master of Italian pasta, another is a sushi expert, and a third is a pastry genius. Each has spent years perfecting their craft.

Now, imagine you want to open a restaurant that serves all three cuisines perfectly, but you only have one kitchen and one head chef to run it. You can't hire three separate chefs (that's too expensive and takes up too much space). So, you try to merge their knowledge into one person.

The Problem: The "Clash of Styles"

In the world of AI, this is called Model Merging. We take three different AI models (the chefs) and try to combine their "weights" (their knowledge) into a single model.

The old way of doing this was like asking the chefs to just average out their recipes.

  • Chef A says: "Put 10 cups of flour in the pasta."
  • Chef B says: "Put 2 cups of flour in the sushi."
  • The Average: "Put 6 cups of flour in everything."

Result: The pasta is dry, and the sushi is a floury mess. The models interfere with each other. The AI gets confused, and performance drops.
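In code, the "just average the recipes" baseline really is this simple. The following is a toy sketch with made-up scalar weights, purely to illustrate why averaging clashes, not any paper's implementation:

```python
import numpy as np

# Naive merging: elementwise average of each specialist's weights.
# Each "chef" is a weight matrix; here a single scalar for clarity.
pasta_weights = np.array([[10.0]])  # Chef A: "10 cups of flour"
sushi_weights = np.array([[2.0]])   # Chef B: "2 cups of flour"

merged = (pasta_weights + sushi_weights) / 2
print(merged)  # [[6.]] -> wrong for both dishes
```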

The Previous Fix: The "Top-10% Rule"

Researchers realized that instead of averaging everything, they should look at the "ingredients" of the knowledge. They used a mathematical tool called SVD (Singular Value Decomposition) to break the knowledge down into layers of importance.

They decided to keep only the top 10% most important ingredients (the "Top-K" rule) and throw the rest away, hoping this would reduce the noise.
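The Top-K rule can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code; the function name `topk_truncate` and the matrix shapes are assumptions:

```python
import numpy as np

def topk_truncate(task_vector: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k singular components with the largest singular values."""
    U, S, Vt = np.linalg.svd(task_vector, full_matrices=False)
    S_pruned = np.zeros_like(S)
    S_pruned[:k] = S[:k]  # SVD returns singular values sorted largest-first
    return U @ np.diag(S_pruned) @ Vt

rng = np.random.default_rng(0)
delta = rng.standard_normal((8, 8))   # a "task vector": finetuned minus pretrained weights
low_rank = topk_truncate(delta, k=2)  # fixed rule: keep the top 2, discard the rest
print(np.linalg.matrix_rank(low_rank))
```

Note that the rule is blind: it keeps the top components no matter which task they help or hurt.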

The Flaw: This is like a rigid rulebook that says, "Always keep the top 10% of ingredients for every dish."

  • Issue 1: Sometimes, the "most important" ingredient for the pasta chef (like a specific spice) is actually terrible for the sushi chef. Keeping it causes a flavor clash.
  • Issue 2: Some dishes are simple (like plain rice) and only need a few ingredients. Others are complex (like a multi-layer cake) and need many ingredients. A fixed "Top 10%" rule treats a simple dish and a complex cake exactly the same, which is inefficient.

The Solution: AdaRank (The Smart Sous-Chef)

The authors of this paper propose AdaRank. Think of AdaRank as a smart, adaptive Sous-Chef who doesn't follow a rigid rulebook. Instead, they taste the food and adjust the recipe in real-time.

Here is how AdaRank works, using our kitchen analogy:

1. The "Binary Mask" (The Ingredient Switch)

Instead of just keeping the "Top 10%" of ingredients, AdaRank gives the chef a switch for every single ingredient.

  • Switch ON: Keep this ingredient.
  • Switch OFF: Throw this ingredient away.

Crucially, the chef can decide to turn OFF a "top" ingredient if it causes a clash, and turn ON a "bottom" (less obvious) ingredient if it helps a specific dish. It's not about rank; it's about what actually works.
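The per-ingredient switch corresponds to a binary mask over singular components. A minimal sketch, assuming a task-vector matrix decomposed by SVD (`masked_reconstruct` is an invented helper name):

```python
import numpy as np

def masked_reconstruct(task_vector: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Rebuild a weight delta keeping only singular components where mask == 1."""
    U, S, Vt = np.linalg.svd(task_vector, full_matrices=False)
    return U @ np.diag(S * mask) @ Vt

rng = np.random.default_rng(1)
delta = rng.standard_normal((6, 6))

# Top-K is one special case of the mask (k=2 would be [1,1,0,0,0,0]),
# but the mask can also turn OFF a "top" component and keep a "bottom" one:
mask = np.array([1, 0, 1, 0, 0, 1], dtype=float)  # 2nd switch OFF, 6th switch ON
pruned = masked_reconstruct(delta, mask)
print(pruned.shape)
```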

2. "Test-Time Adaptation" (The Tasting Session)

How does the chef know which switches to flip? They don't have the recipe book (training data) anymore. Instead, they use Test-Time Adaptation.

Imagine the chef is about to serve the food to customers (the test data). They don't know what each customer actually ordered (the labels), but they can see whether the customers look happy or unhappy.

  • The chef tries a combination of ingredients.
  • If the customers look confused (high "entropy" or uncertainty), the chef knows, "Oops, that ingredient is causing a clash."
  • The chef flips a switch, removes the bad ingredient, or adds a missing one.
  • They repeat this until the customers are smiling (low entropy).

This happens automatically at test time, using only unlabeled inputs; no labels and no access to the original training data are required.
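The tasting loop can be sketched as a search over mask bits that minimizes prediction entropy on unlabeled test inputs. The paper optimizes the masks with gradient-based test-time adaptation; this toy uses a greedy flip search purely to keep the example dependency-free, and every name and shape here is illustrative:

```python
import numpy as np

def entropy(logits: np.ndarray) -> float:
    """Average softmax entropy: high = confused customers, low = confident."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def merged_weights(base, svd_parts, masks):
    """Merged model = pretrained weights + each task vector's masked SVD."""
    W = base.copy()
    for (U, S, Vt), m in zip(svd_parts, masks):
        W += U @ np.diag(S * m) @ Vt
    return W

rng = np.random.default_rng(2)
base = rng.standard_normal((4, 3)) * 0.1                  # "pretrained" weights
deltas = [rng.standard_normal((4, 3)) for _ in range(2)]  # two task vectors
svd_parts = [np.linalg.svd(d, full_matrices=False) for d in deltas]
masks = [np.ones(3) for _ in svd_parts]                   # every switch starts ON
X = rng.standard_normal((32, 4))                          # unlabeled test batch

best = entropy(X @ merged_weights(base, svd_parts, masks))
for m in masks:                       # for each chef's switchboard...
    for i in range(len(m)):
        m[i] = 1.0 - m[i]             # try flipping one switch
        e = entropy(X @ merged_weights(base, svd_parts, masks))
        if e < best:
            best = e                  # keep the flip: customers look happier
        else:
            m[i] = 1.0 - m[i]         # revert: the flip caused a clash
print(round(best, 3))
```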

Why is this a Big Deal?

  1. It's Flexible: It realizes that the "Pasta" task needs a different set of ingredients than the "Sushi" task. It doesn't force a one-size-fits-all solution.
  2. It's Efficient: It doesn't need to keep three separate kitchens (three separate models). It fits everything into one kitchen, but the kitchen is now organized perfectly for every dish.
  3. It's Smarter: By looking at the actual result (the customer's reaction) rather than a pre-set rule (Top 10%), it avoids the "clashes" that ruin the meal.

The Bottom Line

AdaRank is like upgrading from a rigid, rule-following robot chef to a taste-sensitive, adaptive master chef. It looks at the specific needs of every task, prunes the ingredients that cause trouble, and keeps the ones that help, resulting in a single AI model that is almost as good as having three separate experts, but without the extra cost or space.

In the paper's experiments, this method made merged AI models perform significantly better, closing the gap between a "merged" model and a "perfectly trained" individual model, all while using the same amount of memory as a single model.