BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

This paper introduces BD-Merging, a bias-aware unsupervised model merging framework that leverages a joint evidential head and an Adjacency Discrepancy Score to guide contrastive learning, thereby adaptively refining merged representations and mitigating performance degradation caused by test-time distribution shifts.

Yuhan Xie, Chen Lyu

Published 2026-03-05

Imagine you have a team of eight different experts. One is a master of identifying cars, another knows everything about traffic signs, a third is an expert on satellite images, and so on. Each of them has studied hard and is brilliant at their specific job.

The Problem: The "Blind Merge"
Now, imagine you want to combine all these experts into a single "Super-Expert" who can do all their jobs at once. This is called Model Merging.

Usually, when we combine them, we simply average their knowledge — in practice, averaging the weights of the eight models, parameter by parameter. It's like asking all eight experts to vote on an answer and picking the majority. This works great in a quiet classroom where everyone is calm and the questions are exactly what they studied.
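That "blind" averaging can be sketched in a few lines. This is a minimal illustration of plain weight averaging, not the paper's method; the `average_merge` helper and the toy scalar "parameters" standing in for real tensors are hypothetical.

```python
def average_merge(state_dicts):
    """Merge expert models by averaging each shared parameter, one by one."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(sd[name] for sd in state_dicts) / len(state_dicts)
    return merged

# Toy example: two "experts" with a single scalar parameter each.
experts = [
    {"layer.weight": 1.0},  # e.g. the car expert
    {"layer.weight": 3.0},  # e.g. the traffic-sign expert
]
print(average_merge(experts))  # {'layer.weight': 2.0}
```

Notice that the same fixed average is used for every input — which is exactly the rigidity the paper sets out to fix.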

But in the real world, things get messy.

  • The "Noise": Maybe the car expert is looking at a blurry photo taken in the rain (sensor noise).
  • The "Surprise": Maybe the traffic sign expert is asked to identify a type of sign they've never seen before (unseen tasks).

When this happens, the "Super-Expert" gets confused. Because the old methods assume everything is perfect and clean, the Super-Expert starts making bad guesses, getting biased, and failing to adapt. It's like a GPS that works perfectly on a sunny day but gets lost the moment it starts raining.

The Solution: BD-Merging (The "Smart Detective")
The paper introduces BD-Merging, a new way to combine these experts that acts like a smart, bias-aware detective. Instead of blindly averaging everyone's opinion, it uses three clever tricks to stay reliable even when the world gets messy.

1. The "Uncertainty Meter" (Joint Evidential Head)

Imagine every time the Super-Expert looks at a picture, they don't just say, "That's a car!" They also have a little internal meter that says, "I'm 90% sure, but it's a bit blurry, so I'm a little nervous."

BD-Merging adds a special tool called a Joint Evidential Head. This tool measures how sure the model is about its answer.

  • If the model is confident, the meter stays low.
  • If the image is blurry or weird, the meter goes high, signaling, "Hey, something is off here!"

This helps the model realize when it's looking at "corrupted" data (like a foggy photo) versus a normal one.
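The "uncertainty meter" idea can be sketched in the style of Dirichlet-based evidential deep learning. The paper's Joint Evidential Head is not spelled out here, so the exact parameterization below (softplus evidence, uncertainty `u = K / sum(alpha)`) is an assumption, and `evidential_uncertainty` is a hypothetical name.

```python
import math

def evidential_uncertainty(logits):
    """Map raw logits to class probabilities plus a [0, 1] uncertainty score.

    evidence_k = softplus(logit_k) >= 0
    alpha_k    = evidence_k + 1      (Dirichlet parameters)
    u          = K / sum(alpha)      (leftover "I don't know" mass)
    """
    evidence = [math.log1p(math.exp(z)) for z in logits]  # softplus
    alpha = [e + 1.0 for e in evidence]
    total = sum(alpha)
    probs = [a / total for a in alpha]
    u = len(alpha) / total
    return probs, u

# One dominant logit (a clear image) leaves little uncertainty mass;
# flat logits (a foggy, ambiguous image) leave a lot.
_, u_clear = evidential_uncertainty([8.0, 0.0, 0.0])
_, u_blurry = evidential_uncertainty([0.1, 0.1, 0.1])
print(u_clear < u_blurry)  # True
```

The key design choice: instead of forcing a probability onto every input, the model is allowed to keep some belief mass as "I'm not sure", and that mass becomes the meter reading.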

2. The "Neighbor Check" (Adjacency Discrepancy Score)

Next, the detective looks at the neighbors. Imagine you are in a crowd. If everyone around you is calm and agreeing on what they see, you probably feel safe. But if you see a group of people arguing or looking confused, you know something is wrong.

BD-Merging uses a score called ADS (Adjacency Discrepancy Score) to check the "vibe" of nearby data points.

  • The Good Neighbors: If the model sees a clear car, and its "neighbors" (similar images) also agree it's a car, the score is low. Everything is aligned.
  • The Bad Neighbors: If the model sees a blurry mess, and its neighbors are confused or disagreeing, the score goes high. This tells the system: "Stop! This data is suspicious. Don't trust the usual rules."
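A neighbor check in this spirit can be sketched as follows. The paper's exact ADS formula is not reproduced here; this hypothetical `adjacency_discrepancy` simply measures how far a sample's predicted distribution sits from those of its k nearest neighbors in feature space.

```python
import math

def adjacency_discrepancy(features, probs, idx, k=3):
    """Average L1 gap between sample idx's prediction and its k neighbors'."""
    # Rank all other samples by distance in feature space.
    dists = [
        (math.dist(features[idx], f), j)
        for j, f in enumerate(features) if j != idx
    ]
    neighbors = [j for _, j in sorted(dists)[:k]]
    # Compare predicted distributions against each neighbor's.
    gaps = [
        sum(abs(a - b) for a, b in zip(probs[idx], probs[j]))
        for j in neighbors
    ]
    return sum(gaps) / len(gaps)

# Three tightly clustered samples that agree, plus one outlier that disagrees.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
probs = [(0.9, 0.1), (0.9, 0.1), (0.85, 0.15), (0.5, 0.5)]
print(adjacency_discrepancy(features, probs, 0, k=2))  # low: good neighbors
print(adjacency_discrepancy(features, probs, 3, k=2))  # high: bad neighbors
```

A low score means the sample and its neighborhood tell a consistent story; a high score flags the "arguing crowd" the analogy describes.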

3. The "Smart Switchboard" (Debiased Router)

This is the most important part. In the old days, the Super-Expert used the same mix of knowledge for everyone. If you showed them a blurry car, they used the same "car knowledge" as if it were a crystal-clear photo.

BD-Merging introduces a Debiased Router. Think of this as a smart switchboard operator.

  • When a clean, clear image comes in, the operator says, "Okay, let's use the standard car expert's knowledge."
  • When a blurry, noisy, or weird image comes in, the operator sees the high "Uncertainty Meter" and the "Bad Neighbor" score. They immediately flip a switch: "Okay, this is tricky. Let's dial down the car expert's confidence and mix in some general knowledge to be safer."

It dynamically changes the recipe for every single image, ensuring the model doesn't get tricked by bad data.
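The switchboard behavior can be sketched with a simple gating rule. The combination rule below is an illustrative assumption (the paper's router is learned, not hand-coded): when the uncertainty meter `u` and neighbor score `ads` rise, the hypothetical `route_weights` shrinks the specialist's share and blends toward an even mix of all experts.

```python
def route_weights(base_weights, u, ads, tau=1.0):
    """Interpolate between specialist routing and a uniform expert mix.

    risk combines the two danger signals into [0, 1]; higher risk means
    less trust in the routing learned on clean data.
    """
    risk = min(1.0, tau * (u + ads) / 2.0)
    uniform = 1.0 / len(base_weights)
    return [(1 - risk) * w + risk * uniform for w in base_weights]

# Clean image: low risk, the car expert keeps almost all the weight.
print(route_weights([0.9, 0.05, 0.05], u=0.1, ads=0.1))
# Corrupted image: high risk, the recipe flattens toward general knowledge.
print(route_weights([0.9, 0.05, 0.05], u=0.9, ads=0.9))
```

Because the interpolation is convex, the routing weights still sum to 1 for every input; only the balance between "trust the specialist" and "play it safe" changes per image.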

The Result: A Super-Expert for the Real World

The paper tested this on many different tasks (like identifying cars, traffic signs, and satellite images) and added "noise" like fog, blur, and pixelation to simulate real-world problems.

  • Old Methods: When the data got messy, their accuracy dropped like a stone. They got confused and biased.
  • BD-Merging: It stayed steady. Because it knew when to be confident and when to be cautious, it handled the messy data much better. It was almost as good as having eight separate experts, but it only needed one combined model.

In a Nutshell:
BD-Merging is like upgrading a team of experts from a rigid committee that always votes the same way, into a flexible team that knows when to trust their training and when to pause and double-check because the situation looks suspicious. It makes AI safer and more reliable for the messy, unpredictable real world.
