DC-Merge: Improving Model Merging with Directional Consistency

DC-Merge is a novel model merging method that first balances the energy distribution of task vectors through singular value smoothing, then aligns their directional geometries by projecting them onto a shared orthogonal subspace. Preserving multi-task knowledge this way yields state-of-the-art performance.

Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei

Published 2026-03-09

Imagine you have a team of eight different experts. One is a master of spotting cars, another is a genius at identifying flowers, and a third is an expert in reading German traffic signs. Each expert has been trained specifically for their job.

Now, you want to create a "Super-Expert" who knows everything about all eight topics at once. You don't want to retrain them from scratch (which is expensive and slow). Instead, you want to merge their brains into one model.

This is what Model Merging does. But here's the problem: if you just take the "brain" of the car expert and the "brain" of the flower expert and smash them together, the result is often a confused mess. The Super-Expert might forget how to spot cars or start calling flowers "cars."

The paper DC-Merge proposes a smarter way to do this. Here is the simple explanation of their idea, using some creative analogies.

The Problem: The "Loud Voice" and the "Wrong Map"

The authors discovered two main reasons why simple merging fails:

1. The "Loud Voice" Problem (Imbalanced Energy)
Imagine the Car Expert's brain is a library. 90% of the books in this library are about "Red Sports Cars." Only a few books are about "Vintage Trucks" or "Electric Bikes."

  • The Issue: When you merge this brain with others, the "Red Sports Cars" section is so loud and dominant that it drowns out the quiet, important details about the other vehicles. The model becomes obsessed with the most common patterns and ignores the subtle, important ones.
  • The DC-Merge Fix: They use a technique called Energy Smoothing. Imagine a sound engineer turning down the volume of the "Red Sports Cars" section and turning up the volume of the "Vintage Trucks" section. Now, every part of the expert's knowledge gets a fair chance to be heard. No single topic dominates the conversation.
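In weight-matrix terms, "turning down the loud section" means flattening the singular-value spectrum of a task vector (the difference between an expert's weights and the base model's). Here is a minimal sketch of one way to do that: interpolate each singular value toward the spectrum's mean. The interpolation parameter `alpha` and the norm-preserving rescale are illustrative choices, not the paper's exact smoothing schedule.

```python
import numpy as np

def smooth_energy(task_vector, alpha=0.5):
    """Flatten the singular-value spectrum of a 2-D task-vector matrix.

    alpha=0 leaves the spectrum unchanged; alpha=1 makes it uniform,
    so every direction gets equal 'volume'. (Illustrative scheme, not
    necessarily the paper's exact formulation.)
    """
    U, s, Vt = np.linalg.svd(task_vector, full_matrices=False)
    # Interpolate each singular value toward the spectrum's mean ...
    s_smoothed = (1 - alpha) * s + alpha * s.mean()
    # ... then rescale so the total energy (Frobenius norm) is preserved.
    s_smoothed *= np.linalg.norm(s) / np.linalg.norm(s_smoothed)
    return U @ np.diag(s_smoothed) @ Vt
```

Because the dominant singular directions are damped and the weak ones amplified, no single pattern can drown out the rest after merging.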

2. The "Wrong Map" Problem (Geometric Inconsistency)
Imagine the Car Expert thinks in a 3D space where "Up" means "Fast" and "Left" means "Slow." The Flower Expert thinks in a different 3D space where "Up" means "Colorful" and "Left" means "Fragrant."

  • The Issue: If you try to merge them directly, you are trying to combine two maps that use different directions. "Up" on one map doesn't match "Up" on the other. The result is a distorted, confusing map where directions get twisted.
  • The DC-Merge Fix: They use a technique called Cover Space Merging. Before merging, they build a Universal Translator (a shared "Cover Space"). They translate the Car Expert's "Up" and the Flower Expert's "Up" into a new, neutral language where everyone agrees on what "Up" means. They merge the ideas in this neutral space, ensuring the directions stay true, and then translate the result back.
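The "universal translator" idea can be sketched as building one shared orthonormal basis that covers every expert's directions, expressing each expert in that basis, averaging there, and mapping back. The construction below (left singular vectors of the concatenated task vectors, plain averaging, an optional `rank` cutoff) is a hypothetical stand-in for the paper's cover space, chosen for simplicity.

```python
import numpy as np

def cover_space_merge(task_vectors, rank=None):
    """Merge task-vector matrices inside a shared orthogonal 'cover space'.

    Sketch: the shared basis comes from the left singular vectors of
    the concatenated task vectors; each expert is projected into that
    basis, averaged, and translated back. (Illustrative construction.)
    """
    stacked = np.concatenate(task_vectors, axis=1)   # (d, k * n_tasks)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    if rank is not None:
        U = U[:, :rank]                              # shared "cover" basis
    # Translate each expert into the neutral space and average there.
    coords = [U.T @ tv for tv in task_vectors]
    merged_coords = np.mean(coords, axis=0)
    return U @ merged_coords                         # translate back
```

Because every expert is expressed in the same orthonormal basis before averaging, "Up" means the same thing for all of them, and the merged directions are not twisted against each other.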

The Solution: DC-Merge in Action

The authors call their method DC-Merge (Directional Consistency Merge). Here is the step-by-step process:

  1. Level the Playing Field: First, they take each expert's knowledge and "smooth out" the volume. They make sure the quiet, important details aren't drowned out by the loud, obvious ones.
  2. Build the Universal Translator: They create a shared, neutral space (the Cover Space) where all the experts' directions align perfectly.
  3. Merge in Neutral Territory: They combine the experts' knowledge inside this neutral space. Because everyone is speaking the same "directional language," the ideas blend smoothly without twisting or breaking.
  4. Translate Back: Finally, they take this perfectly blended Super-Expert and translate it back into the original format so it can be used.
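The four steps above slot into the standard task-vector recipe: subtract the base model from each expert to get task vectors, process and merge them, then add the scaled result back to the base. Below is a minimal skeleton of that recipe; the plain average is a placeholder for where DC-Merge's smoothing and cover-space steps would go, and `scale` is an illustrative merging coefficient, not a value from the paper.

```python
import numpy as np

def merge_checkpoints(base_weights, expert_weights, scale=0.3):
    """Task-arithmetic skeleton for merging fine-tuned experts.

    Each expert minus the base gives a task vector; the blended task
    vector is added back to the base. The plain mean here is a
    stand-in for DC-Merge's energy smoothing + cover-space merge.
    """
    task_vectors = [w - base_weights for w in expert_weights]
    merged_tv = np.mean(task_vectors, axis=0)  # placeholder merge step
    return base_weights + scale * merged_tv
```

For example, with a zero base and two experts at `1` and `3` (elementwise) and `scale=0.5`, the merged weights land at `1`.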

Why It Matters

The paper shows that by keeping the directions of the knowledge consistent (making sure "Fast" still means "Fast" and "Colorful" still means "Colorful" after the merge), the new model performs significantly better.

  • The Result: The new Super-Expert doesn't just know a little bit about everything; it retains the deep, specific skills of each original expert.
  • The Proof: They tested this on vision tasks (like recognizing images) and even on huge AI models that understand both images and text (like LLaVA). In almost every test, DC-Merge beat the previous best methods, creating a smarter, more versatile AI without needing extra training data.

The Big Picture Analogy

Think of model merging like making a smoothie.

  • Old Way: You throw a whole watermelon, a whole strawberry, and a whole banana into a blender. The watermelon juice (the "loud voice") takes over, and you barely taste the strawberry or banana. Plus, if you blend them in the wrong order, the texture gets weird (the "wrong map").
  • DC-Merge Way: First, you slice the watermelon and mix it with a little strawberry juice so the flavors are balanced (Energy Smoothing). Then, you blend them in a special container that ensures the fruit fibers align perfectly so the texture is smooth (Cover Space Merging). The result? A smoothie where you can taste every fruit perfectly.

In short: DC-Merge teaches AI how to listen to all its experts equally and speak the same language, resulting in a smarter, more capable model.