OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

This paper introduces a comprehensive benchmark for Multimodal LLM (MLLM) merging, proposes a noise-removal and interaction-based optimization algorithm that improves merged performance by 2.48%, and shows that merging diverse modality-specific models yields strong Omni-language models without requiring any additional training data.

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

Published 2026-03-04

Imagine you have a team of brilliant specialists. You have a Geometry Genius who can solve complex math problems, an OCR Expert who can read any handwritten note, a Chart Analyst who understands graphs, and a Grounding Pro who can point out exactly where an object is in a photo.

Right now, to use all these skills, you'd need to keep four different computers running, each loaded with one specialist's brain. This is expensive, slow, and takes up a lot of space.

The Problem:
In the world of AI, training a new "super-brain" from scratch is like building a skyscraper: it takes years, costs millions of dollars, and requires massive amounts of energy. Meanwhile, these specialists are constantly getting better on their own, but they are stuck in their own silos.

The Solution: OptMerge
This paper introduces a clever trick called OptMerge. Instead of building a new skyscraper, they take the existing brains of these specialists and merge them into a single, super-capable brain without needing any new training data.

Here is how they did it, explained with some everyday analogies:

1. The "Task Vector" (The Memory of Learning)

When a specialist learns a new skill, their brain changes slightly. In AI terms, these changes are called Task Vectors.

  • The Analogy: Imagine the base AI model is a blank notebook. When the "Geometry Expert" learns math, they scribble notes in the notebook. When the "OCR Expert" learns to read, they scribble different notes.
  • The Challenge: If you just take the "Geometry" notebook and the "OCR" notebook and tape them together, the scribbles might overlap, smudge, or cancel each other out. The result is a messy notebook where neither skill works well.
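The notebook analogy can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: toy weight matrices stand in for real model parameters, and the 0.5 averaging coefficient is the naive baseline the paper improves on.

```python
import numpy as np

rng = np.random.default_rng(0)

base = rng.normal(size=(4, 4))                         # the "blank notebook"
geometry_expert = base + 0.1 * rng.normal(size=(4, 4)) # fine-tuned on geometry
ocr_expert = base + 0.1 * rng.normal(size=(4, 4))      # fine-tuned on OCR

# A task vector is simply (fine-tuned weights - base weights):
# the "scribbles" each specialist added to the notebook.
tau_geometry = geometry_expert - base
tau_ocr = ocr_expert - base

# Naive merging "tapes the notebooks together" by averaging the
# task vectors and adding the result back onto the base model.
merged = base + 0.5 * (tau_geometry + tau_ocr)
```

Where the scribbles overlap with opposite signs, the averaged task vector partially cancels, which is exactly the interference problem described above.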

2. The "Noise" Problem

Previous methods tried to merge these notebooks by simply averaging the scribbles. But the paper found that these scribbles contain a lot of noise (random, useless marks) and redundancy (the same note written three times).

  • The Analogy: It's like trying to mix two smoothies. If one has a lot of ice chunks (noise) and the other has too much water (redundancy), the final drink tastes bad.

3. The OptMerge Magic (Straining the Smoothie)

The authors created a new method to clean up the mix before combining them.

  • Step 1: Filtering the Noise. They use a mathematical tool called Singular Value Decomposition (SVD) to act like a fine-mesh strainer. It separates the "good stuff" (the actual knowledge of geometry or reading) from the "bad stuff" (the noise).
  • Step 2: The Perfect Blend. They then mix the cleaned-up "knowledge" together.
  • The Result: They get a single model that is better than the average of its parts. In fact, this merged model often performs better than if you had tried to train a new model from scratch using all the data combined (which is called "Mixture Training").
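The "strainer" step can be sketched with a low-rank SVD truncation. This is only an illustration of the idea, assuming the knowledge in a task vector concentrates in its top singular components; the rank cutoff and the toy rank-1 "signal" are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def denoise(task_vector: np.ndarray, rank: int) -> np.ndarray:
    """Keep only the top `rank` singular components (the 'good stuff')."""
    U, S, Vt = np.linalg.svd(task_vector, full_matrices=False)
    return U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(1)
signal = np.outer(rng.normal(size=8), rng.normal(size=8))  # rank-1 "knowledge"
noisy = signal + 0.01 * rng.normal(size=(8, 8))            # plus random noise

cleaned = denoise(noisy, rank=1)
```

Because the noise is spread across all directions while the knowledge sits in one, the truncation discards most of the noise and `cleaned` lands closer to the true signal than the raw task vector did.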

4. Going Beyond Text and Images (The "Omni" Brain)

Most AI models today are either "Vision-Language" (seeing and talking) or "Audio-Language" (hearing and talking).

  • The Analogy: Imagine a person who can only see and speak, and another who can only hear and speak.
  • The Innovation: OptMerge can take the "Vision" brain and the "Audio" brain and merge them into an "Omni-Brain" that can see, hear, and speak all at once. They did this without needing to collect thousands of new videos with sound and pictures to train it. They just merged the existing brains!
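One way to picture the Omni merge is that the two models share an LLM backbone but carry different modality encoders. The sketch below is a hypothetical illustration of that structure, not the paper's method: the parameter names (`llm.w`, `vision_encoder.w`, `audio_encoder.w`), the toy values, and the 0.5 coefficient are all made up for clarity.

```python
import numpy as np

base = {"llm.w": np.zeros((2, 2))}  # shared backbone before fine-tuning
vision = {"llm.w": np.ones((2, 2)), "vision_encoder.w": np.ones((2, 2))}
audio = {"llm.w": 2 * np.ones((2, 2)), "audio_encoder.w": np.ones((2, 2))}

omni = {}
for k, w in vision.items():
    if k in audio:  # shared backbone: merge the two task vectors
        tau = (vision[k] - base[k]) + (audio[k] - base[k])
        omni[k] = base[k] + 0.5 * tau
    else:           # vision-specific parts are carried over unchanged
        omni[k] = w
for k, w in audio.items():
    omni.setdefault(k, w)  # audio-specific parts join the merged model
```

The result keeps both encoders intact while blending only the overlapping backbone, which is why no paired video-plus-sound training data is needed.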

Why This Matters

  • It's Free (Data-wise): You don't need to find new data to train the model. You just use the models that already exist.
  • It's Fast: Merging takes minutes or hours, not months.
  • It's Efficient: Instead of storing 10 different models, you only need to store 1.
  • It's Better: The merged model is often smarter than the individual experts because it combines their strengths.

The Bottom Line

Think of OptMerge as a master chef who can take the secret recipes from five different Michelin-star restaurants (the specialist models), clean up the ingredients, and combine them into one perfect, all-in-one dish. You get the best of all worlds without having to hire five different chefs or build five different kitchens.

The paper proves that by carefully cleaning and combining these AI "brains," we can build smarter, more versatile AI systems faster, cheaper, and with less energy than ever before.
