OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

This paper introduces a comprehensive benchmark for Multimodal LLM (MLLM) merging, proposes a noise-removal and interaction-based optimization algorithm that improves merged performance by 2.48%, and shows that merging diverse modality-specific models yields strong Omni-language models without requiring any additional training data.

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao

Published 2026-03-04

Imagine you have a team of brilliant specialists. You have a Geometry Genius who can solve complex math problems, an OCR Expert who can read any handwritten note, a Chart Analyst who understands graphs, and a Grounding Pro who can point out exactly where an object is in a photo.

Right now, to use all these skills, you'd need to keep four different computers running, each loaded with one specialist's brain. This is expensive, slow, and takes up a lot of space.

The Problem:
In the world of AI, training a new "super-brain" from scratch is like building a skyscraper: it takes years, costs millions of dollars, and requires massive amounts of energy. Meanwhile, these specialists are constantly getting better on their own, but they are stuck in their own silos.

The Solution: OptMerge
This paper introduces a clever trick called OptMerge. Instead of building a new skyscraper, they take the existing brains of these specialists and merge them into a single, super-capable brain without needing any new training data.

Here is how they did it, explained with some everyday analogies:

1. The "Task Vector" (The Memory of Learning)

When a specialist learns a new skill, their brain changes slightly. In AI terms, these changes are called Task Vectors.

  • The Analogy: Imagine the base AI model is a blank notebook. When the "Geometry Expert" learns math, they scribble notes in the notebook. When the "OCR Expert" learns to read, they scribble different notes.
  • The Challenge: If you just take the "Geometry" notebook and the "OCR" notebook and tape them together, the scribbles might overlap, smudge, or cancel each other out. The result is a messy notebook where neither skill works well.
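The notebook analogy can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: toy weight matrices stand in for real model parameters, and the 0.5 averaging coefficient is the naive baseline the paper improves on.

```python
import numpy as np

rng = np.random.default_rng(0)

base = rng.normal(size=(4, 4))                         # the "blank notebook"
geometry_expert = base + 0.1 * rng.normal(size=(4, 4)) # fine-tuned on geometry
ocr_expert = base + 0.1 * rng.normal(size=(4, 4))      # fine-tuned on OCR

# A task vector is simply (fine-tuned weights - base weights):
# the "scribbles" each specialist added to the notebook.
tau_geometry = geometry_expert - base
tau_ocr = ocr_expert - base

# Naive merging "tapes the notebooks together" by averaging the
# task vectors and adding the result back onto the base model.
merged = base + 0.5 * (tau_geometry + tau_ocr)
```

Where the scribbles overlap with opposite signs, the averaged task vector partially cancels, which is exactly the interference problem described above.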

2. The "Noise" Problem

Previous methods tried to merge these notebooks by simply averaging the scribbles. But the paper found that these scribbles contain a lot of noise (random, useless marks) and redundancy (the same note written three times).

  • The Analogy: It's like trying to mix two smoothies. If one has a lot of ice chunks (noise) and the other has too much water (redundancy), the final drink tastes bad.

3. The OptMerge Magic (Straining the Smoothie)

The authors created a new method to clean up the mix before combining them.

  • Step 1: Filtering the Noise. They use a mathematical tool called Singular Value Decomposition (SVD) to act like a fine-mesh strainer. It separates the "good stuff" (the actual knowledge of geometry or reading) from the "bad stuff" (the noise).
  • Step 2: The Perfect Blend. They then mix the cleaned-up "knowledge" together.
  • The Result: They get a single model that is better than the average of its parts. In fact, this merged model often performs better than if you had tried to train a new model from scratch using all the data combined (which is called "Mixture Training").
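The "strainer" step can be sketched with a low-rank SVD truncation. This is only an illustration of the idea, assuming the knowledge in a task vector concentrates in its top singular components; the rank cutoff and the toy rank-1 "signal" are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def denoise(task_vector: np.ndarray, rank: int) -> np.ndarray:
    """Keep only the top `rank` singular components (the 'good stuff')."""
    U, S, Vt = np.linalg.svd(task_vector, full_matrices=False)
    return U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(1)
signal = np.outer(rng.normal(size=8), rng.normal(size=8))  # rank-1 "knowledge"
noisy = signal + 0.01 * rng.normal(size=(8, 8))            # plus random noise

cleaned = denoise(noisy, rank=1)
```

Because the noise is spread across all directions while the knowledge sits in one, the truncation discards most of the noise and `cleaned` lands closer to the true signal than the raw task vector did.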

4. Going Beyond Text and Images (The "Omni" Brain)

Most AI models today are either "Vision-Language" (seeing and talking) or "Audio-Language" (hearing and talking).

  • The Analogy: Imagine a person who can only see and speak, and another who can only hear and speak.
  • The Innovation: OptMerge can take the "Vision" brain and the "Audio" brain and merge them into an "Omni-Brain" that can see, hear, and speak all at once. They did this without needing to collect thousands of new videos with sound and pictures to train it. They just merged the existing brains!
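One way to picture the Omni merge is that the two models share an LLM backbone but carry different modality encoders. The sketch below is a hypothetical illustration of that structure, not the paper's method: the parameter names (`llm.w`, `vision_encoder.w`, `audio_encoder.w`), the toy values, and the 0.5 coefficient are all made up for clarity.

```python
import numpy as np

base = {"llm.w": np.zeros((2, 2))}  # shared backbone before fine-tuning
vision = {"llm.w": np.ones((2, 2)), "vision_encoder.w": np.ones((2, 2))}
audio = {"llm.w": 2 * np.ones((2, 2)), "audio_encoder.w": np.ones((2, 2))}

omni = {}
for k, w in vision.items():
    if k in audio:  # shared backbone: merge the two task vectors
        tau = (vision[k] - base[k]) + (audio[k] - base[k])
        omni[k] = base[k] + 0.5 * tau
    else:           # vision-specific parts are carried over unchanged
        omni[k] = w
for k, w in audio.items():
    omni.setdefault(k, w)  # audio-specific parts join the merged model
```

The result keeps both encoders intact while blending only the overlapping backbone, which is why no paired video-plus-sound training data is needed.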

Why This Matters

  • It's Free (Data-wise): You don't need to find new data to train the model. You just use the models that already exist.
  • It's Fast: Merging takes minutes or hours, not months.
  • It's Efficient: Instead of storing 10 different models, you only need to store 1.
  • It's Better: The merged model is often smarter than the individual experts because it combines their strengths.

The Bottom Line

Think of OptMerge as a master chef who can take the secret recipes from five different Michelin-star restaurants (the specialist models), clean up the ingredients, and combine them into one perfect, all-in-one dish. You get the best of all worlds without having to hire five different chefs or build five different kitchens.

The paper proves that by carefully cleaning and combining these AI "brains," we can build smarter, more versatile AI systems faster, cheaper, and with less energy than ever before.
