MERGETUNE: Continued Fine-Tuning of Vision-Language Models

This paper introduces MERGETUNE, a model-agnostic continued fine-tuning strategy that leverages linear mode connectivity and a second-order surrogate to recover pretrained knowledge in vision-language models after adaptation, thereby mitigating catastrophic forgetting and achieving state-of-the-art performance without additional parameters or data replay.

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler

Published 2026-02-27

Imagine you have a brilliant, well-traveled librarian named CLIP. This librarian has read millions of books and seen billions of pictures. Because of this, they can guess what a picture is about just by looking at it, even if they've never seen that specific type of picture before. This is called "zero-shot" learning.

However, if you hire this librarian to work specifically in a Cat Museum and train them for a few weeks to recognize different cat breeds, something strange happens. They become amazing at spotting cats, but they start forgetting everything else. They might look at a picture of a car and think, "Is that a very strange cat?" They have suffered from catastrophic forgetting.

The Problem: The "Specialist" Trap

Most current methods try to prevent the librarian from forgetting while they learn about cats. They use special techniques (like adding a small notebook to the librarian's desk) to help them remember. But often, the librarian still loses some of their general knowledge.

The authors of MERGETUNE asked a different question: What if we accept that the librarian has already forgotten some things, and then try to fix it afterwards?

The Solution: The "Memory Bridge"

The paper proposes a new method called MERGETUNE. Think of it as building a bridge between two different versions of the librarian:

  1. The Generalist (Zero-Shot): The original librarian who knows everything but isn't great at cats yet.
  2. The Specialist (Fine-Tuned): The librarian who is amazing at cats but has forgotten how to recognize cars or dogs.

Usually, these two librarians live in different "neighborhoods" of the brain (mathematically speaking). If you try to simply mix their brains together (average their weights), it's like trying to blend oil and water; the result is messy and doesn't work well.
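To see what "mixing their brains" means concretely, here is a toy sketch of naive weight averaging, the baseline that often fails. The dict-of-arrays models and the `average_weights` helper are illustrative stand-ins, not the paper's actual code:

```python
import numpy as np

def average_weights(generalist, specialist, alpha=0.5):
    """Naively blend two models' weights layer by layer.

    `generalist` and `specialist` are dicts mapping layer names to weight
    arrays (a toy stand-in for real model state dicts). alpha=0.5 is a
    plain average; other values interpolate between the two models.
    """
    return {name: (1 - alpha) * generalist[name] + alpha * specialist[name]
            for name in generalist}

# Toy example: two "models" with a single layer each.
generalist = {"layer": np.array([1.0, 0.0])}
specialist = {"layer": np.array([0.0, 1.0])}
merged = average_weights(generalist, specialist)
```

The averaging itself is trivial; the problem is that when the two models sit in different "valleys" of the loss landscape, the midpoint can land on a ridge where neither skill survives.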

MERGETUNE uses a concept called Linear Mode Connectivity. Imagine the librarian's brain as a mountain range.

  • The Generalist lives in a valley on the left.
  • The Specialist lives in a valley on the right.
  • Usually, there is a huge, steep mountain between them. If you walk from one to the other, you fall into a deep pit (performance drops).

MERGETUNE's job is to dig a tunnel or build a smooth, flat road between these two valleys. It does this by gently adjusting the Specialist's brain, searching for a new "hybrid" librarian who can walk smoothly back to the Generalist's valley without falling, and also walk smoothly back to the Specialist's valley without falling.
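The "mountain between the valleys" can be measured directly: walk the straight line between the two sets of weights and record the loss at each step. The sketch below does this in one dimension with a made-up two-valley loss; the function names and the toy landscape are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def loss(w):
    # Hypothetical 1-D loss landscape with valleys at w = -1 and w = +1
    # and a ridge at w = 0 (a stand-in for evaluating a real model).
    return (w**2 - 1.0)**2

def barrier_along_path(w_general, w_special, steps=11):
    """Evaluate the loss along the straight line between two solutions.

    Linear mode connectivity holds when the worst loss on this path is no
    higher than at the endpoints, i.e. the "barrier" is roughly zero.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    losses = [loss((1 - a) * w_general + a * w_special) for a in alphas]
    endpoint = max(losses[0], losses[-1])
    return max(losses) - endpoint  # height of the mountain between valleys

# The straight path from w = -1 to w = +1 climbs the ridge at w = 0.
print(barrier_along_path(-1.0, 1.0))  # → 1.0
```

MERGETUNE's goal, in these terms, is to move the specialist to a nearby solution where this barrier back to the generalist is flat, so any point on the line between them still works.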

How It Works (The Magic Trick)

Normally, to build this road, you would need to show the librarian the original millions of books and pictures they learned from in the first place. But those books are lost, too big, or private.

MERGETUNE is clever. Instead of re-reading the millions of books, it uses a mathematical shortcut (a "second-order surrogate"). It's like looking at the librarian's current brain structure and guessing, "Based on how your brain is shaped, you must have learned these things originally." It uses this guess to gently nudge the Specialist back toward the Generalist's knowledge without needing the original data.
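The flavor of that shortcut can be sketched as a quadratic penalty: approximate the unavailable pretraining loss by a second-order expansion around the generalist's weights, weighted by a curvature estimate (e.g. a diagonal Fisher), then take gradient steps on that surrogate instead of on real data. Everything below is an illustrative sketch under those assumptions, not the paper's actual algorithm:

```python
import numpy as np

def surrogate_penalty(w, w_general, curvature):
    """Second-order surrogate for the lost pretraining loss.

    Instead of replaying pretraining data, approximate its loss near the
    generalist's weights: L_pre(w) ≈ 0.5 * sum_i F_i * (w_i - w_gen_i)^2,
    where F_i is a per-weight curvature estimate. High F_i marks weights
    the brain's "shape" says were important for the old knowledge.
    """
    diff = w - w_general
    return 0.5 * np.sum(curvature * diff * diff)

def nudge_toward_generalist(w_special, w_general, curvature, lr=0.1, steps=50):
    """Gradient descent on the surrogate alone: weights with high curvature
    (important for old knowledge) are pulled back to the generalist quickly,
    while low-curvature weights keep their specialist values much longer."""
    w = w_special.copy()
    for _ in range(steps):
        grad = curvature * (w - w_general)  # gradient of the quadratic
        w -= lr * grad
    return w
```

Running this with one stiff direction and one flat one shows the asymmetry: the high-curvature weight snaps back to the generalist while the low-curvature one barely moves, which is how old knowledge can be recovered without touching the new skill much.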

The Results: The Best of Both Worlds

After applying MERGETUNE, the result is a librarian who:

  • Is still amazing at recognizing cats (the new task).
  • Has recovered their ability to recognize cars, dogs, and landscapes (the old knowledge).
  • Doesn't need to carry two different brains or run two different programs at the same time (unlike ensemble-style methods that must keep both models around at inference).

Why This Matters

In the real world, this means we can take powerful AI models, teach them new specific jobs (like diagnosing a specific disease or recognizing a specific type of defect in manufacturing), and then use MERGETUNE to ensure they don't lose their general "common sense."

In short: MERGETUNE is like a memory therapist for AI. It takes an AI that has become too specialized and forgotten its roots, and gently guides it back to a state where it is both a world-class expert and a well-rounded generalist, all without needing to re-teach it from scratch.
