MERGETUNE: Continued Fine-Tuning of Vision-Language Models

This paper introduces MERGETUNE, a model-agnostic continued fine-tuning strategy that leverages linear mode connectivity and a second-order surrogate to recover pretrained knowledge in vision-language models after adaptation, thereby mitigating catastrophic forgetting and achieving state-of-the-art performance without additional parameters or data replay.

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler

Published 2026-02-27

Imagine you have a brilliant, well-traveled librarian named CLIP. This librarian has read millions of books and seen billions of pictures. Because of this, they can guess what a picture is about just by looking at it, even if they've never seen that specific type of picture before. This is called "zero-shot" learning.

However, if you hire this librarian to work specifically in a Cat Museum and train them for a few weeks to recognize different cat breeds, something strange happens. They become amazing at spotting cats, but they start forgetting everything else. They might look at a picture of a car and think, "Is that a very strange cat?" They have suffered from catastrophic forgetting.

The Problem: The "Specialist" Trap

Most current methods try to prevent the librarian from forgetting while they learn about cats. They use special techniques (like adding a small notebook to the librarian's desk) to help them remember. But often, the librarian still loses some of their general knowledge.

The authors of MERGETUNE asked a different question: What if we accept that the librarian has already forgotten some things, and then try to fix it afterwards?

The Solution: The "Memory Bridge"

The paper proposes a new method called MERGETUNE. Think of it as building a bridge between two different versions of the librarian:

  1. The Generalist (Zero-Shot): The original librarian who knows everything but isn't great at cats yet.
  2. The Specialist (Fine-Tuned): The librarian who is amazing at cats but has forgotten how to recognize cars or dogs.

Usually, these two librarians live in different "neighborhoods" of the brain (mathematically speaking). If you try to simply mix their brains together (average their weights), it's like trying to blend oil and water; the result is messy and doesn't work well.
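To see what "mixing their brains" means concretely, here is a toy sketch of naive weight averaging, the baseline that often fails. The dict-of-arrays models and the `average_weights` helper are illustrative stand-ins, not the paper's actual code:

```python
import numpy as np

def average_weights(generalist, specialist, alpha=0.5):
    """Naively blend two models' weights layer by layer.

    `generalist` and `specialist` are dicts mapping layer names to weight
    arrays (a toy stand-in for real model state dicts). alpha=0.5 is a
    plain average; other values interpolate between the two models.
    """
    return {name: (1 - alpha) * generalist[name] + alpha * specialist[name]
            for name in generalist}

# Toy example: two "models" with a single layer each.
generalist = {"layer": np.array([1.0, 0.0])}
specialist = {"layer": np.array([0.0, 1.0])}
merged = average_weights(generalist, specialist)
```

The averaging itself is trivial; the problem is that when the two models sit in different "valleys" of the loss landscape, the midpoint can land on a ridge where neither skill survives.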

MERGETUNE uses a concept called Linear Mode Connectivity. Imagine the librarian's brain as a mountain range.

  • The Generalist lives in a valley on the left.
  • The Specialist lives in a valley on the right.
  • Usually, there is a huge, steep mountain between them. If you walk from one to the other, you fall into a deep pit (performance drops).

MERGETUNE's job is to dig a tunnel or build a smooth, flat road between these two valleys. It does this by gently adjusting the Specialist's brain, searching for a new "hybrid" librarian who can walk smoothly back to the Generalist's valley without falling, and also walk smoothly back to the Specialist's valley without falling.
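The "mountain between the valleys" can be measured directly: walk the straight line between the two sets of weights and record the loss at each step. The sketch below does this in one dimension with a made-up two-valley loss; the function names and the toy landscape are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def loss(w):
    # Hypothetical 1-D loss landscape with valleys at w = -1 and w = +1
    # and a ridge at w = 0 (a stand-in for evaluating a real model).
    return (w**2 - 1.0)**2

def barrier_along_path(w_general, w_special, steps=11):
    """Evaluate the loss along the straight line between two solutions.

    Linear mode connectivity holds when the worst loss on this path is no
    higher than at the endpoints, i.e. the "barrier" is roughly zero.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    losses = [loss((1 - a) * w_general + a * w_special) for a in alphas]
    endpoint = max(losses[0], losses[-1])
    return max(losses) - endpoint  # height of the mountain between valleys

# The straight path from w = -1 to w = +1 climbs the ridge at w = 0.
print(barrier_along_path(-1.0, 1.0))  # → 1.0
```

MERGETUNE's goal, in these terms, is to move the specialist to a nearby solution where this barrier back to the generalist is flat, so any point on the line between them still works.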

How It Works (The Magic Trick)

Normally, to build this road, you would need to show the librarian the original millions of books and pictures they learned from in the first place. But those books are lost, too big, or private.

MERGETUNE is clever. Instead of re-reading the millions of books, it uses a mathematical shortcut (a "second-order surrogate"). It's like looking at the librarian's current brain structure and guessing, "Based on how your brain is shaped, you must have learned these things originally." It uses this guess to gently nudge the Specialist back toward the Generalist's knowledge without needing the original data.
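The flavor of that shortcut can be sketched as a quadratic penalty: approximate the unavailable pretraining loss by a second-order expansion around the generalist's weights, weighted by a curvature estimate (e.g. a diagonal Fisher), then take gradient steps on that surrogate instead of on real data. Everything below is an illustrative sketch under those assumptions, not the paper's actual algorithm:

```python
import numpy as np

def surrogate_penalty(w, w_general, curvature):
    """Second-order surrogate for the lost pretraining loss.

    Instead of replaying pretraining data, approximate its loss near the
    generalist's weights: L_pre(w) ≈ 0.5 * sum_i F_i * (w_i - w_gen_i)^2,
    where F_i is a per-weight curvature estimate. High F_i marks weights
    the brain's "shape" says were important for the old knowledge.
    """
    diff = w - w_general
    return 0.5 * np.sum(curvature * diff * diff)

def nudge_toward_generalist(w_special, w_general, curvature, lr=0.1, steps=50):
    """Gradient descent on the surrogate alone: weights with high curvature
    (important for old knowledge) are pulled back to the generalist quickly,
    while low-curvature weights keep their specialist values much longer."""
    w = w_special.copy()
    for _ in range(steps):
        grad = curvature * (w - w_general)  # gradient of the quadratic
        w -= lr * grad
    return w
```

Running this with one stiff direction and one flat one shows the asymmetry: the high-curvature weight snaps back to the generalist while the low-curvature one barely moves, which is how old knowledge can be recovered without touching the new skill much.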

The Results: The Best of Both Worlds

After applying MERGETUNE, the result is a librarian who:

  • Is still amazing at recognizing cats (the new task).
  • Has recovered their ability to recognize cars, dogs, and landscapes (the old knowledge).
  • Doesn't need to carry two different brains or run two different programs at the same time (unlike ensemble-style methods that must keep both models around at inference).

Why This Matters

In the real world, this means we can take powerful AI models, teach them new specific jobs (like diagnosing a specific disease or recognizing a specific type of defect in manufacturing), and then use MERGETUNE to ensure they don't lose their general "common sense."

In short: MERGETUNE is like a memory therapist for AI. It takes an AI that has become too specialized and forgotten its roots, and gently guides it back to a state where it is both a world-class expert and a well-rounded generalist, all without needing to re-teach it from scratch.
