Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

This paper addresses catastrophic forgetting in multimodal large language models (MLLMs) under real-world scenario shifts. It introduces the MSVQA dataset and proposes UNIFIER, a continual learning framework that uses Vision Representation Expansion and a Vision Consistency Constraint to achieve significant performance improvements on cross-scenario visual tasks.

Kai Jiang, Siqi Huang, Xiangyu Chen, Jiawei Shao, Hongyuan Zhang, Ping Luo, Xuelong Li

Published 2026-03-16

Imagine you have a brilliant, super-smart robot assistant named MLLM (Multimodal Large Language Model). This robot can look at pictures and answer questions about them, like "How many birds are in this photo?" or "What color is that car?"

Right now, this robot is great at its job, but it has a serious case of amnesia.

The Problem: The "One-Task" Robot

Currently, if you teach this robot how to spot airplanes in the sky (High Altitude), it becomes an expert. But the moment you show it a picture from underwater and ask it to find fish, it forgets everything it knew about airplanes. It gets confused, mixes up the fish with the planes, and starts making mistakes.

In the real world, our devices (like drones, underwater cameras, or security cameras) see the world from all sorts of angles:

  • High Altitude: Looking down from a satellite (tiny cars look like dots).
  • Underwater: Murky, blue, and distorted.
  • Low Altitude: A drone flying just above a busy street.
  • Indoor: A first-person view of someone cooking in a kitchen.

If a robot can't learn these different "worlds" without forgetting the previous ones, it's useless for real-life applications. This is called Catastrophic Forgetting.

The Solution: The "UNIFIER" Backpack

The authors of this paper created a new system called UNIFIER. Think of UNIFIER not as a single brain, but as a backpack with multiple specialized pockets.

Here is how it works, using a simple analogy:

1. The "Pocket" System (Vision Representation Expansion)

Imagine the robot has a main brain, but it also has a backpack.

  • When it learns about Airplanes, it puts a special "Airplane Pocket" in the backpack. It fills this pocket with all the rules for spotting planes.
  • When it learns about Fish, it adds a "Fish Pocket." It fills this with rules for spotting fish.
  • The Magic: These pockets don't mix. The robot can look into the "Airplane Pocket" to remember planes, and the "Fish Pocket" to remember fish, without the two ideas getting jumbled together. This stops the robot from forgetting the old stuff when learning the new stuff.
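The pocket idea can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's implementation: the names `ScenarioAdapter` and `UnifierBackbone` are invented here, and a real system would use learned neural adapter layers rather than a single scale factor. The point it shows is the mechanism: each scenario gets its own module, and old modules are frozen before a new one is trained, so earlier knowledge cannot be overwritten.

```python
# Hypothetical sketch of Vision Representation Expansion: one small
# "pocket" (adapter) per scenario, added as new scenarios arrive.
# ScenarioAdapter / UnifierBackbone are illustrative names, not from the paper.

class ScenarioAdapter:
    """A tiny per-scenario transform; a real system would use neural layers."""
    def __init__(self, scale):
        self.scale = scale      # stands in for learned parameters
        self.frozen = False     # frozen adapters are never updated again

    def __call__(self, features):
        return [f * self.scale for f in features]

class UnifierBackbone:
    def __init__(self):
        self.adapters = {}      # scenario name -> its dedicated pocket

    def add_scenario(self, name, scale):
        # Freeze every existing pocket before learning a new one,
        # so old knowledge cannot be overwritten.
        for adapter in self.adapters.values():
            adapter.frozen = True
        self.adapters[name] = ScenarioAdapter(scale)

    def encode(self, scenario, features):
        # Route the image features through that scenario's pocket only.
        return self.adapters[scenario](features)

model = UnifierBackbone()
model.add_scenario("high_altitude", scale=2.0)
model.add_scenario("underwater", scale=0.5)

# Learning "underwater" froze the airplane pocket; its behavior is unchanged.
print(model.encode("high_altitude", [1.0, 2.0]))  # [2.0, 4.0]
print(model.adapters["high_altitude"].frozen)     # True
```

Because each pocket is only ever touched while its own scenario is being learned, adding "underwater" cannot disturb what "high_altitude" already knows, which is the whole trick against catastrophic forgetting.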

2. The "Team Captain" (Vision Consistency Constraint)

You might ask: "If the pockets are separate, how does the robot know that a 'bird' in the sky is similar to a 'bird' in a park?"

This is where the Team Captain comes in. Even though the pockets are separate, the Team Captain ensures that the way the robot thinks about the world stays consistent.

  • It gently reminds the robot: "Hey, the way you see a car in the sky is similar to how you see a car on the ground. Don't change your whole personality just because the background changed."
  • This allows the robot to share knowledge between pockets. Learning about cars in the sky actually helps it get better at spotting cars underwater, without causing confusion.
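The "Team Captain" can be pictured as an extra training penalty that pulls matching concepts in different pockets toward each other. The sketch below is an assumption-laden toy: `consistency_loss` and the hand-picked feature vectors are invented for illustration, and the paper's actual constraint may use a different distance. What it shows is the shape of the idea: during training, minimizing this loss nudges the new pocket to describe a "car" roughly the way the frozen old pocket already does, while unrelated concepts remain far apart.

```python
# Hypothetical sketch of the Vision Consistency Constraint: a penalty
# that keeps the new pocket's view of an object close to the old one's.
# The function name and the feature vectors are illustrative only.

def consistency_loss(old_repr, new_repr):
    """Mean squared distance between old and new representations.
    Minimizing this during training keeps the robot's 'personality'
    consistent across pockets, so knowledge can transfer between them."""
    return sum((o - n) ** 2 for o, n in zip(old_repr, new_repr)) / len(old_repr)

# The same car, seen through the frozen "sky" pocket and the new "ground" one.
sky_car     = [0.9, 0.1, 0.4]
ground_car  = [0.8, 0.2, 0.5]
ground_boat = [0.1, 0.9, 0.7]

# The constraint pulls matching concepts together (small loss)...
print(round(consistency_loss(sky_car, ground_car), 4))   # 0.01
# ...while unrelated concepts stay far apart, so pockets still specialize.
print(round(consistency_loss(sky_car, ground_boat), 4))  # 0.4567
```

In a real training loop this term would be added to the task loss, so the new pocket is free to specialize for its scenario but pays a price for drifting away from shared concepts.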

The New Test: MSVQA

To prove this works, the researchers built a giant test called MSVQA.

  • Instead of just asking simple questions like "Is this a cat?", they created a chaotic, real-world test.
  • They threw images at the robot from satellites, underwater cameras, drones, and kitchens.
  • They asked complex questions like "Count the planes," "Find the specific model of the plane," or "Locate the fish hiding in the coral."

The Results: A Super-Student

When they tested the robot with the old methods, it failed miserably. As soon as it learned a new scene, it forgot the old one.

But with UNIFIER:

  • It remembers everything: It kept getting better at spotting airplanes even after learning about fish.
  • It gets smarter: By seeing different perspectives, it actually improved its overall skills.
  • It's fast: The new backpack system didn't make the robot slow; it was just as fast as before.

The Takeaway

Think of UNIFIER as a student who doesn't just memorize one textbook. Instead, they have a library of specialized guides for every different environment they visit. When they go to the ocean, they open the ocean guide. When they go to the sky, they open the sky guide. But because they are the same student, they can use what they learned in the ocean to help them understand the sky, and they never forget what they learned yesterday.

This paper is a big step toward making AI assistants that can truly live in our messy, changing, multi-scenario world without losing their minds.
