Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

This paper addresses catastrophic forgetting in multimodal large language models (MLLMs) under real-world scenario shifts. It introduces the MSVQA dataset and proposes UNIFIER, a continual learning framework that uses Vision Representation Expansion and a Vision Consistency Constraint to achieve significant performance improvements on cross-scenario visual tasks.

Kai Jiang, Siqi Huang, Xiangyu Chen, Jiawei Shao, Hongyuan Zhang, Ping Luo, Xuelong Li

Published 2026-03-16

Imagine you have a brilliant, super-smart robot assistant named MLLM (Multimodal Large Language Model). This robot can look at pictures and answer questions about them, like "How many birds are in this photo?" or "What color is that car?"

Right now, this robot is great at its job, but it has a serious case of amnesia.

The Problem: The "One-Task" Robot

Currently, if you teach this robot how to spot airplanes in the sky (High Altitude), it becomes an expert. But the moment you show it a picture from underwater and ask it to find fish, it forgets everything it knew about airplanes. It gets confused, mixes up the fish with the planes, and starts making mistakes.

In the real world, our devices (like drones, underwater cameras, or security cameras) see the world from all sorts of angles:

  • High Altitude: Looking down from a satellite (tiny cars look like dots).
  • Underwater: Murky, blue, and distorted.
  • Low Altitude: A drone flying just above a busy street.
  • Indoor: A first-person view of someone cooking in a kitchen.

If a robot can't learn these different "worlds" without forgetting the previous ones, it's useless for real-life applications. This is called Catastrophic Forgetting.

The Solution: The "UNIFIER" Backpack

The authors of this paper created a new system called UNIFIER. Think of UNIFIER not as a single brain, but as a backpack with multiple specialized pockets.

Here is how it works, using a simple analogy:

1. The "Pocket" System (Vision Representation Expansion)

Imagine the robot has a main brain, but it also has a backpack.

  • When it learns about Airplanes, it puts a special "Airplane Pocket" in the backpack. It fills this pocket with all the rules for spotting planes.
  • When it learns about Fish, it adds a "Fish Pocket." It fills this with rules for spotting fish.
  • The Magic: These pockets don't mix. The robot can look into the "Airplane Pocket" to remember planes, and the "Fish Pocket" to remember fish, without the two ideas getting jumbled together. This stops the robot from forgetting the old stuff when learning the new stuff.
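The pocket idea can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's implementation: the names `ScenarioAdapter` and `UnifierBackbone` are invented here, and a real system would use learned neural adapter layers rather than a single scale factor. The point it shows is the mechanism: each scenario gets its own module, and old modules are frozen before a new one is trained, so earlier knowledge cannot be overwritten.

```python
# Hypothetical sketch of Vision Representation Expansion: one small
# "pocket" (adapter) per scenario, added as new scenarios arrive.
# ScenarioAdapter / UnifierBackbone are illustrative names, not from the paper.

class ScenarioAdapter:
    """A tiny per-scenario transform; a real system would use neural layers."""
    def __init__(self, scale):
        self.scale = scale      # stands in for learned parameters
        self.frozen = False     # frozen adapters are never updated again

    def __call__(self, features):
        return [f * self.scale for f in features]

class UnifierBackbone:
    def __init__(self):
        self.adapters = {}      # scenario name -> its dedicated pocket

    def add_scenario(self, name, scale):
        # Freeze every existing pocket before learning a new one,
        # so old knowledge cannot be overwritten.
        for adapter in self.adapters.values():
            adapter.frozen = True
        self.adapters[name] = ScenarioAdapter(scale)

    def encode(self, scenario, features):
        # Route the image features through that scenario's pocket only.
        return self.adapters[scenario](features)

model = UnifierBackbone()
model.add_scenario("high_altitude", scale=2.0)
model.add_scenario("underwater", scale=0.5)

# Learning "underwater" froze the airplane pocket; its behavior is unchanged.
print(model.encode("high_altitude", [1.0, 2.0]))  # [2.0, 4.0]
print(model.adapters["high_altitude"].frozen)     # True
```

Because each pocket is only ever touched while its own scenario is being learned, adding "underwater" cannot disturb what "high_altitude" already knows, which is the whole trick against catastrophic forgetting.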

2. The "Team Captain" (Vision Consistency Constraint)

You might ask: "If the pockets are separate, how does the robot know that a 'bird' in the sky is similar to a 'bird' in a park?"

This is where the Team Captain comes in. Even though the pockets are separate, the Team Captain ensures that the way the robot thinks about the world stays consistent.

  • It gently reminds the robot: "Hey, the way you see a car in the sky is similar to how you see a car on the ground. Don't change your whole personality just because the background changed."
  • This allows the robot to share knowledge between pockets. Learning about cars in the sky actually helps it get better at spotting cars underwater, without causing confusion.
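The "Team Captain" can be pictured as an extra training penalty that pulls matching concepts in different pockets toward each other. The sketch below is an assumption-laden toy: `consistency_loss` and the hand-picked feature vectors are invented for illustration, and the paper's actual constraint may use a different distance. What it shows is the shape of the idea: during training, minimizing this loss nudges the new pocket to describe a "car" roughly the way the frozen old pocket already does, while unrelated concepts remain far apart.

```python
# Hypothetical sketch of the Vision Consistency Constraint: a penalty
# that keeps the new pocket's view of an object close to the old one's.
# The function name and the feature vectors are illustrative only.

def consistency_loss(old_repr, new_repr):
    """Mean squared distance between old and new representations.
    Minimizing this during training keeps the robot's 'personality'
    consistent across pockets, so knowledge can transfer between them."""
    return sum((o - n) ** 2 for o, n in zip(old_repr, new_repr)) / len(old_repr)

# The same car, seen through the frozen "sky" pocket and the new "ground" one.
sky_car     = [0.9, 0.1, 0.4]
ground_car  = [0.8, 0.2, 0.5]
ground_boat = [0.1, 0.9, 0.7]

# The constraint pulls matching concepts together (small loss)...
print(round(consistency_loss(sky_car, ground_car), 4))   # 0.01
# ...while unrelated concepts stay far apart, so pockets still specialize.
print(round(consistency_loss(sky_car, ground_boat), 4))  # 0.4567
```

In a real training loop this term would be added to the task loss, so the new pocket is free to specialize for its scenario but pays a price for drifting away from shared concepts.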

The New Test: MSVQA

To prove this works, the researchers built a giant test called MSVQA.

  • Instead of just asking simple questions like "Is this a cat?", they created a chaotic, real-world test.
  • They threw images at the robot from satellites, underwater cameras, drones, and kitchens.
  • They asked complex questions like "Count the planes," "Find the specific model of the plane," or "Locate the fish hiding in the coral."

The Results: A Super-Student

When they tested the robot with the old methods, it failed miserably. As soon as it learned a new scene, it forgot the old one.

But with UNIFIER:

  • It remembers everything: It kept getting better at spotting airplanes even after learning about fish.
  • It gets smarter: By seeing different perspectives, it actually improved its overall skills.
  • It's fast: The new backpack system didn't make the robot slow; it was just as fast as before.

The Takeaway

Think of UNIFIER as a student who doesn't just memorize one textbook. Instead, they have a library of specialized guides for every different environment they visit. When they go to the ocean, they open the ocean guide. When they go to the sky, they open the sky guide. But because they are the same student, they can use what they learned in the ocean to help them understand the sky, and they never forget what they learned yesterday.

This paper is a big step toward making AI assistants that can truly live in our messy, changing, multi-scenario world without losing their minds.
