MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

MergeVLA is a merging-oriented Vision-Language-Action architecture that overcomes the non-mergeability of existing VLA experts by introducing task-masked sparse LoRA adapters and cross-attention-only action experts, enabling a single generalist agent to robustly handle diverse tasks and embodiments without performance degradation.

Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, Yadan Luo

Published Thu, 12 Ma

Imagine you have a brilliant robot chef. You've trained this chef to make the perfect pizza (Task A). Then, you train a different version of the same chef to make the perfect sushi (Task B). Both are experts in their own right.

Now, you want a single robot that can do both pizza and sushi without needing two separate brains. You try to "merge" the two chefs into one super-chef.

The Problem:
In the world of current AI robots (called Vision-Language-Action or VLA models), this merge usually fails spectacularly. If you try to combine the "pizza brain" and the "sushi brain," the result is a confused robot that can't do either. It might try to put pepperoni on a fish roll, or it might freeze entirely.

Why? The paper authors discovered two main reasons:

  1. The "Brain" Conflict: The part of the robot that sees and understands language (the "Vision-Language Model") gets so specialized for pizza that it forgets how to handle sushi, and vice versa. When you mix them, the instructions clash.
  2. The "Hand" Conflict: The part that actually moves the arms (the "Action Expert") learns to rely on its own internal habits. It's like a pianist who memorized a specific song as a chain of finger movements, each one triggering the next. If you try to mix those habits with a jazz player's, the chain breaks and the hands stop working.

The Solution: MergeVLA
The authors created a new robot architecture called MergeVLA. Think of it as building a robot with a modular toolkit instead of a single, rigid brain. Here is how it works, using simple analogies:

1. The "Smart Switchboard" (Task Masks)

Imagine the robot's brain is a massive library. When the robot learns to make pizza, it writes new notes on specific pages. When it learns sushi, it writes on different pages.

  • Old Way: You try to glue the two books together. The notes overlap, cross out each other, and the story becomes nonsense.
  • MergeVLA Way: Instead of gluing the books, you put a smart switch on every page. When the robot needs to make pizza, the switch turns on the pizza pages and turns off the sushi pages. When it needs sushi, it flips the switch. This way, the "pizza notes" never fight the "sushi notes." They coexist peacefully in the same book.
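The switchboard idea above can be sketched in a few lines of NumPy. This is a toy illustration under one simplifying assumption: each task's sparse LoRA update touches a *disjoint* set of weight columns, selected by a binary mask (the function and variable names here are hypothetical, not from the paper's code).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def sparse_lora_delta(mask, rank=2):
    # A low-rank LoRA update, zeroed out everywhere except the
    # task's own "pages" (the entries its mask turns on).
    A = rng.normal(size=(d, rank))
    B = rng.normal(size=(rank, d))
    return (A @ B) * mask

# Two tasks write to disjoint halves of the weight matrix.
mask_pizza = np.zeros((d, d))
mask_pizza[:, : d // 2] = 1.0
mask_sushi = np.zeros((d, d))
mask_sushi[:, d // 2 :] = 1.0

delta_pizza = sparse_lora_delta(mask_pizza)
delta_sushi = sparse_lora_delta(mask_sushi)

# Merging = simply adding both sets of notes into one book.
W_base = rng.normal(size=(d, d))
W_merged = W_base + delta_pizza + delta_sushi

def task_weights(task_mask):
    # The "smart switch": re-apply the task's mask so only its
    # own notes are read at inference time.
    return W_base + (W_merged - W_base) * task_mask

# Because the masks are disjoint, each task recovers its own
# expert weights exactly -- no interference from the other task.
assert np.allclose(task_weights(mask_pizza), W_base + delta_pizza)
assert np.allclose(task_weights(mask_sushi), W_base + delta_sushi)
```

The key property is in the two assertions: with non-overlapping masks, flipping a task's switch reproduces that expert's weights exactly, which is why the merged model loses nothing.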

2. The "Direct Line" (Cross-Attention Only)

The robot's "hands" (the action expert) used to have a habit of talking to themselves. "I'm holding the knife, so I must move my elbow, which means I must turn my wrist..." This internal chatter made the hands very good at one specific task but terrible at adapting to another.

  • MergeVLA Way: The authors removed the "internal chatter" (self-attention). Now, the hands only listen to the "Brain" (the Vision-Language part). The Brain says, "Pick up the red block," and the hands just do it. Because the hands aren't overthinking their own internal habits, they can be easily shared between different tasks.
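To see why removing self-attention helps, here is a minimal single-head sketch (toy sizes, hypothetical names). With cross-attention only, each action token's output depends on the brain's features and its own query, never on the other action tokens, so there is no fragile "internal chatter" to break when experts are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 16
rng = np.random.default_rng(1)

def cross_attention(action_queries, vlm_features, Wq, Wk, Wv):
    # Queries come from the action tokens; keys and values come
    # ONLY from the vision-language model. There is no
    # action-to-action (self-attention) path at all.
    Q = action_queries @ Wq
    K = vlm_features @ Wk
    V = vlm_features @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))
    return attn @ V

Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
vlm_feats = rng.normal(size=(10, d))  # "brain" tokens (image + language)
queries = rng.normal(size=(4, d))     # learned action-chunk queries

out = cross_attention(queries, vlm_feats, Wq, Wk, Wv)

# Perturbing one action token leaves all the others untouched:
# the tokens never talk to each other, only to the brain.
queries2 = queries.copy()
queries2[0] += 1.0
out2 = cross_attention(queries2, vlm_feats, Wq, Wk, Wv)
assert np.allclose(out[1:], out2[1:])
```

The final assertion is the whole point: with self-attention, changing one token would ripple through every other token's output; here it cannot, which is what makes the action head easy to share across tasks.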

3. The "Magic Detective" (Test-Time Router)

What if you walk up to the robot and say, "Make a sandwich," but you don't tell it which sandwich? How does it know which "switch" to flip?

  • MergeVLA Way: The robot has a built-in Magic Detective. It looks at the picture of the kitchen and the words you said. It quickly checks its internal "subspace" (a fancy way of saying it looks at the hidden patterns in its brain) to guess: "Ah, this looks like a 'Pizza' task!" or "This looks like 'Sushi'!"
  • Once the detective guesses the task, it instantly flips the correct switches and grabs the right set of "hands" (the expert head) to do the job. It does this without needing to be retrained or asked for help.
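One simple way to picture the detective is as a subspace-matching router. The sketch below is an illustrative assumption, not the paper's exact procedure: each task keeps a few feature directions (its "subspace"), and a new input is routed to whichever task's subspace captures most of its feature vector.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32  # toy feature dimension

def orthonormal_basis(k):
    # A random k-dimensional subspace of feature space, standing in
    # for the hidden patterns each expert leaves behind.
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q  # d x k, orthonormal columns

subspaces = {"pizza": orthonormal_basis(4), "sushi": orthonormal_basis(4)}

def route(feature):
    # Score each task by the norm of the feature's projection onto
    # that task's subspace, then pick the best match.
    scores = {t: np.linalg.norm(U.T @ feature) for t, U in subspaces.items()}
    return max(scores, key=scores.get)

# A feature lying inside the "sushi" subspace routes to sushi;
# one inside the "pizza" subspace routes to pizza.
assert route(subspaces["sushi"] @ rng.normal(size=4)) == "sushi"
assert route(subspaces["pizza"] @ rng.normal(size=4)) == "pizza"
```

Once `route` returns a task label, the label picks which masks to flip and which expert head to use, all at test time, with no retraining.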

The Results

The team tested this on real robots and simulations:

  • In Simulation: The merged robot could switch between making a bed, stacking blocks, and pushing objects with 90% success, almost as good as if it had been trained separately for each.
  • In the Real World: They put it on a real robotic arm. It could pick up cubes, push them, and stack them, even when the colors of the cubes were different from what it was trained on.

The Big Picture

This paper is a breakthrough because it shows we don't need to train a giant new robot from scratch for every new skill. Instead, we can train small, specialized experts, merge them with this smart "switchboard" design, and get one generalist robot that handles almost anything. It's the difference between lugging around a toolbox with 100 separate tools and carrying one Swiss Army Knife that instantly becomes the tool you need.