MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

MergeVLA is a merging-oriented Vision-Language-Action architecture that overcomes the non-mergeability of existing VLA experts by introducing task-masked sparse LoRA adapters and cross-attention-only action experts, enabling a single generalist agent to robustly handle diverse tasks and embodiments without performance degradation.

Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, Yadan Luo

Published Thu, 12 Ma

Imagine you have a brilliant robot chef. You've trained this chef to make the perfect pizza (Task A). Then, you train a different version of the same chef to make the perfect sushi (Task B). Both are experts in their own right.

Now, you want a single robot that can do both pizza and sushi without needing two separate brains. You try to "merge" the two chefs into one super-chef.

The Problem:
In the world of current AI robots (called Vision-Language-Action or VLA models), this merge usually fails spectacularly. If you try to combine the "pizza brain" and the "sushi brain," the result is a confused robot that can't do either. It might try to put pepperoni on a fish roll, or it might freeze entirely.

Why? The paper authors discovered two main reasons:

  1. The "Brain" Conflict: The part of the robot that sees and understands language (the "Vision-Language Model") gets so specialized for pizza that it forgets how to handle sushi, and vice versa. When you mix them, the instructions clash.
  2. The "Hand" Conflict: The part that actually moves the arms (the "Action Expert") learns to rely on its own internal habits. It's like a pianist who memorized a specific song as a chain of finger movements, each one triggering the next. If you try to mix those habits with a jazz player's, the chain breaks and the hands stop working.

The Solution: MergeVLA
The authors created a new robot architecture called MergeVLA. Think of it as building a robot with a modular toolkit instead of a single, rigid brain. Here is how it works, using simple analogies:

1. The "Smart Switchboard" (Task Masks)

Imagine the robot's brain is a massive library. When the robot learns to make pizza, it writes new notes on specific pages. When it learns sushi, it writes on different pages.

  • Old Way: You try to glue the two books together. The notes overlap, cross out each other, and the story becomes nonsense.
  • MergeVLA Way: Instead of gluing the books, you put a smart switch on every page. When the robot needs to make pizza, the switch turns on the pizza pages and turns off the sushi pages. When it needs sushi, it flips the switch. This way, the "pizza notes" never fight the "sushi notes." They coexist peacefully in the same book.
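The switchboard idea above can be sketched in a few lines of NumPy. This is a toy illustration under one simplifying assumption: each task's sparse LoRA update touches a *disjoint* set of weight columns, selected by a binary mask (the function and variable names here are hypothetical, not from the paper's code).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def sparse_lora_delta(mask, rank=2):
    # A low-rank LoRA update, zeroed out everywhere except the
    # task's own "pages" (the entries its mask turns on).
    A = rng.normal(size=(d, rank))
    B = rng.normal(size=(rank, d))
    return (A @ B) * mask

# Two tasks write to disjoint halves of the weight matrix.
mask_pizza = np.zeros((d, d))
mask_pizza[:, : d // 2] = 1.0
mask_sushi = np.zeros((d, d))
mask_sushi[:, d // 2 :] = 1.0

delta_pizza = sparse_lora_delta(mask_pizza)
delta_sushi = sparse_lora_delta(mask_sushi)

# Merging = simply adding both sets of notes into one book.
W_base = rng.normal(size=(d, d))
W_merged = W_base + delta_pizza + delta_sushi

def task_weights(task_mask):
    # The "smart switch": re-apply the task's mask so only its
    # own notes are read at inference time.
    return W_base + (W_merged - W_base) * task_mask

# Because the masks are disjoint, each task recovers its own
# expert weights exactly -- no interference from the other task.
assert np.allclose(task_weights(mask_pizza), W_base + delta_pizza)
assert np.allclose(task_weights(mask_sushi), W_base + delta_sushi)
```

The key property is in the two assertions: with non-overlapping masks, flipping a task's switch reproduces that expert's weights exactly, which is why the merged model loses nothing.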

2. The "Direct Line" (Cross-Attention Only)

The robot's "hands" (the action expert) used to have a habit of talking to themselves. "I'm holding the knife, so I must move my elbow, which means I must turn my wrist..." This internal chatter made the hands very good at one specific task but terrible at adapting to another.

  • MergeVLA Way: The authors removed the "internal chatter" (self-attention). Now, the hands only listen to the "Brain" (the Vision-Language part). The Brain says, "Pick up the red block," and the hands just do it. Because the hands aren't overthinking their own internal habits, they can be easily shared between different tasks.
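To see why removing self-attention helps, here is a minimal single-head sketch (toy sizes, hypothetical names). With cross-attention only, each action token's output depends on the brain's features and its own query, never on the other action tokens, so there is no fragile "internal chatter" to break when experts are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 16
rng = np.random.default_rng(1)

def cross_attention(action_queries, vlm_features, Wq, Wk, Wv):
    # Queries come from the action tokens; keys and values come
    # ONLY from the vision-language model. There is no
    # action-to-action (self-attention) path at all.
    Q = action_queries @ Wq
    K = vlm_features @ Wk
    V = vlm_features @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))
    return attn @ V

Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
vlm_feats = rng.normal(size=(10, d))  # "brain" tokens (image + language)
queries = rng.normal(size=(4, d))     # learned action-chunk queries

out = cross_attention(queries, vlm_feats, Wq, Wk, Wv)

# Perturbing one action token leaves all the others untouched:
# the tokens never talk to each other, only to the brain.
queries2 = queries.copy()
queries2[0] += 1.0
out2 = cross_attention(queries2, vlm_feats, Wq, Wk, Wv)
assert np.allclose(out[1:], out2[1:])
```

The final assertion is the whole point: with self-attention, changing one token would ripple through every other token's output; here it cannot, which is what makes the action head easy to share across tasks.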

3. The "Magic Detective" (Test-Time Router)

What if you walk up to the robot and say, "Make a sandwich," but you don't tell it which sandwich? How does it know which "switch" to flip?

  • MergeVLA Way: The robot has a built-in Magic Detective. It looks at the picture of the kitchen and the words you said. It quickly checks its internal "subspace" (a fancy way of saying it looks at the hidden patterns in its brain) to guess: "Ah, this looks like a 'Pizza' task!" or "This looks like 'Sushi'!"
  • Once the detective guesses the task, it instantly flips the correct switches and grabs the right set of "hands" (the expert head) to do the job. It does this without needing to be retrained or asked for help.
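One simple way to picture the detective is as a subspace-matching router. The sketch below is an illustrative assumption, not the paper's exact procedure: each task keeps a few feature directions (its "subspace"), and a new input is routed to whichever task's subspace captures most of its feature vector.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32  # toy feature dimension

def orthonormal_basis(k):
    # A random k-dimensional subspace of feature space, standing in
    # for the hidden patterns each expert leaves behind.
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q  # d x k, orthonormal columns

subspaces = {"pizza": orthonormal_basis(4), "sushi": orthonormal_basis(4)}

def route(feature):
    # Score each task by the norm of the feature's projection onto
    # that task's subspace, then pick the best match.
    scores = {t: np.linalg.norm(U.T @ feature) for t, U in subspaces.items()}
    return max(scores, key=scores.get)

# A feature lying inside the "sushi" subspace routes to sushi;
# one inside the "pizza" subspace routes to pizza.
assert route(subspaces["sushi"] @ rng.normal(size=4)) == "sushi"
assert route(subspaces["pizza"] @ rng.normal(size=4)) == "pizza"
```

Once `route` returns a task label, the label picks which masks to flip and which expert head to use, all at test time, with no retraining.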

The Results

The team tested this on real robots and simulations:

  • In Simulation: The merged robot could switch between making a bed, stacking blocks, and pushing objects with 90% success, almost as good as if it had been trained separately for each.
  • In the Real World: They put it on a real robotic arm. It could pick up cubes, push them, and stack them, even when the colors of the cubes were different from what it was trained on.

The Big Picture

This paper is a breakthrough because it shows we don't need to train a giant new robot from scratch for every new skill. Instead, we can train small, specialized experts, merge them with this smart "switchboard" design, and get one generalist robot that handles almost anything. It's the difference between lugging around a toolbox with 100 separate tools and carrying one Swiss Army Knife that instantly becomes the tool you need.