InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

InstructVLA introduces a Vision-Language-Action instruction-tuning paradigm that bridges flexible multimodal reasoning and precise manipulation. By jointly optimizing embodied reasoning and action generation, it achieves state-of-the-art performance in both simulated and real-world robotic tasks without sacrificing the underlying model's pre-trained capabilities.

Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang

Published 2026-03-04

Imagine you have a brilliant, well-read librarian who knows everything about the world, from how to fix a car to the history of ancient Rome. Now, imagine you want to teach this librarian to actually do things with their hands, like picking up a cup or opening a drawer.

The problem is, if you just start training them to move their hands, they might forget how to read or lose their ability to understand complex instructions. They might become a great hand-mover but a terrible thinker.

This is the exact challenge the paper "InstructVLA" tackles. The researchers built a new kind of robot brain that acts like a bilingual master chef who can both read a complex recipe (reasoning) and cook the dish perfectly (action) without forgetting how to read.

Here is the breakdown of their solution using simple analogies:

1. The Problem: The "Amnesia" Effect

Current robot brains (called VLA models) are like students who cram for a test. They learn to move their arms to do specific tasks, but in the process, they often "forget" the general knowledge they learned from the internet.

  • The Old Way: If you ask a standard robot, "Pick up the red thing," it might do it. But if you ask, "I'm thirsty but I don't want soda; grab me something else," it gets confused. It's too focused on the action and has lost its common sense.
  • The Risk: If you train a robot too hard on moving things, it suffers from "catastrophic forgetting"—it loses its ability to understand language and the world.

2. The Solution: InstructVLA (The "Thinking Doer")

The authors created InstructVLA, a model that keeps the librarian's brain (the Vision-Language Model) intact while adding a specialized "hand" (the Action Expert).

Think of it like a General Contractor and a Specialized Builder:

  • The General Contractor (The VLM): This is the big brain. It looks at the scene, reads the instructions, and figures out the plan. It says, "Okay, the user wants a drink. The fridge is closed. I need to open the fridge, find the juice, and pour it." It never stops thinking or learning.
  • The Specialized Builder (The Action Expert): This is the muscle. It doesn't worry about why we are doing it; it just executes the precise movements to open the fridge door or pour the juice.
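The contractor/builder split above can be sketched as two modules: a frozen pre-trained model that turns an observation into a latent plan, and a small trainable head that decodes that plan into motor commands. This is a minimal illustration only; the class names, dimensions, and single-layer stand-ins are invented here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenVLM:
    """Stands in for the pre-trained vision-language model (the "general contractor").
    Its weights stay fixed; it maps fused image + instruction features to a latent plan."""
    def __init__(self, dim=16):
        self.W = rng.standard_normal((dim, dim))  # frozen projection (hypothetical)

    def plan(self, obs_embedding):
        # In the real model this is a full VLM forward pass; here, one frozen layer.
        return np.tanh(obs_embedding @ self.W)

class ActionExpert:
    """Stands in for the trainable action head (the "specialized builder").
    It decodes the VLM's latent plan into a low-level action vector."""
    def __init__(self, dim=16, action_dim=7):  # 7-DoF arms are common; dim is arbitrary here
        self.W = rng.standard_normal((dim, action_dim)) * 0.1

    def act(self, latent_plan):
        return latent_plan @ self.W  # continuous action vector

vlm, expert = FrozenVLM(), ActionExpert()
obs = rng.standard_normal(16)       # fused image + instruction features (made up)
action = expert.act(vlm.plan(obs))  # plan first, then execute
print(action.shape)                 # (7,)
```

The key design point the analogy captures: only `ActionExpert.W` would receive gradients during manipulation training, so the "brain" cannot forget what it already knows.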

3. The Secret Sauce: "Mixture of Experts" (The Traffic Controller)

How do you get these two to work together without them fighting? The paper uses a clever trick called Mixture-of-Experts (MoE).

Imagine a Traffic Controller at a busy airport.

  • Sometimes the plane needs to talk to the tower (Reasoning). The controller points the signal to the "Language" runway.
  • Sometimes the plane needs to land (Action). The controller points the signal to the "Action" runway.
  • The Magic: The controller can switch between these runways instantly. It lets the robot say "I see a spoon" (reasoning) and then immediately switch to grabbing the spoon (action), without the brain getting confused or the hands getting stuck.
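The traffic-controller idea is, at its core, a small gating network that scores each expert for the current token and mixes (or selects) their outputs accordingly. The sketch below shows generic soft MoE routing with two stand-in experts; the gate shape, expert functions, and dimensions are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_route(token, gate_W, experts):
    """Soft routing: a tiny gate scores each expert for this token,
    then the output is the score-weighted mix of the expert outputs."""
    scores = softmax(token @ gate_W)                 # one weight per expert, sums to 1
    outputs = np.stack([f(token) for f in experts])  # each expert sees the same token
    return scores, scores @ outputs

rng = np.random.default_rng(1)
dim = 8
gate_W = rng.standard_normal((dim, 2))   # 2 experts: "language" and "action" (hypothetical)
language_expert = lambda t: np.tanh(t)   # stand-in for the reasoning pathway
action_expert = lambda t: t * 0.5        # stand-in for the action pathway

token = rng.standard_normal(dim)
scores, mixed = moe_route(token, gate_W, [language_expert, action_expert])
print(scores)  # two gate weights that sum to 1
```

Because the gate is just another differentiable layer, the model can learn per-token when to lean on the "language runway" versus the "action runway" without any hand-written switching rule.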

4. The Training: "The 650,000-Step Bootcamp"

To teach this robot, they didn't just show it videos of robots moving. They created a massive, custom dataset called VLA-IT (Vision-Language-Action Instruction Tuning).

  • The Analogy: Imagine teaching a child not just by saying "Pick up the cup," but by having them explain why they are picking it up, describe the cup, and then do it.
  • They took 650,000 examples of robots working and added layers of "thinking" to them. They taught the robot to describe the scene, answer questions about it, and then plan its move.
  • Two-Stage Training:
    1. Stage 1 (The Muscle Memory): They taught the robot's hands how to move based on vague descriptions, without messing up the brain.
    2. Stage 2 (The Brain-Hand Sync): They taught the whole system to switch between talking and moving seamlessly, using the "Traffic Controller" (MoE) to decide what to do next.
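The two stages amount to a freeze schedule: which parameter groups receive gradients when. Here is a minimal sketch of that schedule as described above; the module names are hypothetical, and the paper's actual recipe (e.g. which adapters or layers are tuned in each stage) may differ.

```python
# Hypothetical parameter groups; real module names in the codebase will differ.
MODULES = ["vlm_backbone", "action_expert", "moe_gate"]

def trainable_modules(stage):
    """Stage 1 ("muscle memory"): train only the action head so the
    pre-trained brain is untouched. Stage 2 ("brain-hand sync"): also
    train the MoE gate so the model learns when to talk vs. move.
    The VLM backbone stays frozen throughout to avoid catastrophic forgetting."""
    if stage == 1:
        return {"action_expert"}
    if stage == 2:
        return {"action_expert", "moe_gate"}
    raise ValueError(f"unknown stage: {stage}")

for stage in (1, 2):
    flags = {m: m in trainable_modules(stage) for m in MODULES}
    print(stage, flags)
```

Note that `vlm_backbone` is never in the trainable set: that single constraint is what protects the librarian's knowledge while the hands are being trained.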

5. The Results: From "Dumb Robot" to "Smart Assistant"

The paper tested this new robot on a new benchmark called SimplerEnv-Instruct. This isn't just about picking up a specific block; it's about understanding tricky instructions.

  • The Test: "I want to clean the table. Pick a suitable tool for me."
  • Old Robots: Would likely grab a random object or fail because they don't understand "cleaning" or "suitable tool."
  • InstructVLA: Looks at the scene, realizes a sponge is the tool for cleaning, and picks it up. It outperformed previous state-of-the-art robots by a huge margin (96% better than the next best in some tests!).

Summary

InstructVLA is like giving a robot a permanent memory of how the world works while giving it dexterous hands. It doesn't just follow orders; it understands them. It can look at a messy kitchen, figure out what needs to be done, explain its plan, and then execute the task, all without forgetting how to read a book or understand a joke.

It bridges the gap between thinking (understanding the world) and doing (manipulating the world), making robots that are not just tools, but true assistants.