Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

The paper introduces Green-VLA, a five-stage curriculum framework that combines large-scale multimodal pretraining, embodiment-specific adaptation, and reinforcement learning. The result is a single generalist policy that can robustly control diverse robotic systems, including the Green humanoid, with enhanced safety and long-horizon efficiency.

I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov

Published 2026-03-10

Imagine you want to teach a robot to be a helpful butler. In the past, you'd have to show it exactly how to pick up a specific cup, then a specific plate, then a specific spoon, one by one. If you gave it a new cup it had never seen, it would likely drop it.

Green-VLA is a new, smarter way to teach robots. Instead of just memorizing millions of specific movements, it teaches the robot to understand the world, learn the general rules of physics, and then practice until it gets really good at the job.

Here is how they did it, broken down into simple steps:

1. The "Five-Stage School" Curriculum

Instead of throwing the robot into the deep end, the researchers put it through a five-stage school system. Think of it like a human growing up:

  • Stage 0 (The Baby): The robot starts with a "brain" that already knows a lot about language and pictures (like a smart baby who has read every book in the library). It knows what a "cup" is, but it doesn't know how to hold one yet.
  • Stage 1 (The Explorer): The robot watches millions of videos of people doing things online. It learns that if you push a cup, it slides; if you drop it, it breaks. It learns the "common sense" of how the physical world works.
  • Stage 2 (The Intern): Now, the robot watches videos of other robots (arms, mobile bots, humanoids) doing tasks. It learns that "grabbing" looks different on a robot with a claw versus a robot with fingers, but the goal is the same. It learns to translate between different robot bodies.
  • Stage 3 (The Specialist): The robot is finally assigned its specific body (the "Green" humanoid robot). It practices specifically with its own arms and hands, learning exactly how its joints move.
  • Stage 4 (The Apprentice with a Coach): This is the secret sauce. The robot tries to do a task. If it fails, a "coach" (Reinforcement Learning) doesn't just say "try again." It says, "You were too slow," or "You dropped it because you didn't squeeze hard enough." The robot learns from its mistakes and gets better at long, complicated tasks.
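The five stages above can be sketched as a training schedule that threads one policy checkpoint through each phase. The stage names, data sources, and objectives below are illustrative assumptions for the sketch, not the paper's exact configuration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One phase of a hypothetical Green-VLA-style training curriculum."""
    name: str
    data_source: str
    objective: str

# Illustrative schedule; names and data sources are assumptions.
CURRICULUM = [
    Stage("stage0_vlm_init",   "web image-text pairs",       "vision-language pretraining"),
    Stage("stage1_world",      "human web videos",           "physical common-sense prediction"),
    Stage("stage2_crossbody",  "multi-robot demonstrations", "cross-embodiment action learning"),
    Stage("stage3_embodiment", "Green humanoid teleop data", "embodiment-specific fine-tuning"),
    Stage("stage4_rl",         "on-robot rollouts",          "reinforcement learning with reward feedback"),
]

def run_curriculum(train_fn):
    """Apply each stage in order, passing the updated checkpoint forward."""
    checkpoint = "pretrained_vlm"
    for stage in CURRICULUM:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint
```

The key design point is that each stage starts from the previous stage's weights rather than from scratch, which is what lets the "common sense" learned early survive into the final specialized policy.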

2. The "Universal Translator" for Robot Arms

One of the biggest problems in robotics is that every robot is built differently. One has two arms, one has one, one has wheels, one has legs. Usually, you have to train a separate brain for each one.

Green-VLA uses a Universal Action Space. Imagine a universal remote control. Whether you are controlling a TV, a fan, or a light, the remote uses the same buttons (Volume Up, Power, Channel).

  • Green-VLA translates every robot's specific movements into this "universal language."
  • This means the robot can learn from a dual-arm factory robot and apply that knowledge to a humanoid robot, even if they look totally different. It's like learning to drive a truck and then easily figuring out how to drive a car because you understand the concept of "steering" and "braking."
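One common way to realize such a shared action space is to pack every robot's commands into one fixed-size vector, with a mask marking which slots a given body actually uses. The slot layout below is an illustrative assumption, not the paper's actual scheme:

```python
import numpy as np

# A shared slot layout for the universal action vector (illustrative).
UNIVERSAL_SLOTS = {
    "left_arm": 7, "right_arm": 7, "left_gripper": 1,
    "right_gripper": 1, "base": 3,
}
DIM = sum(UNIVERSAL_SLOTS.values())  # 19 dimensions in this sketch

def to_universal(embodiment_action: dict):
    """Pack a robot-specific action dict into the shared vector.

    Returns (action, mask); the mask marks which slots this embodiment
    uses, so a single-arm robot and a humanoid share one action space."""
    action = np.zeros(DIM)
    mask = np.zeros(DIM, dtype=bool)
    offset = 0
    for slot, size in UNIVERSAL_SLOTS.items():
        if slot in embodiment_action:
            action[offset:offset + size] = np.asarray(embodiment_action[slot], dtype=float)
            mask[offset:offset + size] = True
        offset += size
    return action, mask

# A single-arm robot only fills the slots it has; the rest stay masked out.
a, m = to_universal({"right_arm": [0.1] * 7, "right_gripper": [1.0]})
```

Because every embodiment writes into the same slots, a policy trained on one robot's data can read another robot's actions without a separate output head per body.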

3. The "Quality Control" Filter

The internet is full of bad data: blurry videos, shaky camera footage, and robots moving weirdly. If you train a robot on bad data, it becomes clumsy.

The team built a DataQA Pipeline (a quality inspector).

  • It acts like a strict film editor. It automatically throws away shaky videos, blurry frames, or clips where the robot isn't moving.
  • It also "smooths out" the good videos. If one robot moves slowly and another moves fast, the system adjusts the speed so they look like they are moving at the same rhythm. This helps the robot learn the pattern of the movement, not just the speed.

4. The "GPS" for Picking Things Up

Imagine you tell the robot, "Pick up the blue bottle of shampoo." But the bottle is hidden behind a box, or it's a brand the robot has never seen before. A normal robot might get confused and grab the wrong thing.

Green-VLA has a special Guidance Module (like a GPS for its hands).

  • Before it even moves, it uses its "eyes" and "brain" to predict exactly where the blue bottle is in 3D space.
  • It then draws an invisible line from its hand to that bottle and guides its movement along that line. This helps it grab the right item even if it's never seen that specific bottle before.
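Geometrically, that "invisible line" is just a sequence of waypoints interpolated between the hand and the predicted 3D target. The sketch below shows only this geometric part; in a real guidance module the line would be blended with the learned policy's actions, and how Green-VLA does that blending is not specified here:

```python
import numpy as np

def guidance_waypoints(hand_xyz, target_xyz, n_steps=10):
    """Straight-line guidance from the end-effector to a predicted 3D
    target, expressed as a sequence of intermediate waypoints."""
    hand = np.asarray(hand_xyz, dtype=float)
    target = np.asarray(target_xyz, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]  # 0 = hand, 1 = target
    return hand + alphas * (target - hand)
```

Because the target position comes from the vision-language model rather than from a fixed object database, the same guidance works for objects the robot has never seen before.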

5. The "Safety Net"

When a robot is learning, it might accidentally try to do something dangerous or impossible (like reaching through a table).

  • Green-VLA has an Out-of-Distribution Detector. It's like a safety guard. If the robot starts to move in a way that is "weird" or outside of what it has learned, the system gently nudges it back to a safe path before it crashes or breaks something.
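One simple way to implement such a guard is to flag actions that fall far outside the statistics of the training data and clip them back toward it. The per-dimension z-score check below is an illustrative stand-in; the paper's actual detector is not described here:

```python
import numpy as np

class OODGuard:
    """Flag actions outside the training distribution and nudge them back.

    Uses a per-dimension z-score envelope (illustrative assumption)."""

    def __init__(self, train_actions, z_max=3.0):
        acts = np.asarray(train_actions, dtype=float)
        self.mean = acts.mean(axis=0)
        self.std = acts.std(axis=0) + 1e-8  # avoid division by zero
        self.z_max = z_max

    def check_and_correct(self, action):
        """Return (possibly corrected action, was_out_of_distribution)."""
        z = (np.asarray(action, dtype=float) - self.mean) / self.std
        if np.any(np.abs(z) > self.z_max):
            # Out of distribution: clip back inside the safe envelope.
            z = np.clip(z, -self.z_max, self.z_max)
            return self.mean + z * self.std, True
        return np.asarray(action, dtype=float), False
```

The "gentle nudge" is the clipping step: instead of vetoing the action outright, the guard projects it onto the nearest point of the region the robot has actually practiced in.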

The Result

The final robot, named Green, can:

  • Understand complex instructions like "Clean the table and put the apples in the basket."
  • Use both hands at the same time (bimanual manipulation).
  • Handle new objects it has never seen before.
  • Recover from mistakes (if it drops something, it picks it up again without panicking).

In short: Green-VLA isn't just a robot that memorizes a script. It's a robot that understands the world, speaks a universal language of movement, and learns from a coach to become a reliable, general-purpose helper.