Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

The paper introduces Green-VLA, a five-stage curriculum framework that combines large-scale multimodal pretraining, embodiment-specific adaptation, and reinforcement learning. The result is a single generalist policy that can robustly control diverse robotic systems, including the Green humanoid, with enhanced safety and long-horizon efficiency.

I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov

Published 2026-03-10

Imagine you want to teach a robot to be a helpful butler. In the past, you'd have to show it exactly how to pick up a specific cup, then a specific plate, then a specific spoon, one by one. If you gave it a new cup it had never seen, it would likely drop it.

Green-VLA is a new, smarter way to teach robots. Instead of just memorizing millions of specific movements, it teaches the robot to understand the world, learn the general rules of physics, and then practice until it gets really good at the job.

Here is how they did it, broken down into simple steps:

1. The "Five-Stage School" Curriculum

Instead of throwing the robot into the deep end, the researchers put it through a five-stage school system. Think of it like a human growing up:

  • Stage 0 (The Baby): The robot starts with a "brain" that already knows a lot about language and pictures (like a smart baby who has read every book in the library). It knows what a "cup" is, but it doesn't know how to hold one yet.
  • Stage 1 (The Explorer): The robot watches millions of videos of people doing things online. It learns that if you push a cup, it slides; if you drop it, it breaks. It learns the "common sense" of how the physical world works.
  • Stage 2 (The Intern): Now, the robot watches videos of other robots (arms, mobile bots, humanoids) doing tasks. It learns that "grabbing" looks different on a robot with a claw versus a robot with fingers, but the goal is the same. It learns to translate between different robot bodies.
  • Stage 3 (The Specialist): The robot is finally assigned its specific body (the "Green" humanoid robot). It practices specifically with its own arms and hands, learning exactly how its joints move.
  • Stage 4 (The Apprentice with a Coach): This is the secret sauce. The robot tries to do a task. If it fails, a "coach" (Reinforcement Learning) doesn't just say "try again." It says, "You were too slow," or "You dropped it because you didn't squeeze hard enough." The robot learns from its mistakes and gets better at long, complicated tasks.
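The five stages above can be sketched as a training schedule that threads one policy checkpoint through each phase. The stage names, data sources, and objectives below are illustrative assumptions for the sketch, not the paper's exact configuration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One phase of a hypothetical Green-VLA-style training curriculum."""
    name: str
    data_source: str
    objective: str

# Illustrative schedule; names and data sources are assumptions.
CURRICULUM = [
    Stage("stage0_vlm_init",   "web image-text pairs",       "vision-language pretraining"),
    Stage("stage1_world",      "human web videos",           "physical common-sense prediction"),
    Stage("stage2_crossbody",  "multi-robot demonstrations", "cross-embodiment action learning"),
    Stage("stage3_embodiment", "Green humanoid teleop data", "embodiment-specific fine-tuning"),
    Stage("stage4_rl",         "on-robot rollouts",          "reinforcement learning with reward feedback"),
]

def run_curriculum(train_fn):
    """Apply each stage in order, passing the updated checkpoint forward."""
    checkpoint = "pretrained_vlm"
    for stage in CURRICULUM:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint
```

The key design point is that each stage starts from the previous stage's weights rather than from scratch, which is what lets the "common sense" learned early survive into the final specialized policy.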

2. The "Universal Translator" for Robot Arms

One of the biggest problems in robotics is that every robot is built differently. One has two arms, one has one, one has wheels, one has legs. Usually, you have to train a separate brain for each one.

Green-VLA uses a Universal Action Space. Imagine a universal remote control. Whether you are controlling a TV, a fan, or a light, the remote uses the same buttons (Volume Up, Power, Channel).

  • Green-VLA translates every robot's specific movements into this "universal language."
  • This means the robot can learn from a dual-arm factory robot and apply that knowledge to a humanoid robot, even if they look totally different. It's like learning to drive a truck and then easily figuring out how to drive a car because you understand the concept of "steering" and "braking."
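One common way to realize such a shared action space is to pack every robot's commands into one fixed-size vector, with a mask marking which slots a given body actually uses. The slot layout below is an illustrative assumption, not the paper's actual scheme:

```python
import numpy as np

# A shared slot layout for the universal action vector (illustrative).
UNIVERSAL_SLOTS = {
    "left_arm": 7, "right_arm": 7, "left_gripper": 1,
    "right_gripper": 1, "base": 3,
}
DIM = sum(UNIVERSAL_SLOTS.values())  # 19 dimensions in this sketch

def to_universal(embodiment_action: dict):
    """Pack a robot-specific action dict into the shared vector.

    Returns (action, mask); the mask marks which slots this embodiment
    uses, so a single-arm robot and a humanoid share one action space."""
    action = np.zeros(DIM)
    mask = np.zeros(DIM, dtype=bool)
    offset = 0
    for slot, size in UNIVERSAL_SLOTS.items():
        if slot in embodiment_action:
            action[offset:offset + size] = np.asarray(embodiment_action[slot], dtype=float)
            mask[offset:offset + size] = True
        offset += size
    return action, mask

# A single-arm robot only fills the slots it has; the rest stay masked out.
a, m = to_universal({"right_arm": [0.1] * 7, "right_gripper": [1.0]})
```

Because every embodiment writes into the same slots, a policy trained on one robot's data can read another robot's actions without a separate output head per body.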

3. The "Quality Control" Filter

The internet is full of bad data: blurry videos, shaky camera footage, and robots moving weirdly. If you train a robot on bad data, it becomes clumsy.

The team built a DataQA Pipeline (a quality inspector).

  • It acts like a strict film editor. It automatically throws away shaky videos, blurry frames, or clips where the robot isn't moving.
  • It also "smooths out" the good videos. If one robot moves slowly and another moves fast, the system adjusts the speed so they look like they are moving at the same rhythm. This helps the robot learn the pattern of the movement, not just the speed.

4. The "GPS" for Picking Things Up

Imagine you tell the robot, "Pick up the blue bottle of shampoo." But the bottle is hidden behind a box, or it's a brand the robot has never seen before. A normal robot might get confused and grab the wrong thing.

Green-VLA has a special Guidance Module (like a GPS for its hands).

  • Before it even moves, it uses its "eyes" and "brain" to predict exactly where the blue bottle is in 3D space.
  • It then draws an invisible line from its hand to that bottle and guides its movement along that line. This helps it grab the right item even if it's never seen that specific bottle before.
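Geometrically, that "invisible line" is just a sequence of waypoints interpolated between the hand and the predicted 3D target. The sketch below shows only this geometric part; in a real guidance module the line would be blended with the learned policy's actions, and how Green-VLA does that blending is not specified here:

```python
import numpy as np

def guidance_waypoints(hand_xyz, target_xyz, n_steps=10):
    """Straight-line guidance from the end-effector to a predicted 3D
    target, expressed as a sequence of intermediate waypoints."""
    hand = np.asarray(hand_xyz, dtype=float)
    target = np.asarray(target_xyz, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]  # 0 = hand, 1 = target
    return hand + alphas * (target - hand)
```

Because the target position comes from the vision-language model rather than from a fixed object database, the same guidance works for objects the robot has never seen before.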

5. The "Safety Net"

When a robot is learning, it might accidentally try to do something dangerous or impossible (like reaching through a table).

  • Green-VLA has an Out-of-Distribution Detector. It's like a safety guard. If the robot starts to move in a way that is "weird" or outside of what it has learned, the system gently nudges it back to a safe path before it crashes or breaks something.
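One simple way to implement such a guard is to flag actions that fall far outside the statistics of the training data and clip them back toward it. The per-dimension z-score check below is an illustrative stand-in; the paper's actual detector is not described here:

```python
import numpy as np

class OODGuard:
    """Flag actions outside the training distribution and nudge them back.

    Uses a per-dimension z-score envelope (illustrative assumption)."""

    def __init__(self, train_actions, z_max=3.0):
        acts = np.asarray(train_actions, dtype=float)
        self.mean = acts.mean(axis=0)
        self.std = acts.std(axis=0) + 1e-8  # avoid division by zero
        self.z_max = z_max

    def check_and_correct(self, action):
        """Return (possibly corrected action, was_out_of_distribution)."""
        z = (np.asarray(action, dtype=float) - self.mean) / self.std
        if np.any(np.abs(z) > self.z_max):
            # Out of distribution: clip back inside the safe envelope.
            z = np.clip(z, -self.z_max, self.z_max)
            return self.mean + z * self.std, True
        return np.asarray(action, dtype=float), False
```

The "gentle nudge" is the clipping step: instead of vetoing the action outright, the guard projects it onto the nearest point of the region the robot has actually practiced in.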

The Result

The final robot, named Green, can:

  • Understand complex instructions like "Clean the table and put the apples in the basket."
  • Use both hands at the same time (bimanual manipulation).
  • Handle new objects it has never seen before.
  • Recover from mistakes (if it drops something, it picks it up again without panicking).

In short: Green-VLA isn't just a robot that memorizes a script. It's a robot that understands the world, speaks a universal language of movement, and learns from a coach to become a reliable, general-purpose helper.