ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

ROCKET is a framework that enhances spatially-aware Vision-Language-Action models: a single shared projector aligns multiple residual streams between a 2D VLA and a 3D vision foundation model, resolving gradient conflicts and achieving state-of-the-art robotic performance with minimal computational overhead.

Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li

Published 2026-02-23

Imagine you are teaching a robot to cook. You give it a recipe (the language instruction) and show it a video of a human chef (the visual input). The robot needs to understand not just what the ingredients are, but where they are in 3D space, how heavy the pan is, and exactly how to move its arm to flip a pancake without burning it.

This is the job of Vision-Language-Action (VLA) models. They are the "brains" of modern robots. But here's the problem: most of these models were trained on flat, 2D photos (like Instagram pictures). They are great at recognizing a cat, but they struggle to understand that the cat is sitting on a table, not floating in the air, or that a cup is behind a book. They lack "spatial sense."

To fix this, researchers usually try to "teach" the robot by showing it a 3D expert model (like a super-smart depth-sensing camera) and saying, "Hey, look at what the expert sees, and try to think like them."

The Problem: The "Too Many Teachers" Confusion

Previous methods tried to do this by picking one specific layer of the robot's brain to copy from the expert.

  • The Analogy: Imagine you are trying to learn to play the piano. Your teacher tells you, "Just copy my hand movements from the 10th measure of the song."
  • The Issue: Sometimes the 10th measure is perfect. Sometimes the 20th is better. If you pick the wrong one, you learn nothing. If you try to copy every measure at once using different teachers for each, your hands get confused. Your left hand tries to copy Teacher A, while your right hand copies Teacher B, and they start fighting each other. In AI terms, this is called gradient interference—the robot's brain gets conflicting signals and stops learning.

The Solution: ROCKET

The authors of this paper created a new method called ROCKET. Think of ROCKET as a brilliant coach who solves the "confused student" problem with three clever tricks.

1. The "Shared Translator" (Shared Projector)

Instead of giving the robot a different translator for every layer of its brain, ROCKET gives it one single, super-smart translator that works for the whole brain.

  • The Analogy: Imagine you are learning a foreign language. Instead of hiring a different translator for every sentence (who might all speak slightly different dialects and confuse you), you hire one master translator who speaks the language perfectly. This translator helps you understand the entire conversation, from the greeting to the goodbye, using a consistent set of rules.
  • Why it works: Because the translator is the same for every layer, the robot's brain doesn't get conflicting signals. All the learning signals point in the same direction, making the robot learn faster and more stably.
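The shared-projector idea can be sketched in a few lines. In this toy example (the dimensions, names, and loss are hypothetical, not the paper's exact architecture), one projection matrix maps hidden states from every VLA layer into the expert's feature space, so every layer's learning signal flows through the same set of "translation rules":

```python
import numpy as np

rng = np.random.default_rng(0)
d_vla, d_expert, n_tokens = 8, 6, 4

# One shared projector reused for EVERY layer (vs. one projector per layer).
W_shared = rng.normal(size=(d_vla, d_expert))

def align_loss(vla_hidden, expert_feat, W):
    """Mean squared error between projected VLA features and expert features."""
    projected = vla_hidden @ W
    return float(np.mean((projected - expert_feat) ** 2))

# Hidden states from three different VLA layers, one expert feature target.
layers = [rng.normal(size=(n_tokens, d_vla)) for _ in range(3)]
expert = rng.normal(size=(n_tokens, d_expert))

# Every layer's gradient flows into the SAME W_shared, so updates are
# reconciled inside one parameter set instead of fighting across many
# separate per-layer projectors.
total = sum(align_loss(h, expert, W_shared) for h in layers)
print(f"total alignment loss across layers: {total:.3f}")
```

With per-layer projectors, each layer could pull its own translator in a different direction; with one shared `W_shared`, those pulls are averaged inside a single parameter set, which is the intuition behind the reduced gradient interference.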

2. The "Matryoshka Doll" Strategy (Sparse Activation)

Here is a tricky part: The robot's brain has "shallow" layers (which see simple things like edges and colors) and "deep" layers (which understand complex concepts like "a cup is full").

  • The Problem: The shallow layers are easy to learn and might try to dominate the learning process, ignoring the complex 3D stuff in the deep layers.
  • The Analogy: Think of a Matryoshka doll (Russian nesting doll). The small dolls inside are simple; the big outer dolls are complex. ROCKET uses a strategy where the "small" (shallow) layers only get to use a tiny part of the translator's brain. The "big" (deep) layers get to use the whole translator.
  • Why it works: This forces the shallow layers to learn the basics quickly without hogging the spotlight, while giving the deep layers the full power they need to understand complex 3D geometry. It balances the workload perfectly.
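The Matryoshka idea can be illustrated with nested prefixes of the same projector. In this sketch (sizes and the zero-padding scheme are illustrative assumptions), a shallow layer is only allowed to use the first few output dimensions of the shared projector, while a deep layer uses all of them, and the shallow projection is literally contained inside the deep one:

```python
import numpy as np

rng = np.random.default_rng(1)
d_vla, d_expert = 8, 6
W = rng.normal(size=(d_vla, d_expert))

def project(hidden, W, active_dims):
    """Project using only the first `active_dims` output dimensions,
    zero-padding the rest (a Matryoshka-style nested prefix)."""
    out = hidden @ W[:, :active_dims]
    pad = np.zeros((hidden.shape[0], W.shape[1] - active_dims))
    return np.concatenate([out, pad], axis=1)

h = rng.normal(size=(4, d_vla))

shallow = project(h, W, active_dims=2)   # shallow layer: tiny slice
deep    = project(h, W, active_dims=6)   # deep layer: full capacity

# The shallow projection nests inside the deep one: both use the same
# first columns of W, so capacities stack like Matryoshka dolls.
assert np.allclose(shallow[:, :2], deep[:, :2])
```

Because the small "doll" is a prefix of the big one, shallow layers cannot commandeer the projector's full capacity, while deep layers still get all of it, which is the balancing act described above.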

3. The "Residual Stream" View

The paper views the robot's brain through the lens of the "residual stream": in a transformer, each layer adds a small update to a running sum of information, like small waterfalls feeding a river. ROCKET aligns the robot's stream with the expert's stream at multiple points along the river, so 3D knowledge accumulates smoothly from start to finish instead of being injected at a single spot.
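The residual-stream picture can be sketched as follows (the layer function, expert targets, and MSE alignment are stand-ins I chose for illustration): each layer adds its contribution to a running state, and the alignment loss is measured on that accumulated state at several depths rather than at one layer's output alone.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
h = rng.normal(size=(d,))                 # the residual "stream"
expert_states = [rng.normal(size=(d,)) for _ in range(3)]

def block(x):
    """Stand-in for one transformer layer's contribution (a 'waterfall')."""
    return 0.1 * np.tanh(x)

losses = []
for l in range(3):
    h = h + block(h)                      # each layer ADDS to the stream
    # Align the accumulated stream (not just this layer's output)
    # against the expert's state at the same depth.
    losses.append(float(np.mean((h - expert_states[l]) ** 2)))

print([round(v, 3) for v in losses])
```

Aligning the accumulated stream at several depths is what makes the multi-layer supervision coherent: every alignment point sees the same running sum, not an isolated layer output.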

The Results: Fast, Cheap, and Smart

The best part about ROCKET is that it's incredibly efficient.

  • Speed: It learns much faster than previous methods.
  • Cost: It requires only 4% of the compute budget that other top-tier methods need. It's like getting a Ferrari's performance on a bicycle's energy budget.
  • Performance: On standard robot tests (like the LIBERO benchmark), ROCKET achieved a 98.5% success rate, beating almost every other method, including those that use expensive 3D sensors.

Summary

ROCKET is a new way to teach robots how to "see" in 3D. Instead of confusing the robot with too many different teachers, it uses one consistent translator and a smart balancing act (the Matryoshka strategy) to ensure the robot learns both simple and complex spatial skills efficiently. It's a simple, scalable, and highly effective way to give robots the spatial awareness they need to navigate our physical world.
