ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

ROCKET is a framework that enhances spatially-aware Vision-Language-Action models: a single shared projector aligns multiple residual streams between a 2D VLA and a 3D vision foundation model, resolving gradient conflicts and achieving state-of-the-art robotic performance with minimal computational overhead.

Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li

Published 2026-02-23

Imagine you are teaching a robot to cook. You give it a recipe (the language instruction) and show it a video of a human chef (the visual input). The robot needs to understand not just what the ingredients are, but where they are in 3D space, how heavy the pan is, and exactly how to move its arm to flip a pancake without burning it.

This is the job of Vision-Language-Action (VLA) models. They are the "brains" of modern robots. But here's the problem: most of these models were trained on flat, 2D photos (like Instagram pictures). They are great at recognizing a cat, but they struggle to understand that the cat is sitting on a table, not floating in the air, or that a cup is behind a book. They lack "spatial sense."

To fix this, researchers usually try to "teach" the robot by showing it a 3D expert model (like a super-smart depth-sensing camera) and saying, "Hey, look at what the expert sees, and try to think like them."

The Problem: The "Too Many Teachers" Confusion

Previous methods tried to do this by picking one specific layer of the robot's brain to copy from the expert.

  • The Analogy: Imagine you are trying to learn to play the piano. Your teacher tells you, "Just copy my hand movements from the 10th measure of the song."
  • The Issue: Sometimes the 10th measure is perfect. Sometimes the 20th is better. If you pick the wrong one, you learn nothing. If you try to copy every measure at once using different teachers for each, your hands get confused. Your left hand tries to copy Teacher A, while your right hand copies Teacher B, and they start fighting each other. In AI terms, this is called gradient interference—the robot's brain gets conflicting signals and stops learning.

The Solution: ROCKET

The authors of this paper created a new method called ROCKET. Think of ROCKET as a brilliant coach who solves the "confused student" problem with three clever tricks.

1. The "Shared Translator" (Shared Projector)

Instead of giving the robot a different translator for every layer of its brain, ROCKET gives it one single, super-smart translator that works for the whole brain.

  • The Analogy: Imagine you are learning a foreign language. Instead of hiring a different translator for every sentence (who might all speak slightly different dialects and confuse you), you hire one master translator who speaks the language perfectly. This translator helps you understand the entire conversation, from the greeting to the goodbye, using a consistent set of rules.
  • Why it works: Because the translator is the same for every layer, the robot's brain doesn't get conflicting signals. All the learning signals point in the same direction, making the robot learn faster and more stably.
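The shared-projector idea can be sketched in a few lines. In this toy example (the dimensions, names, and loss are hypothetical, not the paper's exact architecture), one projection matrix maps hidden states from every VLA layer into the expert's feature space, so every layer's learning signal flows through the same set of "translation rules":

```python
import numpy as np

rng = np.random.default_rng(0)
d_vla, d_expert, n_tokens = 8, 6, 4

# One shared projector reused for EVERY layer (vs. one projector per layer).
W_shared = rng.normal(size=(d_vla, d_expert))

def align_loss(vla_hidden, expert_feat, W):
    """Mean squared error between projected VLA features and expert features."""
    projected = vla_hidden @ W
    return float(np.mean((projected - expert_feat) ** 2))

# Hidden states from three different VLA layers, one expert feature target.
layers = [rng.normal(size=(n_tokens, d_vla)) for _ in range(3)]
expert = rng.normal(size=(n_tokens, d_expert))

# Every layer's gradient flows into the SAME W_shared, so updates are
# reconciled inside one parameter set instead of fighting across many
# separate per-layer projectors.
total = sum(align_loss(h, expert, W_shared) for h in layers)
print(f"total alignment loss across layers: {total:.3f}")
```

With per-layer projectors, each layer could pull its own translator in a different direction; with one shared `W_shared`, those pulls are averaged inside a single parameter set, which is the intuition behind the reduced gradient interference.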

2. The "Matryoshka Doll" Strategy (Sparse Activation)

Here is a tricky part: The robot's brain has "shallow" layers (which see simple things like edges and colors) and "deep" layers (which understand complex concepts like "a cup is full").

  • The Problem: The shallow layers are easy to learn and might try to dominate the learning process, ignoring the complex 3D stuff in the deep layers.
  • The Analogy: Think of a Matryoshka doll (Russian nesting doll). The small dolls inside are simple; the big outer dolls are complex. ROCKET uses a strategy where the "small" (shallow) layers only get to use a tiny part of the translator's brain. The "big" (deep) layers get to use the whole translator.
  • Why it works: This forces the shallow layers to learn the basics quickly without hogging the spotlight, while giving the deep layers the full power they need to understand complex 3D geometry. It balances the workload perfectly.
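The Matryoshka idea can be illustrated with nested prefixes of the same projector. In this sketch (sizes and the zero-padding scheme are illustrative assumptions), a shallow layer is only allowed to use the first few output dimensions of the shared projector, while a deep layer uses all of them, and the shallow projection is literally contained inside the deep one:

```python
import numpy as np

rng = np.random.default_rng(1)
d_vla, d_expert = 8, 6
W = rng.normal(size=(d_vla, d_expert))

def project(hidden, W, active_dims):
    """Project using only the first `active_dims` output dimensions,
    zero-padding the rest (a Matryoshka-style nested prefix)."""
    out = hidden @ W[:, :active_dims]
    pad = np.zeros((hidden.shape[0], W.shape[1] - active_dims))
    return np.concatenate([out, pad], axis=1)

h = rng.normal(size=(4, d_vla))

shallow = project(h, W, active_dims=2)   # shallow layer: tiny slice
deep    = project(h, W, active_dims=6)   # deep layer: full capacity

# The shallow projection nests inside the deep one: both use the same
# first columns of W, so capacities stack like Matryoshka dolls.
assert np.allclose(shallow[:, :2], deep[:, :2])
```

Because the small "doll" is a prefix of the big one, shallow layers cannot commandeer the projector's full capacity, while deep layers still get all of it, which is the balancing act described above.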

3. The "Residual Stream" View

The paper views the robot's brain through the lens of the "residual stream": in a transformer, each layer adds a small update to a running sum of information, like small waterfalls feeding a river. ROCKET aligns the robot's stream with the expert's stream at multiple points along the river, so 3D knowledge accumulates smoothly from start to finish instead of being injected at a single spot.
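The residual-stream picture can be sketched as follows (the layer function, expert targets, and MSE alignment are stand-ins I chose for illustration): each layer adds its contribution to a running state, and the alignment loss is measured on that accumulated state at several depths rather than at one layer's output alone.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
h = rng.normal(size=(d,))                 # the residual "stream"
expert_states = [rng.normal(size=(d,)) for _ in range(3)]

def block(x):
    """Stand-in for one transformer layer's contribution (a 'waterfall')."""
    return 0.1 * np.tanh(x)

losses = []
for l in range(3):
    h = h + block(h)                      # each layer ADDS to the stream
    # Align the accumulated stream (not just this layer's output)
    # against the expert's state at the same depth.
    losses.append(float(np.mean((h - expert_states[l]) ** 2)))

print([round(v, 3) for v in losses])
```

Aligning the accumulated stream at several depths is what makes the multi-layer supervision coherent: every alignment point sees the same running sum, not an isolated layer output.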

The Results: Fast, Cheap, and Smart

The best part about ROCKET is that it's incredibly efficient.

  • Speed: It learns much faster than previous methods.
  • Cost: It requires only 4% of the compute budget that other top-tier methods need. It's like getting a Ferrari's performance on a bicycle's energy budget.
  • Performance: On standard robot tests (like the LIBERO benchmark), ROCKET achieved a 98.5% success rate, beating almost every other method, including those that use expensive 3D sensors.

Summary

ROCKET is a new way to teach robots how to "see" in 3D. Instead of confusing the robot with too many different teachers, it uses one consistent translator and a smart balancing act (the Matryoshka strategy) to ensure the robot learns both simple and complex spatial skills efficiently. It's a simple, scalable, and highly effective way to give robots the spatial awareness they need to navigate our physical world.
