UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

The paper introduces UniLACT, a vision-language-action model that leverages the UniLARN framework to learn depth-aware latent action representations from unlabeled RGB-D videos, thereby enhancing spatial priors and outperforming RGB-only baselines in both simulation and real-world manipulation tasks.

Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das

Published 2026-02-25

Imagine you are teaching a robot to do chores, like picking up a carrot and putting it in a bowl.

Most robots today learn by watching thousands of videos of humans doing these tasks. They look at the video and try to guess, "What move should I make next?" This is like learning to drive by watching a movie of a car driving. You see the car turn, but you don't feel the road, the wind, or the distance to the curb. You only see the colors and shapes.

The problem is that robots need to know depth (how far away things are) to avoid crashing. If a robot only sees a 2D picture, it might think the bowl is right next to the carrot, when in reality the bowl is three inches farther away. If it tries to grab, it might smash the bowl.

This paper introduces a new robot brain called UniLACT. Here is how it works, explained simply:

1. The Problem: The "Flat" Robot

Current robot brains learn from "RGB" videos (standard red-green-blue camera footage). They are great at recognizing that a carrot is orange and a bowl is white. But they are "blind" to the third dimension. They are like a person trying to catch a ball while wearing blinders that only show color, not distance. They struggle with tasks that require touching things precisely, like stacking blocks or avoiding collisions.

2. The Solution: The "3D Glasses" Training

The researchers created a two-step training process to give the robot "3D glasses" without actually making it wear glasses while it works.

Step 1: The "Shadow Puppet" Teacher (UniLARN)

First, they built a special training tool called UniLARN.

  • The Analogy: Imagine you are teaching a student to draw a 3D cube. You show them a flat picture of the cube (RGB) and a shadow of the cube cast on a wall (Depth).
  • How it works: UniLARN looks at both the video and the depth map (an image where each pixel records how far away that point is, a bit like a shadow that encodes distance) at the same time. It learns to compress this information into a single "secret code" (a latent action).
  • The Magic: It forces the robot to understand that "moving the arm forward" isn't just a color change; it's a change in distance. It creates a unified secret code that contains both the look of the object and its shape and location.
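The "secret code" idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual architecture: the toy encoder, the two-dimensional features, and the four-entry codebook are all invented for the example. The key idea it shows is real, though: changes in both appearance (RGB) and distance (depth) are compressed into a single discrete code by snapping to the nearest codebook entry.

```python
# Hedged sketch of depth-aware latent-action codes (UniLARN-style).
# All names, dimensions, and the codebook are illustrative assumptions.
import math

CODEBOOK = [  # toy dictionary of "secret codes": each row is one latent action
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
]

def encode(rgb_pair, depth_pair):
    """Toy encoder: summarize the change between two consecutive frames.
    Each argument is (frame_t, frame_t_plus_1), frames as flat float lists."""
    rgb_delta = sum(b - a for a, b in zip(*rgb_pair)) / len(rgb_pair[0])
    depth_delta = sum(b - a for a, b in zip(*depth_pair)) / len(depth_pair[0])
    return [rgb_delta, depth_delta]  # appearance change + distance change

def quantize(z):
    """Snap the continuous feature to the nearest codebook entry: this index
    is the discrete 'secret code' the policy will later learn to predict."""
    dists = [math.dist(z, c) for c in CODEBOOK]
    return min(range(len(CODEBOOK)), key=dists.__getitem__)

# Example: the arm moves toward the camera, so depth shrinks while
# the colors barely change. Only the depth channel carries the motion.
rgb_t, rgb_t1 = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
depth_t, depth_t1 = [1.0, 1.0, 1.0], [0.2, 0.2, 0.2]
code = quantize(encode((rgb_t, rgb_t1), (depth_t, depth_t1)))
```

Note how an RGB-only encoder would see no change at all here; adding the depth channel is what makes the motion visible in the code.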

Step 2: The "Student" Learns the Code (UniLACT)

Next, they train the main robot brain, UniLACT, using these secret codes.

  • The Analogy: Now, the student (UniLACT) is taking a test. The teacher gives them a task ("Pick up the carrot") and a video of the scene. The student has to guess the "Secret Code" for the next move.
  • The Twist: Even though the student was trained using the "shadows" (depth), the test is given without the shadows. The student has to rely on the memory of the 3D structure they learned during training.
  • The Result: Because the student learned the 3D structure during training, they can now look at a flat video and "imagine" the depth. They know exactly how far to reach without crashing.
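The student/teacher setup above can also be sketched. Again this is a toy, under stated assumptions: the "student" here is a nearest-prototype classifier rather than a real vision-language-action model, and all names are invented. What it illustrates is the distillation pattern: the student is trained to reproduce the depth-aware teacher's codes from RGB features alone, and at test time it predicts a code with no depth input at all.

```python
# Hedged sketch of the second stage: an RGB-only student learns to predict
# the depth-aware teacher's latent-action codes. Toy model, illustrative names.

def train_student(rgb_features, teacher_codes):
    """Store the mean RGB feature observed for each teacher code."""
    buckets = {}
    for feat, code in zip(rgb_features, teacher_codes):
        buckets.setdefault(code, []).append(feat)
    return {c: sum(fs) / len(fs) for c, fs in buckets.items()}

def predict(prototypes, rgb_feature):
    """RGB-only inference: pick the code whose prototype is closest.
    No depth map is given here; the 3D knowledge lives in the codes."""
    return min(prototypes, key=lambda c: abs(prototypes[c] - rgb_feature))

# Toy data: 1-D RGB features paired with the codes the depth-aware
# teacher assigned during pretraining (3 = "reach forward", 1 = "lift").
feats = [0.1, 0.2, 0.9, 1.0]
codes = [3, 3, 1, 1]
protos = train_student(feats, codes)
pred = predict(protos, 0.15)  # depth-free test-time prediction
```

The design point is that depth is a training-time signal only: once the codes are learned, deployment needs nothing but the standard camera.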

3. The Real-World Test

The researchers tested this on a real robot arm.

  • The Old Way (Moto): The robot tried to pick up a carrot and put it in a bowl. It saw the carrot, reached out, but because it couldn't "feel" the distance, it missed the bowl and knocked it over.
  • The New Way (UniLACT): The robot saw the carrot, remembered the 3D training, reached out with precision, and dropped the carrot right into the bowl without touching the sides.

Why This Matters

Think of it like learning to ride a bike.

  • Old Method: You watch a video of someone riding. You learn the colors of the bike and the road. When you try it, you fall because you don't understand balance or distance.
  • UniLACT Method: You watch the video, but you also wear special goggles that show you the distance to the ground and other objects. You practice with the goggles. Then, when you take the goggles off to ride for real, your brain still remembers the 3D map. You ride smoothly and don't crash.

In short: UniLACT teaches robots to "see" in 3D by training them with depth information, even though they only use standard cameras when they are actually doing the job. This makes them much better at delicate tasks like cooking, cleaning, and assembling things.
