UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

The paper introduces UniLACT, a vision-language-action model that leverages the UniLARN framework to learn depth-aware latent action representations from unlabeled RGB-D videos, thereby enhancing spatial priors and outperforming RGB-only baselines in both simulation and real-world manipulation tasks.

Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das

Published 2026-02-25

Imagine you are teaching a robot to do chores, like picking up a carrot and putting it in a bowl.

Most robots today learn by watching thousands of videos of humans doing these tasks. They look at the video and try to guess, "What move should I make next?" This is like learning to drive by watching a movie of a car driving. You see the car turn, but you don't feel the road, the wind, or the distance to the curb. You only see the colors and shapes.

The problem is that robots need to know depth (how far away things are) to avoid crashing. If a robot only sees a 2D picture, it might think the bowl is right next to the carrot, when in reality the bowl is three inches farther away. If it tries to grab, it might smash the bowl.

This paper introduces a new robot brain called UniLACT. Here is how it works, explained simply:

1. The Problem: The "Flat" Robot

Current robot brains learn from "RGB" videos (standard red-green-blue camera footage). They are great at recognizing that a carrot is orange and a bowl is white. But they are "blind" to the third dimension. They are like a person trying to catch a ball while wearing blinders that only show color, not distance. They struggle with tasks that require touching things precisely, like stacking blocks or avoiding collisions.

2. The Solution: The "3D Glasses" Training

The researchers created a two-step training process to give the robot "3D glasses" without actually making it wear glasses while it works.

Step 1: The "Shadow Puppet" Teacher (UniLARN)

First, they built a special training tool called UniLARN.

  • The Analogy: Imagine you are teaching a student to draw a 3D cube. You show them a flat picture of the cube (RGB) and a shadow of the cube cast on a wall (Depth).
  • How it works: UniLARN looks at both the video and the depth map (an image where each pixel records how far away that point is, a bit like a shadow that encodes distance) at the same time. It learns to compress this information into a single "secret code" (a latent action).
  • The Magic: It forces the robot to understand that "moving the arm forward" isn't just a color change; it's a change in distance. It creates a unified secret code that contains both the look of the object and its shape and location.
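The "secret code" idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual architecture: the toy encoder, the two-dimensional features, and the four-entry codebook are all invented for the example. The key idea it shows is real, though: changes in both appearance (RGB) and distance (depth) are compressed into a single discrete code by snapping to the nearest codebook entry.

```python
# Hedged sketch of depth-aware latent-action codes (UniLARN-style).
# All names, dimensions, and the codebook are illustrative assumptions.
import math

CODEBOOK = [  # toy dictionary of "secret codes": each row is one latent action
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
]

def encode(rgb_pair, depth_pair):
    """Toy encoder: summarize the change between two consecutive frames.
    Each argument is (frame_t, frame_t_plus_1), frames as flat float lists."""
    rgb_delta = sum(b - a for a, b in zip(*rgb_pair)) / len(rgb_pair[0])
    depth_delta = sum(b - a for a, b in zip(*depth_pair)) / len(depth_pair[0])
    return [rgb_delta, depth_delta]  # appearance change + distance change

def quantize(z):
    """Snap the continuous feature to the nearest codebook entry: this index
    is the discrete 'secret code' the policy will later learn to predict."""
    dists = [math.dist(z, c) for c in CODEBOOK]
    return min(range(len(CODEBOOK)), key=dists.__getitem__)

# Example: the arm moves toward the camera, so depth shrinks while
# the colors barely change. Only the depth channel carries the motion.
rgb_t, rgb_t1 = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
depth_t, depth_t1 = [1.0, 1.0, 1.0], [0.2, 0.2, 0.2]
code = quantize(encode((rgb_t, rgb_t1), (depth_t, depth_t1)))
```

Note how an RGB-only encoder would see no change at all here; adding the depth channel is what makes the motion visible in the code.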

Step 2: The "Student" Learns the Code (UniLACT)

Next, they train the main robot brain, UniLACT, using these secret codes.

  • The Analogy: Now, the student (UniLACT) is taking a test. The teacher gives them a task ("Pick up the carrot") and a video of the scene. The student has to guess the "Secret Code" for the next move.
  • The Twist: Even though the student was trained using the "shadows" (depth), the test is given without the shadows. The student has to rely on the memory of the 3D structure they learned during training.
  • The Result: Because the student learned the 3D structure during training, they can now look at a flat video and "imagine" the depth. They know exactly how far to reach without crashing.
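The student/teacher setup above can also be sketched. Again this is a toy, under stated assumptions: the "student" here is a nearest-prototype classifier rather than a real vision-language-action model, and all names are invented. What it illustrates is the distillation pattern: the student is trained to reproduce the depth-aware teacher's codes from RGB features alone, and at test time it predicts a code with no depth input at all.

```python
# Hedged sketch of the second stage: an RGB-only student learns to predict
# the depth-aware teacher's latent-action codes. Toy model, illustrative names.

def train_student(rgb_features, teacher_codes):
    """Store the mean RGB feature observed for each teacher code."""
    buckets = {}
    for feat, code in zip(rgb_features, teacher_codes):
        buckets.setdefault(code, []).append(feat)
    return {c: sum(fs) / len(fs) for c, fs in buckets.items()}

def predict(prototypes, rgb_feature):
    """RGB-only inference: pick the code whose prototype is closest.
    No depth map is given here; the 3D knowledge lives in the codes."""
    return min(prototypes, key=lambda c: abs(prototypes[c] - rgb_feature))

# Toy data: 1-D RGB features paired with the codes the depth-aware
# teacher assigned during pretraining (3 = "reach forward", 1 = "lift").
feats = [0.1, 0.2, 0.9, 1.0]
codes = [3, 3, 1, 1]
protos = train_student(feats, codes)
pred = predict(protos, 0.15)  # depth-free test-time prediction
```

The design point is that depth is a training-time signal only: once the codes are learned, deployment needs nothing but the standard camera.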

3. The Real-World Test

The researchers tested this on a real robot arm.

  • The Old Way (Moto): The robot tried to pick up a carrot and put it in a bowl. It saw the carrot, reached out, but because it couldn't "feel" the distance, it missed the bowl and knocked it over.
  • The New Way (UniLACT): The robot saw the carrot, remembered the 3D training, reached out with precision, and dropped the carrot right into the bowl without touching the sides.

Why This Matters

Think of it like learning to ride a bike.

  • Old Method: You watch a video of someone riding. You learn the colors of the bike and the road. When you try it, you fall because you don't understand balance or distance.
  • UniLACT Method: You watch the video, but you also wear special goggles that show you the distance to the ground and other objects. You practice with the goggles. Then, when you take the goggles off to ride for real, your brain still remembers the 3D map. You ride smoothly and don't crash.

In short: UniLACT teaches robots to "see" in 3D by training them with depth information, even though they only use standard cameras when they are actually doing the job. This makes them much better at delicate tasks like cooking, cleaning, and assembling things.
