From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

FALCON addresses the spatial reasoning limitations of existing 2D-based vision-language-action models by leveraging spatial foundation models to inject rich 3D geometric priors directly into the action head, achieving state-of-the-art performance across diverse simulation and real-world tasks without requiring architectural changes or specialized sensors.

Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

Published Wed, 11 Ma

Imagine you are teaching a robot to make a sandwich. You tell it, "Put the peanut butter on the bread."

Older robot brains (called VLA models) are like a person who has read a million cookbooks but has never actually been in a kitchen. They understand the words perfectly. They know what "peanut butter" and "bread" are. But if you ask them to reach for the jar, they might grab the wrong one, miss the bread entirely, or try to put the jar inside the bread because they only see the world as a flat 2D picture, like a photograph. They lack a sense of depth and space.

The paper introduces a new robot brain called FALCON (From Spatial to Actions). Think of FALCON as giving that robot a pair of 3D glasses and a sense of balance.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Flat World" Trap

Most current robots are built on "2D encoders." They look at the world like a painting.

  • The Issue: If you hold a cup close to the camera, it looks huge. If you hold it far away, it looks tiny. A 2D robot gets confused. It doesn't know how far to reach or how big the object really is.
  • The Result: The robot struggles with things like stacking blocks of different sizes, reaching for items on high shelves, or navigating a messy room where objects are hidden behind others.

2. The Solution: FALCON's "3D Glasses"

FALCON solves this by injecting 3D spatial tokens into the robot's decision-making process.

  • The Analogy: Imagine the robot's brain has two main parts:
    1. The Librarian (The VLM): This part reads your instructions and understands the meaning of the words. It knows you want a "red cup."
    2. The Pilot (The Action Head): This part actually moves the arm.
  • The Innovation: In the past, researchers tried to force the Librarian to also be the Pilot, which confused the Librarian and made it forget how to read.
  • FALCON's Trick: FALCON keeps the Librarian pure. It lets the Librarian do what it does best (understand language). Then, it takes a separate, specialized "Spatial Sense" module and hands a map of the 3D world directly to the Pilot. The Pilot now knows exactly how far to reach and how to grip, without confusing the Librarian.
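To make the "Librarian vs. Pilot" split concrete, here is a minimal toy sketch of the idea: the language model stays frozen and never sees 3D data, while the action head receives both the language feature and separately computed spatial tokens. All names, dimensions, and functions here are illustrative stand-ins, not FALCON's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper's actual sizes are not stated here.
LANG_DIM, SPATIAL_DIM, ACTION_DIM = 8, 4, 7  # e.g. a 7-DoF arm action

def frozen_vlm(instruction: str) -> np.ndarray:
    """Stand-in for the frozen 'Librarian': maps text to a language feature.
    A real VLM would run a transformer; here we just hash text to a vector."""
    seed = abs(hash(instruction)) % (2**32)
    return np.random.default_rng(seed).standard_normal(LANG_DIM)

def spatial_module(observation) -> np.ndarray:
    """Stand-in for the spatial foundation model: emits 3D spatial tokens."""
    return np.asarray(observation, dtype=float)[:SPATIAL_DIM]

# Only the action head (the 'Pilot') consumes the spatial tokens;
# the VLM's weights and inputs are untouched, so it cannot be "confused".
W = rng.standard_normal((ACTION_DIM, LANG_DIM + SPATIAL_DIM))

def action_head(lang_feat: np.ndarray, spatial_tokens: np.ndarray) -> np.ndarray:
    fused = np.concatenate([lang_feat, spatial_tokens])
    return np.tanh(W @ fused)  # bounded action, e.g. end-effector deltas

action = action_head(frozen_vlm("pick up the red cup"),
                     spatial_module([0.3, 0.1, 0.6, 0.2]))
print(action.shape)
```

The design point the sketch captures: spatial information enters at the fusion step inside the action head, so improving the robot's 3D sense never requires retraining or fine-tuning the language side.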

3. The "Embodied Spatial Model": The Swiss Army Knife

One of the coolest features of FALCON is its flexibility.

  • Scenario A (No 3D Sensors): If the robot only has a standard camera (like a phone), FALCON uses a "magic trick" (a foundation model) to guess the depth and 3D shape of the room just by looking at the flat image. It's like looking at a photo of a mountain and instinctively knowing the peak is far away.
  • Scenario B (With 3D Sensors): If the robot does have a fancy depth camera (like a LiDAR or a 3D sensor), FALCON can plug that data in too.
  • The Benefit: You don't need to retrain the robot or change its brain. It works the same whether you give it a cheap camera or a super-expensive 3D scanner. It's the ultimate "plug-and-play" spatial brain.

4. Why It Matters: The "Cluttered Kitchen" Test

The researchers tested FALCON in messy, real-world scenarios:

  • The Challenge: "Pick up the red cup that is behind the blue box."
  • Old Robots: Often crash into the blue box or grab the wrong cup because they can't "see" the depth.
  • FALCON: Successfully navigates the clutter, understands that the red cup is further back, and reaches around the obstacle.

It also handles size changes better. If you ask a robot to stack a small block on a big one, old robots often drop the small one because they misjudge the size. FALCON gets it right because it has a true sense of scale.

Summary

FALCON is like giving a robot a brain that combines the wisdom of a language expert with the spatial awareness of a human.

  • It doesn't force the robot to "think" in 3D while trying to read a sentence (which causes confusion).
  • Instead, it gives the robot a dedicated "spatial sense" that feeds directly into its hands.
  • It works with cheap cameras or expensive 3D sensors, making it ready for real-world homes and factories.

In short: FALCON stops robots from being "clumsy dreamers" who understand words but can't reach, and turns them into "skilled workers" who can actually do the job.