From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

FALCON addresses the spatial reasoning limitations of existing 2D-based vision-language-action models by leveraging spatial foundation models to inject rich 3D geometric priors directly into the action head, achieving state-of-the-art performance across diverse simulation and real-world tasks without requiring architectural changes or specialized sensors.

Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

Published Wed, 11 Ma

Imagine you are teaching a robot to make a sandwich. You tell it, "Put the peanut butter on the bread."

Older robot brains (called VLA models) are like a person who has read a million cookbooks but has never actually been in a kitchen. They understand the words perfectly. They know what "peanut butter" and "bread" are. But if you ask them to reach for the jar, they might grab the wrong one, miss the bread entirely, or try to put the jar inside the bread because they only see the world as a flat 2D picture, like a photograph. They lack a sense of depth and space.

The paper introduces a new robot brain called FALCON (From Spatial to Actions). Think of FALCON as giving that robot a pair of 3D glasses and a sense of balance.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Flat World" Trap

Most current robots are built on "2D encoders." They look at the world like a painting.

  • The Issue: If you hold a cup close to the camera, it looks huge. If you hold it far away, it looks tiny. A 2D robot gets confused. It doesn't know how far to reach or how big the object really is.
  • The Result: The robot struggles with things like stacking blocks of different sizes, reaching for items on high shelves, or navigating a messy room where objects are hidden behind others.

2. The Solution: FALCON's "3D Glasses"

FALCON solves this by injecting 3D spatial tokens into the robot's decision-making process.

  • The Analogy: Imagine the robot's brain has two main parts:
    1. The Librarian (The VLM): This part reads your instructions and understands the meaning of the words. It knows you want a "red cup."
    2. The Pilot (The Action Head): This part actually moves the arm.
  • The Innovation: In the past, researchers tried to force the Librarian to also be the Pilot, which confused the Librarian and made it forget how to read.
  • FALCON's Trick: FALCON keeps the Librarian pure. It lets the Librarian do what it does best (understand language). Then, it takes a separate, specialized "Spatial Sense" module and hands a map of the 3D world directly to the Pilot. The Pilot now knows exactly how far to reach and how to grip, without confusing the Librarian.
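To make the "Librarian vs. Pilot" split concrete, here is a minimal toy sketch of the idea: the language model stays frozen and never sees 3D data, while the action head receives both the language feature and separately computed spatial tokens. All names, dimensions, and functions here are illustrative stand-ins, not FALCON's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper's actual sizes are not stated here.
LANG_DIM, SPATIAL_DIM, ACTION_DIM = 8, 4, 7  # e.g. a 7-DoF arm action

def frozen_vlm(instruction: str) -> np.ndarray:
    """Stand-in for the frozen 'Librarian': maps text to a language feature.
    A real VLM would run a transformer; here we just hash text to a vector."""
    seed = abs(hash(instruction)) % (2**32)
    return np.random.default_rng(seed).standard_normal(LANG_DIM)

def spatial_module(observation) -> np.ndarray:
    """Stand-in for the spatial foundation model: emits 3D spatial tokens."""
    return np.asarray(observation, dtype=float)[:SPATIAL_DIM]

# Only the action head (the 'Pilot') consumes the spatial tokens;
# the VLM's weights and inputs are untouched, so it cannot be "confused".
W = rng.standard_normal((ACTION_DIM, LANG_DIM + SPATIAL_DIM))

def action_head(lang_feat: np.ndarray, spatial_tokens: np.ndarray) -> np.ndarray:
    fused = np.concatenate([lang_feat, spatial_tokens])
    return np.tanh(W @ fused)  # bounded action, e.g. end-effector deltas

action = action_head(frozen_vlm("pick up the red cup"),
                     spatial_module([0.3, 0.1, 0.6, 0.2]))
print(action.shape)
```

The design point the sketch captures: spatial information enters at the fusion step inside the action head, so improving the robot's 3D sense never requires retraining or fine-tuning the language side.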

3. The "Embodied Spatial Model": The Swiss Army Knife

One of the coolest features of FALCON is its flexibility.

  • Scenario A (No 3D Sensors): If the robot only has a standard camera (like a phone), FALCON uses a "magic trick" (a foundation model) to guess the depth and 3D shape of the room just by looking at the flat image. It's like looking at a photo of a mountain and instinctively knowing the peak is far away.
  • Scenario B (With 3D Sensors): If the robot does have a fancy depth camera (like a LiDAR or a 3D sensor), FALCON can plug that data in too.
  • The Benefit: You don't need to retrain the robot or change its brain. It works the same whether you give it a cheap camera or a super-expensive 3D scanner. It's the ultimate "plug-and-play" spatial brain.

4. Why It Matters: The "Cluttered Kitchen" Test

The researchers tested FALCON in messy, real-world scenarios:

  • The Challenge: "Pick up the red cup that is behind the blue box."
  • Old Robots: Often crash into the blue box or grab the wrong cup because they can't "see" the depth.
  • FALCON: Successfully navigates the clutter, understands that the red cup is further back, and reaches around the obstacle.

It also handles size changes better. If you ask a robot to stack a small block on a big one, old robots often drop the small one because they misjudge the size. FALCON gets it right because it has a true sense of scale.

Summary

FALCON is like giving a robot a brain that combines the wisdom of a language expert with the spatial awareness of a human.

  • It doesn't force the robot to "think" in 3D while trying to read a sentence (which causes confusion).
  • Instead, it gives the robot a dedicated "spatial sense" that feeds directly into its hands.
  • It works with cheap cameras or expensive 3D sensors, making it ready for real-world homes and factories.

In short: FALCON stops robots from being "clumsy dreamers" who understand words but can't reach, and turns them into "skilled workers" who can actually do the job.