Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

This paper proposes a bimanual manipulation framework that leverages a pre-trained 3D geometric foundation model to fuse RGB-based 3D latents, 2D semantics, and proprioception within a diffusion policy, enabling the joint prediction of actions and future 3D scene evolution to achieve state-of-the-art performance without relying on explicit point clouds.

Chongyang Xu, Haipeng Li, Shen Cheng, Jingyu Hu, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Published 2026-03-02

Imagine you are trying to teach a pair of robot arms to perform a delicate dance, like folding a shirt or assembling a toy. The biggest challenge isn't just telling the arms where to move; it's teaching them to understand the 3D world around them and predict what will happen next before they even touch anything.

This paper introduces a new way to teach robots to do this, which the authors call Action-Geometry Prediction. Here is the breakdown in simple terms:

The Problem: The Robot's "Flat" Vision

Most current robot brains are like people wearing 2D glasses. They look at a camera feed (a flat picture) and try to guess where objects are in 3D space.

  • The 2D Approach: It's like trying to judge the distance of a car just by looking at a photograph. It's hard, and the robot often bumps into things or drops them because it doesn't "feel" the depth.
  • The Point Cloud Approach: Other methods try to give the robot a 3D scanner (like a LiDAR or depth camera) to get a cloud of dots representing the world. But in the real world, these sensors are messy. Dust, shiny objects, or bad lighting can make the "dots" disappear or look wrong, causing the robot to freeze or crash.

The Solution: The "Crystal Ball" Robot

The authors propose a robot that doesn't need a 3D scanner. Instead, it uses a pre-trained "Geometric Foundation Model."

Think of this model as a super-smart artist who has studied millions of 3D movies and photos. Even if you only show this artist a flat 2D picture, they can instantly imagine the full 3D shape of the object, how the light hits it, and how it would look from the other side.

The robot uses this "artist" as its eyes. It looks at a standard camera image and instantly builds a mental 3D map of the room.
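For readers who want to see the "artist as eyes" step in code, here is a minimal PyTorch-style sketch: a frozen, pre-trained encoder maps a plain RGB frame to geometry-aware latent tokens, with no depth sensor anywhere. The class `GeometricEncoder` and its layers are placeholders invented for illustration, not the paper's actual foundation model.

```python
import torch
import torch.nn as nn

class GeometricEncoder(nn.Module):
    """Stand-in for a pre-trained 3D geometric foundation model.

    In the paper's setup, a model of this kind (trained on large-scale 3D
    data) turns a plain RGB image into geometry-aware latent tokens. The
    layers below are placeholders, not the real architecture.
    """
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=8),   # patchify the image
            nn.GELU(),
            nn.Conv2d(64, latent_dim, kernel_size=1),    # project to latent dim
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(rgb)                        # (B, D, H/8, W/8)
        return feats.flatten(2).transpose(1, 2)           # (B, num_tokens, D)

encoder = GeometricEncoder()
encoder.requires_grad_(False)   # frozen: the "artist" is pre-trained, not re-taught
encoder.eval()

image = torch.randn(1, 3, 224, 224)   # one ordinary RGB camera frame
tokens_3d = encoder(image)            # geometry-aware latents, no 3D scanner needed
print(tokens_3d.shape)                # torch.Size([1, 784, 256])
```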

The Secret Sauce: Predicting the Future

The real magic of this paper is that the robot doesn't just look at the now; it predicts the future.

Imagine you are juggling. A good juggler doesn't just look at the ball in their hand; they are already thinking, "If I throw this ball up, where will it be in one second? And where will my other hand need to be to catch it?"

This robot does the same thing, but with math:

  1. The Action: It decides what move to make next (e.g., "Grab the cup").
  2. The Imagination: At the exact same time, it predicts what the 3D world will look like after that move is made. It essentially asks, "If I move my arm here, how will the cup and the table look in 3D space?"

By forcing the robot to answer both questions together, it learns to understand the physics of the world much better. It's like practicing a dance by not just memorizing the steps, but also visualizing the entire stage changing as you move.
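To make the "answer both questions together" idea concrete, here is a toy training step with one shared trunk and two heads: one for the action, one for the imagined future scene. To keep it short, this collapses the paper's diffusion policy into plain regression, and every name and number here (`ActionGeometryHead`, the layer sizes, the weight `lam`) is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionGeometryHead(nn.Module):
    def __init__(self, feat_dim=256, action_dim=14, horizon=16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 512), nn.GELU())
        self.action_head = nn.Linear(512, horizon * action_dim)  # "grab the cup"
        self.geometry_head = nn.Linear(512, feat_dim)            # the imagined future scene
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, scene_feat):
        h = self.trunk(scene_feat)
        actions = self.action_head(h).view(-1, self.horizon, self.action_dim)
        future_geo = self.geometry_head(h)   # predicted post-action 3D latent
        return actions, future_geo

model = ActionGeometryHead()
scene_feat = torch.randn(8, 256)          # fused features for the current frame
expert_actions = torch.randn(8, 16, 14)   # demonstrated action chunk
future_latent = torch.randn(8, 256)       # encoder's latent for the *next* frame

actions, future_geo = model(scene_feat)
lam = 0.5  # assumed weighting between the two objectives
loss = F.mse_loss(actions, expert_actions) + lam * F.mse_loss(future_geo, future_latent)
loss.backward()   # one gradient step teaches both skills at once
```

Because both heads share the same trunk, features that help imagine the future scene also shape the chosen action, which is the whole point of training them jointly.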

How It Works (The Analogy)

Think of the robot's brain as a conductor of an orchestra:

  • The Melody: the camera provides the colors and shapes (the 2D semantics).
  • The Harmony: the "Geometric Artist" model provides the depth and structure (the 3D latents).
  • The Rhythm: the robot's own sensors (proprioception) report where its arms are.

The conductor (the AI policy) mixes all these sounds together. Then, instead of just playing the next note, the conductor predicts the next whole section of the song (the action) and imagines how the sound of the orchestra will change (the future 3D shape).
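As a rough sketch of the conductor, the snippet below fuses the three "instrument sections" (2D tokens, 3D tokens, and a proprioception token) with a small transformer, then uses the pooled summary to condition one denoising step over a noisy action chunk. Again, `FusionPolicy` and all the module choices and sizes are assumptions for illustration; the paper's actual fusion and diffusion details may differ.

```python
import torch
import torch.nn as nn

class FusionPolicy(nn.Module):
    def __init__(self, dim=256, action_dim=14, horizon=16):
        super().__init__()
        self.proprio_proj = nn.Linear(action_dim, dim)   # joint state -> one token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)
        self.denoiser = nn.Sequential(
            nn.Linear(dim + horizon * action_dim, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, tokens_2d, tokens_3d, proprio, noisy_actions):
        prop_tok = self.proprio_proj(proprio).unsqueeze(1)             # (B, 1, D)
        fused = self.fuser(torch.cat([tokens_2d, tokens_3d, prop_tok], dim=1))
        cond = fused.mean(dim=1)                                       # pooled scene summary
        x = torch.cat([cond, noisy_actions.flatten(1)], dim=-1)
        noise_pred = self.denoiser(x)                                  # one denoising step
        return noise_pred.view(-1, self.horizon, self.action_dim)

policy = FusionPolicy()
out = policy(torch.randn(2, 784, 256),   # 2D semantic tokens
             torch.randn(2, 784, 256),   # 3D geometric tokens
             torch.randn(2, 14),         # proprioception (both arms)
             torch.randn(2, 16, 14))     # noisy action chunk to refine
print(out.shape)                         # torch.Size([2, 16, 14])
```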

The Results

The researchers tested this on a robot with two arms (a "bimanual" robot) doing tricky tasks like stacking bowls, hanging mugs, and placing shoes.

  • In Simulation: The robot was a clear winner, beating approaches that relied only on 2D cameras as well as those that depended on noisy 3D scans.
  • In the Real World: When they put it on a real robot in a real room, it still worked best. It was able to handle tasks that made other robots fail completely (like hanging a mug without dropping it).

Why This Matters

This is a big deal because it means we can build smarter robots without expensive, fragile 3D sensors. We can just use a standard webcam and a powerful software brain that "imagines" the 3D world. It makes robots safer, cheaper, and better at doing complex jobs that require two hands working together perfectly.

In short: They taught a robot to "see" in 3D using only a 2D camera, and to "think ahead" by imagining how the world will change after it moves. It's like giving the robot a crystal ball to see the future of its own actions.
