Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
This paper proposes a bimanual manipulation framework that leverages a pre-trained 3D geometric foundation model to fuse RGB-derived 3D latents, 2D semantic features, and proprioception within a diffusion policy. By jointly predicting actions and the future 3D evolution of the scene, the method achieves state-of-the-art performance without relying on explicit point clouds.
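The core idea can be sketched at a high level: per-step modalities are fused into one conditioning vector, and a denoising network conditioned on it outputs an action chunk together with a future-scene latent. The sketch below is a toy illustration under assumptions, not the paper's implementation: all dimensions (`D_3D`, `D_SEM`, `ACT_DIM`, etc.), the plain concatenation fusion, and the single linear "denoising" step are hypothetical stand-ins for the actual networks.

```python
import numpy as np

# Hypothetical dimensions; none of these come from the paper.
D_3D, D_SEM, D_PROP = 64, 32, 16   # 3D latent, 2D semantic, proprioception dims
ACT_DIM, SCENE_DIM = 14, 48        # bimanual action dim, future-geometry latent dim
HORIZON = 8                        # action-prediction horizon

rng = np.random.default_rng(0)

def fuse_observations(latent_3d, sem_2d, proprio):
    """Fuse per-step modalities into one conditioning vector (simple concat)."""
    return np.concatenate([latent_3d, sem_2d, proprio])

def denoise_step(noisy, cond, W):
    """One toy denoising step: a linear map of [noisy, cond] standing in for
    the policy network; a real diffusion policy iterates many such steps."""
    return W @ np.concatenate([noisy, cond])

# Conditioning from the three modalities.
cond = fuse_observations(rng.normal(size=D_3D),
                         rng.normal(size=D_SEM),
                         rng.normal(size=D_PROP))

# Joint output: an action chunk over the horizon plus a future-scene latent.
out_dim = HORIZON * ACT_DIM + SCENE_DIM
W = rng.normal(size=(out_dim, out_dim + cond.size)) * 0.01
noisy = rng.normal(size=out_dim)
pred = denoise_step(noisy, cond, W)

actions = pred[: HORIZON * ACT_DIM].reshape(HORIZON, ACT_DIM)
future_scene = pred[HORIZON * ACT_DIM:]
print(actions.shape, future_scene.shape)  # (8, 14) (48,)
```

The point of the joint head is that action prediction and scene prediction share one conditioned generative process, so geometric supervision can shape the same representation the policy acts from.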