Imagine you are teaching a robot how to push a heavy, oddly shaped box across a table. You can't just tell the robot, "Push it hard," because if the box is heavy on one side, it might tip over. If it's light, you might push it too hard and send it flying. The robot needs to know the physics of the object: where its weight is concentrated (its center of mass) and how slippery it is.
The problem is, robots usually learn in a perfect video game world (simulation) and then try to do the task in the real world. But the real world is messy, and the robot often gets it wrong because the "game physics" don't match the "real physics."
Phys2Real is a new method that helps the robot bridge this gap. Think of it as a three-step training program that combines guessing, learning, and trusting.
The Three-Step "Phys2Real" Recipe
1. The Digital Twin (Real-to-Sim)
First, the robot needs a perfect digital copy of the object to practice on.
- The Analogy: Imagine taking a photo of a real hammer and using magic software to turn it into a 3D video game model that looks exactly like the real thing, down to the texture and shape.
- What they did: They photographed the real object from multiple angles and used a 3D reconstruction technique called Gaussian Splatting to build a photorealistic digital model. This becomes the "training gym" for the robot.
2. The Two-Brain Strategy (The Core Innovation)
This is the clever part. The robot uses two different "brains" to figure out the object's physics, and then it combines them.
Brain A: The Visual Expert (The VLM)
- Who it is: This is a Vision-Language Model (like a super-smart AI that can see and read).
- What it does: Before the robot even touches the object, the AI looks at a picture and says, "Based on how heavy that hammer head looks, I'd guess the center of weight is here. I'm about 80% sure."
- The Metaphor: This is like you looking at a suitcase and guessing, "That looks heavy on the bottom, so I should lift it from the top." It's a good guess, but it's just a guess.
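One simple way to picture Brain A's output (a sketch, not the paper's actual interface): the visual guess becomes a Gaussian with a mean and a spread, where a higher stated confidence means a tighter spread. The function name and the confidence-to-spread mapping below are illustrative assumptions.

```python
import numpy as np

# Sketch only: turn a VLM's visual guess into a Gaussian prior over the
# object's center of mass. The function name and the mapping from
# confidence to spread are illustrative assumptions, not the paper's API.

def vlm_prior(guessed_com, confidence, max_std=0.05):
    """Return (mean, std) for the visual guess.

    guessed_com: guessed center of mass in metres, e.g. (x, y)
    confidence:  the VLM's self-reported confidence in [0, 1]
    max_std:     spread (metres) assigned to a zero-confidence guess
    """
    std = max_std * (1.0 - confidence)  # more confident -> tighter prior
    return np.asarray(guessed_com, dtype=float), std

# An "80% sure" guess becomes a fairly tight distribution.
mean, std = vlm_prior(guessed_com=(0.02, -0.01), confidence=0.8)
```

The point of keeping the spread around is that later stages can ask not just "where?" but "how sure?".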
Brain B: The Tactile Learner (The RL Policy)
- Who it is: This is the robot's reinforcement learning brain, trained in the simulation.
- What it does: As the robot starts pushing the object, it feels how the object reacts. If the object tips to the left, the robot learns, "Oh, the weight is actually on the right!"
- The Metaphor: This is like you actually picking up the suitcase and feeling it tilt. You adjust your grip based on what you feel.
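The "adjust as you feel" idea can be sketched as a recursive estimate that gets tighter with every push. This is an illustration only: in Phys2Real the tactile learner is a trained RL policy, not a hand-written filter, and the names here are made up.

```python
import numpy as np

# Sketch only: each push yields a noisy observation of the center of
# mass, and a recursive Gaussian update makes the estimate more
# confident the longer the robot interacts with the object.

class OnlineComEstimator:
    def __init__(self, obs_std=0.03):
        self.mean = None       # no estimate before first contact
        self.var = np.inf      # infinite variance = "clueless"
        self.obs_var = obs_std ** 2

    def update(self, observed_com):
        observed_com = np.asarray(observed_com, dtype=float)
        if self.mean is None:
            self.mean, self.var = observed_com, self.obs_var
            return
        # Precision-weighted combination of current belief and new data.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var
                     + observed_com / self.obs_var) / precision
        self.var = 1.0 / precision

est = OnlineComEstimator()
for obs in [(0.05, 0.0), (0.04, 0.01), (0.06, -0.01)]:
    est.update(obs)
# est.var shrinks with every push: three observations leave the
# estimator three times as confident as one.
```

Notice the variance starts at infinity, which is exactly the "I haven't felt it yet, so I have no idea" state described in the next section.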
3. The "Uncertainty" Mixer (The Fusion)
Here is the secret sauce. The robot doesn't just pick one brain; it listens to both, but it weighs their opinions based on confidence.
- The Analogy: Imagine you are trying to find a lost item.
  - Brain A (Visual) says: "I think it's in the kitchen, but I'm not 100% sure."
  - Brain B (Touch) says: "I haven't felt it yet, so I have no idea where it is."
  - The Result: The robot listens mostly to Brain A because Brain B is clueless right now.
  - Later: After 10 seconds of pushing, Brain B says: "I definitely feel it's heavy on the left side!" Now, Brain B is very confident. The robot switches to listening mostly to Brain B.
The system constantly asks: "Who is more sure right now?" If the robot is just starting and hasn't touched the object, it trusts the Visual Expert. Once the robot starts interacting and gathering data, it trusts the Tactile Learner.
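This "who is more sure right now?" question has a classic mathematical form: a precision-weighted (inverse-variance) average, where each brain reports an estimate plus an uncertainty and the more certain one gets more say. The sketch below illustrates that general idea; it is not the paper's exact fusion rule, and all numbers are made up.

```python
# Sketch only: fuse two scalar estimates by weighting each with its
# precision (1 / variance). Whoever is more certain dominates.

def fuse(mean_a, var_a, mean_b, var_b):
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_b)
    return w_a * mean_a + (1.0 - w_a) * mean_b, w_a

# Before any contact: the tactile brain is clueless (huge variance),
# so the fused estimate is essentially the visual guess.
_, w_visual_early = fuse(0.02, 0.0001, 0.0, 1e6)   # ~1.0: trust vision

# After pushing: the tactile variance has shrunk below the visual one,
# so the weight shifts to the tactile estimate.
_, w_visual_late = fuse(0.02, 0.0001, 0.05, 0.00001)  # small: trust touch
```

The handoff from "trust the Visual Expert" to "trust the Tactile Learner" falls out of the formula automatically; nobody has to hard-code a switching time.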
Why This Matters (The Results)
The researchers tested this on two tasks:
- Pushing a T-shaped block: They added a heavy weight to the top or bottom to change how it moved.
  - Old Way (Domain Randomization): The robot trained across many randomized physics settings and learned one "average" behavior to cover them all. It failed often (only 23% success on the hard version).
  - Phys2Real: The robot used the AI's guess to start, then refined it as it pushed. It succeeded 57% of the time on the hard version and 100% on the easy version.
- Pushing a Hammer: They built the hammer's 3D model from scratch using photos.
  - Result: The robot finished the task 15% faster than the old methods because it didn't waste time guessing; it knew exactly how the hammer would behave.
The Big Picture
Phys2Real is like giving a robot a "gut feeling" (from the AI looking at the object) and then teaching it to "trust its hands" (from the physical interaction). By mixing these two sources of information and knowing when to trust which one, the robot can handle new, weird objects it has never seen before, making it much smarter and safer for real-world jobs.