D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping

Imagine you are trying to teach a robot to pick up a heavy jar of pickles. If you just tell the robot "grab it," it might squeeze too hard and crush the jar, or too lightly and drop it. The problem is, the robot doesn't know how heavy the jar is, how slippery the label is, or how the jar will wobble when it moves.

Usually, engineers try to teach robots in a video game simulation first. They build a digital world where the robot practices. But there's a catch: Simulations are often "fake." In the game, the jar might be made of "digital plastic" that weighs nothing, while in real life, it's heavy glass. When the robot tries to use its game skills in the real world, it fails because the physics don't match. This is called the "Sim-to-Real Gap."

This paper introduces D-REX, a clever new system that acts like a super-smart translator between the real world and the video game world. Here is how it works, broken down into simple steps:

1. The "Digital Twin" Builder (Real-to-Sim)

Imagine you take a video of a real object (like a cookie or a ketchup bottle) with your phone. D-REX uses this video to build a 3D digital copy (a "Digital Twin") of that object.

The Magic: It doesn't just capture how the object looks; it also captures its 3D shape, so the simulator has something solid to push and grab. It uses a technology called Gaussian Splatting (think of it as millions of tiny, glowing 3D pixels) to capture shape and texture in fine detail.
The Goal: To create a simulation that looks exactly like your kitchen table.

2. The "Weight Detective" (Mass Identification)

This is the paper's biggest breakthrough. In a normal video game, you have to guess how heavy an object is. D-REX doesn't guess; it solves a mystery.

How it works: The system watches a robot push the object in the real world. Then, it runs a simulation where the robot replays the exact same push on the digital twin.
The "Aha!" Moment: If the digital object slides too fast, the system knows, "Oops, I made it too light!" It automatically adjusts the weight in the simulation and tries again. It repeats this a few hundred times, which takes minutes, until the digital object moves almost exactly like the real one.
The Result: The robot now has a close estimate of the object's weight without ever needing a scale. It has "learned" the physics of the real world just by watching.

3. The "Human-to-Robot" Translator

Once the robot knows the object's weight and shape, it needs to learn how to grab it.

The Problem: Humans have soft, flexible hands. Robots have stiff metal fingers. You can't just copy a human's hand movements directly; the robot might break the object or drop it.
The Solution: D-REX watches videos of humans grabbing things. It then translates those human movements into robot commands.
The Secret Sauce: Because the robot now has a good estimate of the weight (from Step 2), it can adjust its grip strength.
- Analogy: Imagine holding a feather vs. holding a brick. You use a gentle touch for the feather and a firm grip for the brick. D-REX teaches the robot to do this automatically. If the robot thinks the object is light, it squeezes gently. If it realizes the object is heavy, it squeezes harder to stop it from slipping.

4. The "Real-to-Real" Loop

Finally, the robot takes what it learned in the simulation and goes back to the real world to do the job.

Because the simulation was so accurate (thanks to the weight detective), the robot's skills transfer far more reliably. It doesn't need to practice for weeks in the real world; in the paper's tests it grabbed objects successfully up to 95% of the time.

Why is this a big deal?

No More Guessing: Before, robots often failed because they didn't know if an object was heavy or light. D-REX figures it out automatically, in minutes.
Learning from Humans: It lets robots learn from YouTube-style videos of people doing tasks, rather than requiring engineers to manually program every single movement.
Safety: By knowing the weight, the robot is less likely to squeeze light items too hard or drop heavy ones.

In short: D-REX is like giving a robot a pair of "X-ray glasses" that let it see the invisible weight of objects, and a "universal translator" that turns human videos into robot skills. It bridges the gap between the fake world of simulations and the messy, heavy world of reality, making robots much better at picking up things without breaking them.

1. Problem Statement

The paper addresses the critical sim-to-real gap in robotics, specifically the challenge of transferring policies trained in simulation to the real world. A primary bottleneck is the lack of accurate physical parameter identification (e.g., object mass, friction) from visual observations alone.

The Core Issue: Standard simulation assumes perfect knowledge of physical parameters. In reality, estimated geometry and mass from visual data often deviate from ground truth, leading to unstable grasps (e.g., slippage on heavy objects or bouncing off light ones) when policies are deployed.
The Limitation of Existing Methods: Current approaches rely either on domain randomization (which is data-inefficient and struggles with out-of-distribution masses) or on system identification methods that are not differentiable or that require extensive manual tuning and ground-truth sensors (like torque sensors).

2. Methodology: D-REX Framework

The authors propose D-REX, a differentiable Real-to-Sim-to-Real engine that bridges the gap by learning physical parameters directly from visual and control data. The framework operates in four integrated stages:

A. Real-to-Sim Initialization (Visual & Geometric Reconstruction)

Input: RGB videos of the scene, the object, and human demonstrations.
Technique: The system uses Gaussian Splatting (specifically 3DGS for appearance and 2DGS with surface normal estimation for geometry) to reconstruct high-fidelity digital twins.
Output: A photorealistic rendering capability and a collision mesh ( $K$ ) compatible with physics engines. This creates the initial simulation environment in MJCF (MuJoCo) format.
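
As a rough illustration of what this stage hands to the physics engine, the sketch below assembles a minimal MJCF scene around a reconstructed mesh. The file name `object_mesh.stl`, the model, body, and geom names, the initial pose, and the placeholder mass are assumptions made for this example and are not taken from the paper; only the MJCF elements themselves (`asset`, `mesh`, `worldbody`, `body`, `freejoint`, `geom`) are standard.

```python
import xml.etree.ElementTree as ET


def build_mjcf(mesh_file: str, mass_kg: float) -> str:
    """Return a minimal MJCF document with one free-floating mesh object."""
    root = ET.Element("mujoco", model="digital_twin")  # model name is hypothetical

    # Register the reconstructed collision mesh as an asset.
    asset = ET.SubElement(root, "asset")
    ET.SubElement(asset, "mesh", name="object_mesh", file=mesh_file)

    # Place the object in the world as a body that can move freely.
    world = ET.SubElement(root, "worldbody")
    body = ET.SubElement(world, "body", name="object", pos="0 0 0.1")
    ET.SubElement(body, "freejoint")

    # The geom carries the mass; here it is only an initial guess.
    ET.SubElement(
        body, "geom", type="mesh", mesh="object_mesh", mass=f"{mass_kg:.4f}"
    )
    return ET.tostring(root, encoding="unicode")


mjcf_xml = build_mjcf("object_mesh.stl", mass_kg=0.25)
print(mjcf_xml)
```

The mass written into the geom is only a starting value; it is the quantity that the mass identification stage goes on to refine.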

B. Mass Identification via Differentiable Physics

Core Innovation: Instead of guessing mass, D-REX optimizes it.
Process:
1. The system executes identical robot control signals in both the real world and the simulation.
2. It tracks the object's 6-DoF pose in the real world using FoundationPose.
3. It simulates the same action in a differentiable physics engine (combining Brax/MJX for kinematics and GradSim for differentiable dynamics).
4. Optimization: The object mass ( $m$ ) is treated as a learnable parameter. The system minimizes the trajectory loss ( $L_{traj}$ ) between the real-world object trajectory and the simulated trajectory using backpropagation.
5. Dynamics Model: Uses a semi-implicit Euler integration scheme with a compliant contact model to ensure stability and accurate gradient flow through contact events.
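
The optimization loop in steps 1 to 4 can be sketched on a toy problem: a block pushed along a surface by a known force. This is a hedged illustration, not the authors' implementation. The friction coefficient, push force, time step, and function names are all assumptions. The derivative of position with respect to mass is written out by hand in place of the automatic differentiation a differentiable engine would provide, and a Gauss-Newton update is swapped in for a plain gradient step because it needs no learning-rate tuning on this problem.

```python
G = 9.81  # gravitational acceleration, m/s^2


def rollout(mass, forces, mu=0.3, dt=0.01):
    """Semi-implicit Euler rollout of a sliding block.

    Returns the positions and the sensitivity d(position)/d(mass) per step.
    Assumes every push exceeds friction, so the block keeps sliding forward.
    """
    x = v = 0.0
    dx_dm = dv_dm = 0.0
    xs, sens = [], []
    for f in forces:
        a = f / mass - mu * G       # push minus Coulomb friction, per unit mass
        da_dm = -f / mass ** 2      # hand-derived derivative of a w.r.t. mass
        v += a * dt                 # update velocity first ...
        dv_dm += da_dm * dt
        x += v * dt                 # ... then position with the NEW velocity
        dx_dm += dv_dm * dt
        xs.append(x)
        sens.append(dx_dm)
    return xs, sens


def identify_mass(real_xs, forces, mass_init=0.1, iters=20):
    """Fit the mass so the simulated trajectory matches the observed one."""
    mass = mass_init
    for _ in range(iters):
        sim_xs, sens = rollout(mass, forces)
        residuals = [s - r for s, r in zip(sim_xs, real_xs)]
        # Gauss-Newton step for a single parameter: -(J^T r) / (J^T J).
        num = sum(r * j for r, j in zip(residuals, sens))
        den = sum(j * j for j in sens)
        mass = max(mass - num / den, 1e-3)  # keep the mass positive
    return mass


forces = [5.0] * 100                     # same control signal in "real" and sim
real_xs, _ = rollout(0.4, forces)        # stand-in for a tracked real trajectory
estimated_mass = identify_mass(real_xs, forces)
print(f"estimated mass: {estimated_mass:.4f} kg")
```

In this toy setting the "real" trajectory is itself simulated, so the fit can recover the mass almost exactly. With tracked poses from a real camera, noise and unmodeled friction would leave a residual error, which is consistent with the percentage errors the paper reports rather than a perfect match.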

C. Transferring Human Demonstrations

Challenge: Humans and robots have different embodiments (hand structures).
Solution: The system uses models like HaMeR and MCC-HO to reconstruct 3D human hand-object interactions from videos. These are then retargeted to the robotic hand using Dex-Retargeting, converting human poses into executable robot joint trajectories ( $A_t$ ).
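
To give a feel for what retargeting involves, the sketch below fits the joint angles of a hypothetical planar two-joint robot finger so that its fingertip lands on a fingertip position taken from a human demonstration. It is a toy under stated assumptions and does not reproduce Dex-Retargeting's objective or API: the link lengths, target point, step size, and function names are all invented for illustration.

```python
import math

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical two-joint robot finger


def fingertip(q1, q2):
    """Forward kinematics: joint angles -> fingertip position in the plane."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def retarget(target, q_init=(0.0, 1.0), lr=0.1, iters=2000):
    """Gradient descent on 0.5 * ||fingertip(q) - target||^2."""
    q1, q2 = q_init
    tx, ty = target
    for _ in range(iters):
        x, y = fingertip(q1, q2)
        ex, ey = x - tx, y - ty
        s1, c1 = math.sin(q1), math.cos(q1)
        s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
        # Gradient = J^T e, with J the analytic fingertip Jacobian.
        g1 = (-L1 * s1 - L2 * s12) * ex + (L1 * c1 + L2 * c12) * ey
        g2 = (-L2 * s12) * ex + (L2 * c12) * ey
        q1 -= lr * g1
        q2 -= lr * g2
    return q1, q2


human_tip = (1.2, 0.8)  # placeholder for a reconstructed human fingertip position
q1, q2 = retarget(human_tip)
tip = fingertip(q1, q2)
tip_error = math.hypot(tip[0] - human_tip[0], tip[1] - human_tip[1])
print(f"joint angles: ({q1:.3f}, {q2:.3f}), fingertip error: {tip_error:.2e}")
```

A real retargeting pipeline would handle a full multi-finger hand, joint limits, and a sequence of frames; repeating a fit like this per frame is one way to picture how a joint trajectory such as $A_t$ could arise.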

D. Force-Aware Policy Learning

Policy Architecture: A neural network ( $\pi_\phi$ ) takes the reconstructed object mesh vertices and the identified mass ( $m$ ) as input.
Outputs: The policy predicts:
1. Grasp Position: Joint angles for the robotic hand.
2. Contact Constraints: Feasibility of the grasp.
3. Grasping Force: A force command explicitly conditioned on the identified mass ( $\hat{f} \propto m \cdot g$ ).
Training Strategy: A two-stage process involving supervised learning on human data followed by simulation-based refinement (RL-style) to adapt to the specific physics of the identified object.
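
The proportionality $\hat{f} \propto m \cdot g$ can be made concrete with a simple Coulomb friction argument: to hold an object against gravity, the friction available at the contacts must at least equal its weight. The sketch below is an assumption-laden illustration, not the paper's controller. The friction coefficient, number of contacts, safety factor, and function name are all hypothetical.

```python
G = 9.81  # gravitational acceleration, m/s^2


def min_grip_force(mass_kg, mu=0.5, n_contacts=2, safety=1.5):
    """Normal force per contact (N) needed to hold an object against gravity.

    Coulomb friction: n_contacts * mu * f_normal >= mass * g.
    A safety factor > 1 adds margin against slip.
    """
    if mass_kg <= 0 or mu <= 0 or n_contacts <= 0:
        raise ValueError("mass, friction coefficient, and contacts must be positive")
    return safety * mass_kg * G / (mu * n_contacts)


light = min_grip_force(0.05)   # roughly the lightest object in the evaluation
heavy = min_grip_force(0.70)   # roughly the heaviest object in the evaluation
print(f"light object: {light:.2f} N per contact")
print(f"heavy object: {heavy:.2f} N per contact")
```

Under this model the required force scales linearly with mass, so a fixed force tuned for a light object would fall well short on a heavy one. That matches the slip failures the summary attributes to baselines with fixed or weak force control.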

3. Key Contributions

Differentiable Real-to-Sim-to-Real Engine: A unified framework that enables end-to-end optimization of object mass directly from visual observations and robot control signals without requiring ground-truth mass or torque sensors.
Mass-Conditioned Force Control: A novel policy learning approach that integrates the identified mass into the control loop, allowing the robot to dynamically adjust grasp forces based on the object's physical properties.
Gaussian Splatting for Physics: The effective use of Gaussian Splat representations to generate both photorealistic renderings and collision meshes suitable for differentiable physics simulation.
Human-to-Robot Transfer: A pipeline that successfully translates human demonstration videos into simulation-executable robot trajectories, conditioned on the specific physical properties of the target object.

4. Experimental Results

The authors evaluated D-REX on objects with diverse geometries and masses (ranging from ~50 g to ~700 g).

Mass Identification Accuracy:
- Achieved low percentage errors (4.8% – 12.0%) across different object shapes.
- Successfully distinguished between objects with identical geometry but different internal densities (e.g., 3D printed objects with varying infill), with mass deviations under 13 grams.
- Ablation studies showed that Semi-Implicit Euler integration significantly outperformed Explicit Euler in stability and accuracy for mass identification.
Grasping Performance:
- Mass-Mismatch Sensitivity: Policies trained without accurate mass identification failed on objects with masses outside their training distribution (e.g., a policy trained on a light object failed on a heavy one due to insufficient force).
- Force-Aware Success: D-REX policies, conditioned on the identified mass, achieved high success rates (up to 95%) across a wide mass range.
- Comparison: D-REX significantly outperformed baselines like DexGraspNet 2.0 and Human2Sim2Robot, particularly on heavier objects where baselines suffered from slip due to fixed/weak force control.
Efficiency:
- Offline reconstruction takes ~30-35 minutes per object.
- Mass identification converges in ~200 epochs (5-20 minutes).
- Inference is fast (~0.5 seconds per grasp).
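
Returning to the integrator ablation listed under Mass Identification Accuracy, the difference between the two schemes can be illustrated on a stiff spring, a common stand-in for a compliant contact. This is a generic numerical demonstration rather than the paper's experiment; the stiffness, mass, time step, and step count are assumptions chosen to make the effect visible.

```python
def explicit_euler(x, v, w2, dt):
    """Both updates use the OLD state."""
    return x + v * dt, v - w2 * x * dt


def semi_implicit_euler(x, v, w2, dt):
    """Velocity is updated first; position then uses the NEW velocity."""
    v_new = v - w2 * x * dt
    return x + v_new * dt, v_new


def energy_ratio(step, k=1000.0, m=1.0, dt=0.01, n=1000):
    """Final over initial energy of a mass on a spring after n steps."""
    x, v = 1.0, 0.0
    e0 = 0.5 * m * v * v + 0.5 * k * x * x
    for _ in range(n):
        x, v = step(x, v, k / m, dt)
    return (0.5 * m * v * v + 0.5 * k * x * x) / e0


explicit_ratio = energy_ratio(explicit_euler)
semi_ratio = energy_ratio(semi_implicit_euler)
print(f"explicit Euler energy ratio:      {explicit_ratio:.3e}")
print(f"semi-implicit Euler energy ratio: {semi_ratio:.3f}")
```

With explicit Euler the energy of this undamped system grows at every step, so trajectories (and any gradients taken through them) blow up, while the semi-implicit scheme keeps the energy bounded at the same step size. That is one plausible reason an integrator choice can matter for fitting mass through contact-rich rollouts.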

5. Significance

D-REX represents a significant step toward robust, generalizable robotic manipulation in unstructured environments.

Data Efficiency: By leveraging human videos and differentiable physics, it reduces the reliance on massive datasets of robot-collected data or hand-engineered rewards.
Physical Grounding: It moves beyond purely visual imitation learning by explicitly modeling and learning the underlying physics (mass), which is crucial for tasks involving contact, friction, and gravity.
Scalability: The ability to automatically build high-fidelity digital twins with accurate physical parameters enables the deployment of dexterous grasping policies on new objects without retraining the entire system from scratch, simply by re-identifying the mass.

In summary, D-REX narrows the "sim-to-real" gap not just by randomizing parameters, but by identifying a key physical parameter (mass) of each object, enabling robots to grasp objects with appropriate force across the tested range of weights.