InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

InterReal is a unified physics-based imitation learning framework that enables humanoid robots to robustly learn and execute complex human-object interaction skills in real-world settings through a novel motion data augmentation scheme and an automatic reward learner.

Dayang Liang, Yuhang Lin, Xinzhe Liu, Jiyuan Shi, Yunlong Liu, Chenjia Bai

Published Tue, 10 Ma

Imagine teaching a robot to do chores, like picking up a heavy box or pushing a cart. In the past, robots were great at walking or dancing on their own, but the moment they had to touch something and move it, they would often drop it, knock it over, or get stuck. They lacked the "feel" for how objects behave in the real world.

This paper introduces InterReal, a new "training camp" for humanoid robots designed to master these tricky tasks. Think of it as a three-step recipe to turn a clumsy robot into a skilled mover.

1. The "What-If" Simulator (Motion Augmentation)

Imagine you are learning to catch a ball. If you only practice with the ball thrown from the exact same spot every time, you'll get good at that specific throw, but you'll fail if someone throws it slightly to the left.

The researchers realized that real-world robots fail because sensors get noisy, or the box isn't exactly where the robot expects it to be. To fix this, InterReal uses a technique called Motion Augmentation.

  • The Analogy: Think of a dance instructor teaching a student. Instead of just showing one perfect routine, the instructor says, "Okay, now imagine the music is slightly faster," or "Now, pretend the floor is slippery," or "Now, the partner is standing two inches to the left."
  • How it works: The system takes a perfect video of a human moving a box and mathematically "shuffles" the box's position slightly in thousands of different ways. It then forces the robot to figure out how to adjust its arm joints to still grab the box perfectly, even if the box moved. This teaches the robot to be flexible rather than rigid.
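The pose-shuffling idea can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the function and field names (`augment_object_pose`, `object_pos`) are hypothetical, and the real system would also re-solve the robot's arm joints for each perturbed box position, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_object_pose(ref_traj, num_variants=1000, pos_noise=0.05):
    """Create perturbed copies of one reference trajectory by jittering
    the object's position (illustrative names; retargeting step omitted)."""
    variants = []
    for _ in range(num_variants):
        # Same random offset (in meters, x/y/z) applied at every timestep.
        offset = rng.uniform(-pos_noise, pos_noise, size=3)
        traj = dict(ref_traj)
        traj["object_pos"] = ref_traj["object_pos"] + offset
        variants.append(traj)
    return variants

# Usage: one demonstration of a box trajectory over 100 timesteps
# becomes many slightly shifted "what-if" versions.
demo = {"object_pos": np.zeros((100, 3))}
augmented = augment_object_pose(demo, num_variants=5)
```

Training on thousands of these variants is what forces the policy to track where the box *actually* is instead of memorizing one fixed trajectory.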

2. The "Smart Coach" (Automatic Reward Learning)

In robot training, you have to give the robot points (rewards) for doing things right. Usually, humans have to manually decide: "Okay, if the robot holds the box, give 5 points. If it drops it, minus 10." This is like a teacher trying to grade a student by hand-cranking a calculator for every single test question. It's slow, and the teacher might get the balance wrong (e.g., caring too much about walking fast and not enough about holding the box).

InterReal introduces an Automatic Reward Learner, which acts like a Smart Coach.

  • The Analogy: Imagine a coach who watches the robot train and says, "Right now, the robot is wobbling, so let's focus the points on balance. But now that it's stable, let's focus the points on gripping the box tightly."
  • How it works: The system uses a "Meta-Policy" (a higher-level brain) that watches the robot's mistakes. If the robot is struggling to keep the box steady, the coach automatically shifts the reward system to prioritize stability. If the robot is stable but missing the box, the coach shifts the focus to accuracy. It constantly re-tunes the "grading rubric" in real-time to help the robot learn faster and better.
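A minimal sketch of the re-tuning idea, under loose assumptions: the paper uses a learned meta-policy, whereas this toy version just shifts weight toward whichever reward term currently has the largest error and renormalizes. All names here (`update_reward_weights`, the two-term reward) are illustrative.

```python
import numpy as np

def update_reward_weights(weights, errors, lr=0.1):
    """Shift reward weight toward the terms with the largest current
    tracking error, then renormalize so weights sum to 1.
    A hand-written stand-in for the paper's learned meta-policy."""
    weights = np.asarray(weights, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # Terms the robot is failing at get a bigger share of the reward.
    weights = weights + lr * errors / (errors.sum() + 1e-8)
    return weights / weights.sum()

def total_reward(term_values, weights):
    """Weighted sum of individual reward terms."""
    return float(np.dot(term_values, weights))

# Usage: two reward terms, [stability, grip accuracy].
w = np.array([0.5, 0.5])
# The robot is wobbling: stability error dominates, so its weight grows.
w = update_reward_weights(w, errors=[0.9, 0.1])
```

After the update, the stability term outweighs the grip term, so the next round of training "grades" balance more heavily, which is exactly the coach-shifting-the-rubric behavior described above.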

3. The "Real-World Test" (Deployment)

Most robot training happens in a video game world (simulation) where physics are perfect. The problem is, real life is messy.

InterReal was tested on a real robot called the Unitree G1.

  • The Scenario: The robot had to pick up a heavy box and walk while carrying it, then push a box forward while bending over.
  • The Result: While other robots (the "baselines") would often drop the box or fall over when the box moved slightly, InterReal adjusted on the fly. It used a camera system to see where the box actually was and tweaked its movements instantly, just like a human would.

The Big Picture

Think of InterReal as the difference between a parrot and a human.

  • Old Robots (Parrots): They memorized a specific script. If you changed the script (moved the box), they couldn't adapt.
  • InterReal (Human): It learned the principles of physics and interaction. It understands that "if the box moves, I need to move my hand." It learned to adapt to chaos, making it ready for real-world jobs like warehouse work or helping people at home.

In short: InterReal teaches robots to be adaptable by practicing with "what-if" scenarios and using a smart coach that knows exactly what to praise at every moment, resulting in a robot that can actually handle real-world messiness.