Imagine a humanoid robot as a clumsy toddler trying to learn how to navigate a chaotic playground. Traditionally, engineers taught these robots to be "perfectly polite"—always avoiding touching anything, like a person trying to walk through a crowded room without brushing against anyone. But in the real world, sometimes you need to touch things to survive: you lean on a wall to stop from falling, you catch a ball to keep it from hitting someone, or you duck under a low branch.
This paper introduces a new way to teach robots these "touchy-feely" skills, not by showing them videos of experts (demonstrations), but by letting them learn from a massive library of random, messy attempts.
Here is the breakdown of their invention, using some everyday analogies:
1. The Problem: The "Blindfolded" Robot
Old methods tried to teach robots using hand-written physics models (like working straight from a textbook). These failed because the real world is messy; a slight mismatch between the model and reality meant the robot would fall over.
Other methods used "Reinforcement Learning," where the robot tries things and gets a "good job!" or "ouch!" signal. But this is like teaching a dog a backflip by rewarding it with a treat only on the rare attempt it actually lands one. It's incredibly slow, expensive, and the dog only learns that one trick, not how to handle a whole new situation.
2. The Solution: The "Dreaming" Robot
The authors built a World Model. Think of this as the robot's "imagination" or "dreaming" capability.
- The Training: Instead of watching a master chef cook, they fed the robot a dataset of thousands of random, clumsy movements it made in a simulation. It's like giving the robot a library of "what happens if I jump, fall, or bump into things" without telling it which ones were good.
- The Compression: The robot doesn't memorize every pixel of a video (which is too much data). Instead, it learns to compress the world into "concepts" (latent space). It learns the essence of a wall, a ball, or a balance point, rather than just the colors and shapes.
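The compress-then-predict idea above can be sketched in a few lines of toy Python. Everything here is illustrative: the names (`encode`, `dynamics`, `LATENT_DIM`) and the crude block-averaging "encoder" are stand-ins for the learned networks the paper actually trains, not the authors' code.

```python
LATENT_DIM = 4

def encode(pixels):
    """Compress a high-dimensional observation into a short latent vector.
    Crude block-averaging stands in for a learned encoder here."""
    chunk = len(pixels) // LATENT_DIM
    return [sum(pixels[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(LATENT_DIM)]

def dynamics(z, action):
    """Predict the next latent state from the current one plus an action.
    A learned network would replace this hand-written linear update."""
    return [zi + 0.1 * action for zi in z]

obs = [float(i % 7) for i in range(64)]  # stand-in for camera pixels
z = encode(obs)                          # 64 numbers -> 4 "concepts"
z_next = dynamics(z, action=1.0)         # imagine one step ahead
print(len(obs), "->", len(z))            # 64 -> 4
```

The point of the compression is the last line: all subsequent "dreaming" happens on the 4-number latent vector, never on the raw 64 pixels, which is what makes imagining thousands of futures cheap.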
3. The Secret Sauce: The "Crystal Ball" (Surrogate Value Function)
This is the most clever part. Usually, when a robot plans a move, it asks, "What is the reward?" But in contact tasks, rewards are rare. You only get a "good job" if you successfully catch the ball or don't fall. That's like playing a video game where you only get points at the very end of the level. It's too hard to learn.
The authors gave the robot a Surrogate Value Function.
- The Analogy: Imagine you are playing a game of pool. You don't wait until the ball goes in the pocket to know if your shot was good. You have a "gut feeling" (a value function) that tells you, "If I hit the ball here, I'm 90% likely to sink it."
- The robot uses this "gut feeling" to guide its planning. It doesn't need to wait for the final result to know if a move is promising. It can simulate 1,000 different futures in its "dream" (latent space) in a split second and pick the one that feels most likely to succeed.
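The "gut feeling" mechanism can be sketched as follows. This is a minimal toy, not the paper's implementation: the hand-written `surrogate_value` (which just prefers latent states near zero, a stand-in for "balanced") and the linear `dynamics` are both assumptions for illustration.

```python
def surrogate_value(z):
    """The 'gut feeling': score how promising a latent state is,
    without waiting for a sparse end-of-episode reward.
    Toy version: closer to 'balanced' (zero) is better."""
    return -sum(abs(zi) for zi in z)

def dynamics(z, a):
    """Stand-in for the learned latent dynamics model."""
    return [zi + a for zi in z]

def pick_best_action(z, candidates, horizon=4):
    """Imagine each candidate action a few steps into the future
    and pick the one whose imagined endpoint 'feels' best."""
    best_a, best_v = None, float("-inf")
    for a in candidates:
        z_sim = z
        for _ in range(horizon):      # roll forward in the "dream"
            z_sim = dynamics(z_sim, a)
        v = surrogate_value(z_sim)    # judge the imagined endpoint
        if v > best_v:
            best_a, best_v = a, v
    return best_a

z0 = [0.9, 0.7]                       # a tilted, off-balance latent state
print(pick_best_action(z0, candidates=[-0.2, 0.0, 0.2]))  # -0.2
```

Notice that no reward is ever observed: the value function alone ranks the imagined futures, which is exactly what makes sparse-reward contact tasks tractable.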
4. The Planner: The "Rehearsal" (MPC)
Once the robot has its "gut feeling" and its "imagination," it uses a technique called Model Predictive Control (MPC).
- The Analogy: Think of a jazz musician improvising. They don't just play one note and hope for the best. They think ahead: "If I play this note, the next chord will be X. If I play that, it will be Y." They constantly rehearse the next few seconds of music in their head.
- The robot does the same. It looks at the current scene (via its camera), simulates the next 4 steps in its head, picks the best sequence of moves, executes the first step, and then immediately re-evaluates. It's a continuous loop of "Imagine -> Plan -> Act -> Re-imagine."
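The "Imagine -> Plan -> Act -> Re-imagine" loop can be sketched as a toy random-shooting planner, one common way to implement sampling-based MPC. All numbers and function names here are illustrative assumptions (a one-dimensional state standing in for the latent vector), not the paper's code.

```python
import random

random.seed(0)

def dynamics(x, a):
    """Stand-in for the learned latent dynamics model."""
    return x + a

def value(x):
    """Stand-in surrogate value: states near zero are 'safe'."""
    return -abs(x)

def plan(x, horizon=4, n_candidates=1000):
    """Imagine many random candidate action sequences, score where each
    one ends up, and return only the FIRST action of the best sequence."""
    best_seq, best_v = None, float("-inf")
    for _ in range(n_candidates):
        seq = [random.uniform(-0.5, 0.5) for _ in range(horizon)]
        x_sim = x
        for a in seq:                 # roll the candidate out in the "dream"
            x_sim = dynamics(x_sim, a)
        v = value(x_sim)
        if v > best_v:
            best_seq, best_v = seq, v
    return best_seq[0]

x = 3.0                               # start far from 'balanced'
for t in range(10):                   # the continuous MPC loop:
    a = plan(x)                       #   imagine + plan
    x = dynamics(x, a)                #   act on the first step, then re-plan
print(round(x, 2))                    # drifts toward 0 over the loop
```

Executing only the first action and then re-planning from the new observation is what makes the controller reactive: a sudden shove just changes `x`, and the very next loop iteration plans around it.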
5. The Results: Agile and Robust
They tested this on a real Unitree G1 humanoid robot. The results were impressive:
- The Wall Push: If someone pushes the robot, it instinctively leans its hands against the wall to steady itself, just like a human would.
- The Ball Block: If a ball is thrown at it, it doesn't just dodge; it actively blocks the ball with its hand to protect itself.
- The Low Arch: It learns to squat down to walk under a low arch without hitting its head.
Why is this a big deal?
- No Teachers Needed: They didn't need humans to record perfect videos of how to do these tasks. The robot learned from "random noise."
- One Brain, Many Skills: The same robot model learned to balance, block, and duck. It didn't need a separate brain for each task.
- Real-Time: It works fast enough to react to sudden changes in the real world.
Summary
In short, this paper teaches robots to be less like rigid machines and more like intuitive humans. Instead of following a strict rulebook, they learn to dream about what might happen, use their instincts (value function) to guess the best move, and rehearse the future constantly to stay balanced and safe in a messy, unpredictable world.