WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

WHOLE is a method that holistically reconstructs hand and object motion in world space from challenging egocentric videos. It leverages a learned generative prior to jointly reason about hand-object interactions, achieving state-of-the-art performance by handling occlusions and keeping hand-object relations consistent.

Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

Published 2026-02-26

The Big Problem: The "Blindfolded Magician"

Imagine you are wearing a GoPro camera on your head. You walk around your kitchen, grab a coffee mug, put it down, pick up a banana, and walk out the door.

To a computer trying to understand this video, it's a nightmare.

  1. The Camera is Moving: The world is spinning because your head is moving, not just the objects.
  2. The "Blind Spots": Your hands often block the view of the objects. Sometimes the object disappears behind your back or leaves the room entirely.
  3. The Disconnect: Most current AI programs try to guess where your hands are or where the object is, but they do it separately. It's like trying to solve a puzzle by looking at the edge pieces and the center pieces in different rooms and then hoping they fit together. Often, the AI thinks your hand is floating in mid-air or that the coffee mug is passing through your hand like a ghost.

The Solution: WHOLE (The "Intuitive Choreographer")

The authors introduce WHOLE (World-Grounded Hand-Object Lifted from Egocentric Videos). Think of WHOLE not as a calculator, but as an intuitive choreographer who has watched millions of videos of people interacting with objects.

Instead of calculating every movement mathematically from scratch, WHOLE uses a "generative prior."

  • The Analogy: Imagine you are trying to guess the ending of a movie scene where a person picks up a cup. Even if the camera is shaky or the cup is hidden for a second, you (the human) know the cup must be in the hand, not floating three feet away. You have an internal "movie script" of how physics and human hands work.
  • WHOLE's Superpower: WHOLE has learned this "movie script" (a generative prior) specifically for hands and objects. It knows that if a hand is near a box, the box usually moves with the hand. If the hand lets go, the box stops or falls. It understands the relationship between the two, not just their individual positions (see the sketch just below for how such a joint representation might look).
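
To make this concrete, here is a minimal sketch of the kind of joint hand-object trajectory such a prior could be trained on. The exact layout below (48 hand-pose values, world-space translations, axis-angle object rotation) is an assumption for illustration, not the paper's actual parameterization:

```python
import numpy as np

# Hypothetical per-frame layout for a joint hand-object trajectory.
# The key idea: hand and object live in ONE vector per frame, so a model
# trained on such trajectories learns their coupling, not two separate motions.
def make_joint_state(num_frames: int) -> np.ndarray:
    hand_pose = np.zeros((num_frames, 48))   # hand articulation (axis-angle)
    hand_trans = np.zeros((num_frames, 3))   # hand root position, world frame
    obj_rot = np.zeros((num_frames, 3))      # object orientation (axis-angle)
    obj_trans = np.zeros((num_frames, 3))    # object position, world frame
    return np.concatenate([hand_pose, hand_trans, obj_rot, obj_trans], axis=1)

# A trajectory where the object drifts away from a grasping hand now simply
# looks "unlikely" to the prior, the same way it would to a human viewer.
trajectory = make_joint_state(num_frames=120)  # 120 frames x 57 dims
```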

How It Works: The "Guided Improv"

WHOLE doesn't just guess; it performs a "guided improv" session (sketched in code after the steps below).

  1. The Rough Draft: It starts with a basic, slightly messy guess of where the hands and objects might be (like a rough sketch).
  2. The "Director's Notes" (Guidance): It looks at the actual video you gave it.
    • Visual Cues: "Hey, the video shows the hand is covering the box right now."
    • Contact Cues: It uses a smart AI assistant (a Vision-Language Model) to look at the video and say, "I think the fingers are touching the handle."
  3. The Refinement: WHOLE takes its rough draft and tweaks it to match the director's notes, while strictly obeying the laws of physics it learned earlier. It ensures the hand doesn't clip through the table and the object doesn't teleport.
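
Here is a hedged Python sketch of what such a guided refinement loop can look like, in the style of guidance-based diffusion sampling. Everything named here (`prior_denoiser`, the loss functions, the step sizes) is illustrative, not the paper's actual code:

```python
import torch

def guided_refinement(prior_denoiser, x_init, guidance_losses,
                      num_steps=50, step_size=0.01):
    """Alternate between the learned prior and the 'director's notes'.

    prior_denoiser(x, t): hypothetical model that pulls a noisy hand-object
        trajectory toward plausible motion (the rough draft, step 1).
    guidance_losses: differentiable cues from the video, e.g. 2D reprojection
        error and VLM-predicted contact constraints (step 2).
    """
    x = x_init.clone()
    for t in reversed(range(num_steps)):
        # Prior step: make the draft look like realistic hand-object motion.
        x = prior_denoiser(x, t)

        # Guidance step: nudge the draft toward what the video actually shows.
        x = x.detach().requires_grad_(True)
        total = sum(loss_fn(x) for loss_fn in guidance_losses)
        grad = torch.autograd.grad(total, x)[0]
        x = (x - step_size * grad).detach()
    return x

# Toy demo with stand-in components (both made up for illustration).
dummy_denoiser = lambda x, t: 0.95 * x             # shrink toward the mean
contact_loss = lambda x: (x[..., 48:51] - x[..., 54:57]).pow(2).sum()
draft = torch.randn(1, 120, 57)                    # (batch, frames, dims)
refined = guided_refinement(dummy_denoiser, draft, [contact_loss])
print(refined.shape)                               # torch.Size([1, 120, 57])
```

This alternation is also why occlusions don't break things: when the video offers no cue (the object is hidden), the guidance term goes quiet and the prior carries the motion on its own.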

Why It's a Game Changer

Previous methods were like two separate actors trying to memorize their lines without talking to each other.

  • Old Way: Actor A (Hand) says, "I'm here!" Actor B (Object) says, "I'm over there!" They don't realize they are holding hands, so the video looks weird and glitchy.
  • WHOLE: The director (WHOLE) tells them, "You are holding the box. Move together." The result is a smooth, realistic 3D movie where the hand and object move as a single, logical unit, even when the camera is shaky or the object disappears for a moment.

The "Magic Trick" of the Future

The paper also shows a cool application: Hand-Guided Planning.
Because WHOLE understands the "rules" of how hands and objects interact, you can give it a rough idea of a hand movement (e.g., "reach for the shelf") and a contact instruction ("pick up the can"), and it can invent a realistic 3D animation of the whole process.
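
Under the same assumptions as the sketches above, hand-guided planning could look like inpainting with that prior: pin the few values the user specifies and let the model fill in the rest. Again, every name and index here is illustrative, not the paper's API:

```python
import torch

def plan_from_hand_hint(prior_denoiser, hand_waypoints, contact_frames,
                        num_frames=120, dims=57, num_steps=50):
    """Generate a full hand-object motion from sparse user hints.

    hand_waypoints: {frame_index: world-space hand position}, the rough
        path ("reach for the shelf").
    contact_frames: frames where hand and object should touch
        ("pick up the can").
    """
    x = torch.randn(1, num_frames, dims)  # start from pure noise
    for t in reversed(range(num_steps)):
        x = prior_denoiser(x, t)
        # Inpainting-style conditioning: pin the channels the user specified,
        # then let the prior invent everything else (finger articulation,
        # object motion, the frames in between).
        for frame, pos in hand_waypoints.items():
            x[0, frame, 48:51] = pos                 # hand root slots
        for frame in contact_frames:
            x[0, frame, 54:57] = x[0, frame, 48:51]  # object rides with hand
    return x

# Example hints: reach toward a point, hold contact over frames 60-90.
hints = {0: torch.zeros(3), 60: torch.tensor([0.3, 1.2, 0.5])}
motion = plan_from_hand_hint(lambda x, t: 0.95 * x, hints, range(60, 91))
```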

This is huge for robots. Instead of programming a robot with thousands of rigid rules, we can give it a "vibe" (a rough hand path), and WHOLE can fill in the realistic details of how the robot should actually grab and move the object.

Summary

WHOLE is an AI that stops looking at hands and objects as separate things. Instead, it treats them as a team. By learning the "dance" of human interaction, it can reconstruct 3D videos from shaky, first-person camera footage with incredible accuracy, filling in the blanks when things go out of sight or get blocked from view. It turns a chaotic, blurry video into a clear, physically possible 3D story.
