WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

WHOLE is a method that holistically reconstructs hand and object motion in world space from challenging egocentric videos. It leverages a learned generative prior to jointly reason about hand-object interactions, achieving state-of-the-art performance by handling occlusions and keeping hand-object relations consistent.

Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

Published 2026-02-26

The Big Problem: The "Blindfolded Magician"

Imagine you are wearing a GoPro camera on your head. You walk around your kitchen, grab a coffee mug, put it down, pick up a banana, and walk out the door.

To a computer trying to understand this video, it's a nightmare.

  1. The Camera is Moving: The world is spinning because your head is moving, not just the objects.
  2. The "Blind Spots": Your hands often block the view of the objects. Sometimes the object disappears behind your back or leaves the room entirely.
  3. The Disconnect: Most current AI programs try to guess where your hands are or where the object is, but they do it separately. It's like trying to solve a puzzle by looking at the edge pieces and the center pieces in different rooms and then hoping they fit together. Often, the AI thinks your hand is floating in mid-air or that the coffee mug is passing through your hand like a ghost.

The Solution: WHOLE (The "Intuitive Choreographer")

The authors introduce WHOLE (World-Grounded Hand-Object Lifted from Egocentric Videos). Think of WHOLE not as a calculator, but as an intuitive choreographer who has watched millions of videos of people interacting with objects.

Instead of calculating every movement mathematically from scratch, WHOLE uses a "generative prior."

  • The Analogy: Imagine you are trying to guess the ending of a movie scene where a person picks up a cup. Even if the camera is shaky or the cup is hidden for a second, you (the human) know the cup must be in the hand, not floating three feet away. You have an internal "movie script" of how physics and human hands work.
  • WHOLE's Superpower: WHOLE has learned this "movie script" (a generative prior) specifically for hands and objects. It knows that if a hand is near a box, the box usually moves with the hand. If the hand lets go, the box stops or falls. It understands the relationship between the two, not just their individual positions (see the sketch just below for how such a joint representation might look).
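
To make this concrete, here is a minimal sketch of the kind of joint hand-object trajectory such a prior could be trained on. The exact layout below (48 hand-pose values, world-space translations, axis-angle object rotation) is an assumption for illustration, not the paper's actual parameterization:

```python
import numpy as np

# Hypothetical per-frame layout for a joint hand-object trajectory.
# The key idea: hand and object live in ONE vector per frame, so a model
# trained on such trajectories learns their coupling, not two separate motions.
def make_joint_state(num_frames: int) -> np.ndarray:
    hand_pose = np.zeros((num_frames, 48))   # hand articulation (axis-angle)
    hand_trans = np.zeros((num_frames, 3))   # hand root position, world frame
    obj_rot = np.zeros((num_frames, 3))      # object orientation (axis-angle)
    obj_trans = np.zeros((num_frames, 3))    # object position, world frame
    return np.concatenate([hand_pose, hand_trans, obj_rot, obj_trans], axis=1)

# A trajectory where the object drifts away from a grasping hand now simply
# looks "unlikely" to the prior, the same way it would to a human viewer.
trajectory = make_joint_state(num_frames=120)  # 120 frames x 57 dims
```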

How It Works: The "Guided Improv"

WHOLE doesn't just guess; it performs a "guided improv" session (sketched in code after the steps below).

  1. The Rough Draft: It starts with a basic, slightly messy guess of where the hands and objects might be (like a rough sketch).
  2. The "Director's Notes" (Guidance): It looks at the actual video you gave it.
    • Visual Cues: "Hey, the video shows the hand is covering the box right now."
    • Contact Cues: It uses a smart AI assistant (a Vision-Language Model) to look at the video and say, "I think the fingers are touching the handle."
  3. The Refinement: WHOLE takes its rough draft and tweaks it to match the director's notes, while strictly obeying the laws of physics it learned earlier. It ensures the hand doesn't clip through the table and the object doesn't teleport.
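
Here is a hedged Python sketch of what such a guided refinement loop can look like, in the style of guidance-based diffusion sampling. Everything named here (`prior_denoiser`, the loss functions, the step sizes) is illustrative, not the paper's actual code:

```python
import torch

def guided_refinement(prior_denoiser, x_init, guidance_losses,
                      num_steps=50, step_size=0.01):
    """Alternate between the learned prior and the 'director's notes'.

    prior_denoiser(x, t): hypothetical model that pulls a noisy hand-object
        trajectory toward plausible motion (the rough draft, step 1).
    guidance_losses: differentiable cues from the video, e.g. 2D reprojection
        error and VLM-predicted contact constraints (step 2).
    """
    x = x_init.clone()
    for t in reversed(range(num_steps)):
        # Prior step: make the draft look like realistic hand-object motion.
        x = prior_denoiser(x, t)

        # Guidance step: nudge the draft toward what the video actually shows.
        x = x.detach().requires_grad_(True)
        total = sum(loss_fn(x) for loss_fn in guidance_losses)
        grad = torch.autograd.grad(total, x)[0]
        x = (x - step_size * grad).detach()
    return x

# Toy demo with stand-in components (both made up for illustration).
dummy_denoiser = lambda x, t: 0.95 * x             # shrink toward the mean
contact_loss = lambda x: (x[..., 48:51] - x[..., 54:57]).pow(2).sum()
draft = torch.randn(1, 120, 57)                    # (batch, frames, dims)
refined = guided_refinement(dummy_denoiser, draft, [contact_loss])
print(refined.shape)                               # torch.Size([1, 120, 57])
```

This alternation is also why occlusions don't break things: when the video offers no cue (the object is hidden), the guidance term goes quiet and the prior carries the motion on its own.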

Why It's a Game Changer

Previous methods were like two separate actors trying to memorize their lines without talking to each other.

  • Old Way: Actor A (Hand) says, "I'm here!" Actor B (Object) says, "I'm over there!" They don't realize they are holding hands, so the video looks weird and glitchy.
  • WHOLE: The director (WHOLE) tells them, "You are holding the box. Move together." The result is a smooth, realistic 3D movie where the hand and object move as a single, logical unit, even when the camera is shaky or the object disappears for a moment.

The "Magic Trick" of the Future

The paper also shows a cool application: Hand-Guided Planning.
Because WHOLE understands the "rules" of how hands and objects interact, you can give it a rough idea of a hand movement (e.g., "reach for the shelf") and a contact instruction ("pick up the can"), and it can invent a realistic 3D animation of the whole process.
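
Under the same assumptions as the sketches above, hand-guided planning could look like inpainting with that prior: pin the few values the user specifies and let the model fill in the rest. Again, every name and index here is illustrative, not the paper's API:

```python
import torch

def plan_from_hand_hint(prior_denoiser, hand_waypoints, contact_frames,
                        num_frames=120, dims=57, num_steps=50):
    """Generate a full hand-object motion from sparse user hints.

    hand_waypoints: {frame_index: world-space hand position}, the rough
        path ("reach for the shelf").
    contact_frames: frames where hand and object should touch
        ("pick up the can").
    """
    x = torch.randn(1, num_frames, dims)  # start from pure noise
    for t in reversed(range(num_steps)):
        x = prior_denoiser(x, t)
        # Inpainting-style conditioning: pin the channels the user specified,
        # then let the prior invent everything else (finger articulation,
        # object motion, the frames in between).
        for frame, pos in hand_waypoints.items():
            x[0, frame, 48:51] = pos                 # hand root slots
        for frame in contact_frames:
            x[0, frame, 54:57] = x[0, frame, 48:51]  # object rides with hand
    return x

# Example hints: reach toward a point, hold contact over frames 60-90.
hints = {0: torch.zeros(3), 60: torch.tensor([0.3, 1.2, 0.5])}
motion = plan_from_hand_hint(lambda x, t: 0.95 * x, hints, range(60, 91))
```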

This is huge for robots. Instead of programming a robot with thousands of rigid rules, we can give it a "vibe" (a rough hand path), and WHOLE can fill in the realistic details of how the robot should actually grab and move the object.

Summary

WHOLE is an AI that stops looking at hands and objects as separate things. Instead, it treats them as a team. By learning the "dance" of human interaction, it can reconstruct 3D videos from shaky, first-person camera footage with incredible accuracy, filling in the blanks when things go out of sight or get blocked from view. It turns a chaotic, blurry video into a clear, physically possible 3D story.
