Imagine you are wearing a pair of magic glasses. In the world of today's Virtual Reality (VR), if you want to see a dragon, a wizard has to spend weeks building a 3D model of that dragon, rigging its bones, and programming how it moves. It's like building a real-life puppet show from scratch every time you want a new show.
This paper introduces a new concept called "Generated Reality." Instead of building puppets, imagine you have a super-smart, instant storyteller inside your glasses. You just wave your hand, turn your head, or say "I want to see a dragon," and the storyteller instantly paints a brand-new, photorealistic video world around you, frame by frame.
Here is how they made this magic work, broken down into simple parts:
1. The Problem: The "Remote Control" Limitation
Current AI video generators are like a TV remote that only has "Up," "Down," "Left," and "Right" buttons. You can tell the AI to "move the camera left," but you can't tell it, "Pick up that cup with your thumb and index finger."
- The Analogy: Trying to play a complex video game with a keyboard that only has the spacebar. You can jump, but you can't shoot, dodge, or interact with specific objects.
- The Result: You can't really do things in these virtual worlds; you can only watch them happen.
2. The Solution: Teaching the AI to "Feel" Your Hands
The researchers wanted to give the AI a pair of hands. They figured out how to feed the AI two specific things in real time:
- Where your head is looking (Camera control).
- Exactly how your fingers are bending (Joint-level hand control).
They didn't just say "move hand." They tracked every single joint in your fingers (20 of them per hand!) and your wrist.
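To make that concrete, here is a minimal sketch of what one frame of control input might look like. The names, shapes, and units here are illustrative assumptions, not the paper's actual interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameControls:
    """One frame of control input. Names and shapes are illustrative,
    not the paper's actual interface."""
    head_pose: np.ndarray   # 4x4 camera-to-world matrix: where you're looking
    left_hand: np.ndarray   # 21x3 array: wrist + 20 finger joints, xyz positions
    right_hand: np.ndarray  # 21x3 array, same layout

def idle_controls() -> FrameControls:
    # Placeholder frame, e.g. before hand tracking locks on.
    return FrameControls(
        head_pose=np.eye(4),
        left_hand=np.zeros((21, 3)),
        right_hand=np.zeros((21, 3)),
    )
```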
3. The Secret Sauce: The "Hybrid" Recipe
The team tried many ways to teach the AI how to draw hands based on your movements. They found that the best method was a hybrid approach, like using two different maps to find your way:
- Map A (The 2D Skeleton): A simple stick-figure drawing of your hand overlaid on the screen. This tells the AI where the hand is on the screen.
- Map B (The 3D Data): The actual mathematical numbers describing your finger angles. This tells the AI how deep your hand is and how your fingers are curled.
The Analogy: Imagine trying to draw a person holding a ball.
- If you only give the artist a 2D photo, they might draw the hand behind the ball or inside it because they can't see the depth.
- If you only give them the math numbers, they know the depth but might draw the hand in the wrong spot on the paper.
- The Hybrid: You give them both. The artist knows exactly where the hand is on the paper and exactly how it's holding the ball in 3D space. This stopped the AI from making weird, glitchy hands that disappear or float in impossible ways.
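Here is a rough code sketch of the hybrid recipe, assuming a simple pinhole camera. The bone list, joint indices, camera intrinsics, and image size are all made up for illustration:

```python
import numpy as np
import cv2

# Thumb chain only, to keep the example short; a full hand has one
# chain per finger. Joint indices are made up for illustration.
HAND_BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]

def hybrid_condition(joints_3d, fx=500.0, fy=500.0, cx=128.0, cy=128.0):
    """Turn one hand's 3D joints (21x3, camera space) into both maps."""
    # Map A: project each joint through a pinhole camera and draw sticks.
    z = joints_3d[:, 2:3].clip(min=1e-3)               # guard divide-by-zero
    uv = joints_3d[:, :2] / z * [fx, fy] + [cx, cy]    # pixel coordinates
    skeleton = np.zeros((256, 256, 3), dtype=np.uint8)
    for a, b in HAND_BONES:
        cv2.line(skeleton, tuple(map(int, uv[a])), tuple(map(int, uv[b])),
                 color=(255, 255, 255), thickness=2)

    # Map B: the raw 3D numbers, flattened into one conditioning vector.
    return skeleton, joints_3d.flatten()
```

Feeding both signals into the video model means each one covers the other's blind spot: the skeleton pins the hand to the right pixels, while the 3D numbers keep depth and finger curl honest.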
4. The "Instant Movie" Machine
To make this fast enough for VR (so you don't get dizzy), they took a huge, slow AI model (the "Teacher") and distilled it into a smaller, faster "Student" model.
- The Analogy: Think of the Teacher as a master chef who takes 20 minutes to cook a perfect meal. The Student is a sous-chef who learned the recipe and can now whip up a delicious version in 12 seconds.
- The Speed: They achieved 11 frames per second with an end-to-end delay of only 1.4 seconds. A fresh frame arrives roughly every 90 milliseconds, so when you wave your hand, the world catches up within about a second and a half.
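Below is a toy sketch of that teacher-student distillation idea. Both "chefs" here are tiny stand-in networks so the loop actually runs; real systems distill a large video diffusion model, and the 0.1 update rule is a pretend denoising step, not the actual method:

```python
import torch
import torch.nn as nn

# Stand-in networks: the real teacher and student are large video models.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
student = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def teacher_many_steps(x, steps=20):
    """The master chef: refine the sample slowly, over many passes."""
    with torch.no_grad():
        for _ in range(steps):
            x = x - 0.1 * teacher(x)   # pretend denoising update
    return x

for _ in range(200):
    noise = torch.randn(32, 16)
    target = teacher_many_steps(noise)    # slow, multi-step result
    pred = noise - 0.1 * student(noise)   # student must match it in ONE step
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```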
5. The Proof: Did It Work?
They put this system in a VR headset and asked people to do three tasks:
- Push a green button.
- Open a jar.
- Turn a steering wheel.
The Results:
- Without Hand Control (The Baseline): The AI tried to guess what the user wanted based on text prompts. It failed almost 100% of the time. It was like trying to open a jar by yelling "Open!" at it.
- With Hand Control (The New System): The AI watched the user's actual hand movements. The success rate jumped to 71%.
- The Feeling: Users reported feeling like they had real control over the world, rather than just being a passenger watching a movie.
Why Does This Matter?
This is a huge step toward "Zero-Shot" simulation: virtual worlds that exist the moment you ask for them.
- Before: If you wanted to practice surgery or fix a car engine in VR, you needed a team of engineers to build a perfect 3D simulation of that specific surgery or engine.
- Now: With "Generated Reality," you can just say, "Show me a car engine," and the AI generates it instantly. You can practice opening a jar or turning a wheel, and the AI will generate the jar and the wheel reacting to your actual hands in real-time.
In a nutshell: This paper teaches AI to stop just "watching" you and start "listening" to your hands, turning static virtual worlds into interactive playgrounds that build themselves as you move.