DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

DuoMo is a state-of-the-art generative method that reconstructs globally consistent, world-space human motion from unconstrained, noisy videos. It employs a dual diffusion framework that first estimates camera-space motion and then refines it into a coherent world-space trajectory, without relying on a parametric body model.

Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer

Published 2026-03-04

Imagine you are watching a shaky, chaotic home video of a friend dancing in a park. The camera is moving wildly, the friend sometimes walks behind a tree (getting hidden), and the lighting changes. Your brain, however, is a magic machine: it instantly figures out exactly where your friend is in the real world, how they are moving, and even what they are doing while hidden behind the tree.

DuoMo is an artificial intelligence designed to do exactly what your brain does; under the hood, though, it has to solve a very tricky math problem.

Here is the story of how DuoMo works, explained without the heavy jargon.

The Problem: The "Shaky Camera" Dilemma

Most AI that tries to track people in videos gets confused by two things:

  1. The Camera is Moving: Is the person walking forward, or is the camera just zooming in?
  2. The "World" is Missing: If the person disappears behind a tree, the AI usually panics and forgets where they were supposed to go.

Old methods tried to solve this in one giant leap: "Guess the person's position in the real world directly from the video." But this is like trying to bake a perfect cake by throwing all the ingredients into a bowl at once and hoping for the best. It often results in a mess where the person floats in the air or walks through walls.

The DuoMo Solution: The Two-Step Dance

Instead of one giant guess, DuoMo uses two specialized experts working in a team. Think of it like a Director and a Stunt Coordinator.

Step 1: The Camera-Space Model (The "Stunt Coordinator")

First, DuoMo looks at the video and asks: "What is happening relative to the camera?"

  • The Analogy: Imagine you are sitting in a car watching a runner. The runner looks like they are moving left and right, up and down, based on how your car is swerving.
  • What it does: This model is great at looking at the raw video and saying, "Okay, the runner's arm is moving this way relative to the lens." It doesn't care about the real world yet; it just cares about what it sees on the screen.
  • The Catch: Because the camera is shaky, this view is "noisy" and distorted. If the camera moves, the runner looks like they are teleporting.

Step 2: The World-Space Model (The "Director")

Next, DuoMo takes that shaky, camera-relative view and asks: "Okay, but where are they actually standing in the park?"

  • The Analogy: The Director looks at the Stunt Coordinator's notes and says, "Wait, the camera was actually spinning left, so the runner didn't teleport; they just walked straight."
  • What it does: This model takes the "noisy" guess from Step 1 and cleans it up. It uses the rules of physics and common sense to say, "Humans don't float, and they don't walk through trees."
  • The Magic: If the runner disappears behind a tree, the Director doesn't panic. It says, "I know they were walking left, so they must still be walking left behind that tree." It fills in the missing gaps using its knowledge of how humans move.
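To make the two-step idea concrete, here is a toy Python sketch. Everything in it is invented for illustration (the function names, the simple per-frame setup, and treating the camera motion as known); the actual DuoMo models are diffusion networks, not these formulas. The point is just the division of labor: Stage 1 describes the person relative to the camera, and Stage 2 undoes the camera's own motion to recover a stable world-space trajectory.

```python
import numpy as np

def camera_space_estimate(video_frames):
    # Toy stand-in for Stage 1: read out the person's 3D position
    # relative to the camera for each frame.
    return np.array([f["person_in_cam"] for f in video_frames])

def world_space_refine(cam_positions, cam_rotations, cam_translations):
    # Toy stand-in for Stage 2: undo the camera's own motion so the
    # person's trajectory lives in one fixed world frame.
    world = [R @ p + t for p, R, t in
             zip(cam_positions, cam_rotations, cam_translations)]
    return np.array(world)

# A "shaky camera" that spins while the person walks straight along x.
frames, rotations, translations = [], [], []
for i in range(5):
    true_world_pos = np.array([float(i), 0.0, 0.0])  # walking straight
    angle = 0.3 * i                                  # camera pans a bit
    R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
    t = np.array([0.0, 0.0, 2.0])                    # camera offset
    # What Stage 1 sees: the person expressed in the camera's frame.
    # In this frame the straight walk looks like a curved "teleport".
    frames.append({"person_in_cam": R.T @ (true_world_pos - t)})
    rotations.append(R)
    translations.append(t)

cam = camera_space_estimate(frames)
world = world_space_refine(cam, rotations, translations)
# world recovers the straight-line walk: (0,0,0), (1,0,0), ..., (4,0,0)
```

In this toy, the camera-frame positions wobble with the pan angle, yet the world-space refinement recovers the plain straight walk, which is exactly the "the runner didn't teleport; the camera spun" correction the Director makes.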

The Secret Sauce: "Guided Sampling"

Sometimes, even the Director gets a little lost over a long time (like if the video is 20 seconds long). The AI might slowly drift, making the person end up in the wrong spot.

To fix this, DuoMo uses Guided Sampling.

  • The Analogy: Imagine the Director is walking a dog on a leash. The dog (the AI's guess) might wander off a bit, but every few seconds, the Director pulls the leash back to check the map (the original video).
  • How it works: The AI constantly checks its own work against the original video. "Wait, I said the person is here, but the video shows their feet are actually there. Let me adjust my guess." This keeps the person grounded in reality, preventing them from drifting off into space.
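The "leash" can be sketched with a toy 1-D trajectory. This is not DuoMo's actual sampler: `denoise_step` below is a crude stand-in for one diffusion denoising step (it just smooths the trajectory), and the guidance term simply nudges the estimate back toward the observed evidence on every iteration. The contrast with an unguided run shows why the pull matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: a person walking steadily. "observed" plays the role of
# the noisy evidence extracted from the video.
true_traj = np.linspace(0.0, 10.0, 50)
observed = true_traj + rng.normal(0.0, 0.1, size=50)

def denoise_step(estimate):
    # Toy stand-in for one denoising step: smooth the trajectory.
    # On its own, repeated smoothing drifts away from the evidence.
    return np.convolve(estimate, np.ones(5) / 5, mode="same")

def guided_sampling(observed, steps=30, guidance=0.5):
    estimate = rng.normal(0.0, 5.0, size=observed.shape)  # start from noise
    for _ in range(steps):
        estimate = denoise_step(estimate)
        # The leash: pull the estimate back toward what the video shows.
        estimate = estimate + guidance * (observed - estimate)
    return estimate

guided = guided_sampling(observed)

# Same denoiser, no leash: the estimate drifts far from the evidence.
unguided = rng.normal(0.0, 5.0, size=observed.shape)
for _ in range(30):
    unguided = denoise_step(unguided)

err_guided = np.mean(np.abs(guided - true_traj))
err_unguided = np.mean(np.abs(unguided - true_traj))
# err_guided is much smaller: the periodic pull toward the observations
# keeps the long trajectory from drifting off.
```

The design choice mirrors the article's analogy: the generative model is free to wander between checks, but the observation term keeps snapping it back to the video, so errors cannot accumulate over a long clip.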

Why This is a Big Deal

  1. No "Body Suit" Needed: Most AI tries to fit a pre-made 3D body model (like a digital mannequin) onto the video. DuoMo is different; it builds the 3D shape vertex by vertex (like sculpting clay). This means it can handle weird poses or body shapes that standard "mannequins" can't fit.
  2. It Handles the Wild: It works great on shaky, amateur videos from the real world, not just perfect studio recordings.
  3. It Fills in the Blanks: If a person is hidden for a long time, DuoMo can "hallucinate" (predict) their movement in a way that makes physical sense, rather than just freezing them.
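Point 1 can be illustrated with a toy comparison. The two-knob "mannequin" and the four-vertex mesh below are invented for illustration (real parametric bodies like SMPL have far more parameters, and DuoMo's mesh is much denser); the sketch only shows why a handful of shape knobs cannot reach every target, while free per-vertex offsets can.

```python
import numpy as np

# A tiny "mannequin": a parametric model with only 2 shape knobs
# (uniform scale and a height stretch) applied to a fixed template.
template = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0],
                     [0.0, 0.0, 1.0]])

def parametric_body(scale, stretch):
    out = template * scale
    out[:, 1] *= stretch  # the only two tricks this mannequin knows
    return out

def vertex_based_body(template, per_vertex_offsets):
    # "Sculpting clay": every vertex can move independently.
    return template + per_vertex_offsets

# A target shape the mannequin cannot express: one vertex is displaced
# on its own, the way an unusual pose or body shape might demand.
target = template.copy()
target[1] += np.array([0.5, 0.3, -0.2])

# Best the 2-knob mannequin can do over a grid of its parameters:
best_err = min(
    np.abs(parametric_body(s, h) - target).max()
    for s in np.linspace(0.5, 2.0, 31)
    for h in np.linspace(0.5, 2.0, 31)
)

# The vertex-based model matches the target exactly.
sculpted = vertex_based_body(template, target - template)
```

However the two knobs are set, the mannequin's second vertex keeps a y-coordinate of zero, so it can never reach the target; the per-vertex "clay" model fits it perfectly. That gap in expressiveness is the article's point about handling poses and body shapes standard mannequins can't fit.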

In a Nutshell

DuoMo is like a super-smart film editor who watches a shaky, chaotic video and reconstructs a perfect, stable 3D movie of what actually happened in the real world. It does this by first figuring out what the camera saw, and then using a second "brain" to correct the camera's mistakes and fill in the missing parts, ensuring the person stays grounded, realistic, and consistent throughout the whole scene.