DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

DuoMo is a state-of-the-art generative method that reconstructs globally consistent, world-space human motion from unconstrained, noisy videos. It employs a dual diffusion framework that first estimates camera-space motion and then refines it into a coherent world-space trajectory, without relying on a parametric body model.

Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer

Published 2026-03-04

Imagine you are watching a shaky, chaotic home video of a friend dancing in a park. The camera is moving wildly, the friend sometimes walks behind a tree (getting hidden), and the lighting changes. Your brain, however, is a magic machine: it instantly figures out exactly where your friend is in the real world, how they are moving, and even what they are doing while hidden behind the tree.

DuoMo is an artificial intelligence designed to do exactly what your brain does; under the hood, though, it has to solve a very tricky math problem.

Here is the story of how DuoMo works, explained without the heavy jargon.

The Problem: The "Shaky Camera" Dilemma

Most AI that tries to track people in videos gets confused by two things:

  1. The Camera is Moving: Is the person walking forward, or is the camera just zooming in?
  2. The "World" is Missing: If the person disappears behind a tree, the AI usually panics and forgets where they were supposed to go.

Old methods tried to solve this in one giant leap: "Guess the person's position in the real world directly from the video." But this is like trying to bake a perfect cake by throwing all the ingredients into a bowl at once and hoping for the best. It often results in a mess where the person floats in the air or walks through walls.

The DuoMo Solution: The Two-Step Dance

Instead of one giant guess, DuoMo uses two specialized experts working in a team. Think of it like a Director and a Stunt Coordinator.

Step 1: The Camera-Space Model (The "Stunt Coordinator")

First, DuoMo looks at the video and asks: "What is happening relative to the camera?"

  • The Analogy: Imagine you are sitting in a car watching a runner. The runner looks like they are moving left and right, up and down, based on how your car is swerving.
  • What it does: This model is great at looking at the raw video and saying, "Okay, the runner's arm is moving this way relative to the lens." It doesn't care about the real world yet; it just cares about what it sees on the screen.
  • The Catch: Because the camera is shaky, this view is "noisy" and distorted. If the camera moves, the runner looks like they are teleporting.

Step 2: The World-Space Model (The "Director")

Next, DuoMo takes that shaky, camera-relative view and asks: "Okay, but where are they actually standing in the park?"

  • The Analogy: The Director looks at the Stunt Coordinator's notes and says, "Wait, the camera was actually spinning left, so the runner didn't teleport; they just walked straight."
  • What it does: This model takes the "noisy" guess from Step 1 and cleans it up. It uses the rules of physics and common sense to say, "Humans don't float, and they don't walk through trees."
  • The Magic: If the runner disappears behind a tree, the Director doesn't panic. It says, "I know they were walking left, so they must still be walking left behind that tree." It fills in the missing gaps using its knowledge of how humans move.
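To make the two-step idea concrete, here is a toy Python sketch. Everything in it is invented for illustration (the function names, the simple per-frame setup, and treating the camera motion as known); the actual DuoMo models are diffusion networks, not these formulas. The point is just the division of labor: Stage 1 describes the person relative to the camera, and Stage 2 undoes the camera's own motion to recover a stable world-space trajectory.

```python
import numpy as np

def camera_space_estimate(video_frames):
    # Toy stand-in for Stage 1: read out the person's 3D position
    # relative to the camera for each frame.
    return np.array([f["person_in_cam"] for f in video_frames])

def world_space_refine(cam_positions, cam_rotations, cam_translations):
    # Toy stand-in for Stage 2: undo the camera's own motion so the
    # person's trajectory lives in one fixed world frame.
    world = [R @ p + t for p, R, t in
             zip(cam_positions, cam_rotations, cam_translations)]
    return np.array(world)

# A "shaky camera" that spins while the person walks straight along x.
frames, rotations, translations = [], [], []
for i in range(5):
    true_world_pos = np.array([float(i), 0.0, 0.0])  # walking straight
    angle = 0.3 * i                                  # camera pans a bit
    R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
    t = np.array([0.0, 0.0, 2.0])                    # camera offset
    # What Stage 1 sees: the person expressed in the camera's frame.
    # In this frame the straight walk looks like a curved "teleport".
    frames.append({"person_in_cam": R.T @ (true_world_pos - t)})
    rotations.append(R)
    translations.append(t)

cam = camera_space_estimate(frames)
world = world_space_refine(cam, rotations, translations)
# world recovers the straight-line walk: (0,0,0), (1,0,0), ..., (4,0,0)
```

In this toy, the camera-frame positions wobble with the pan angle, yet the world-space refinement recovers the plain straight walk, which is exactly the "the runner didn't teleport; the camera spun" correction the Director makes.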

The Secret Sauce: "Guided Sampling"

Sometimes, even the Director gets a little lost over a long time (like if the video is 20 seconds long). The AI might slowly drift, making the person end up in the wrong spot.

To fix this, DuoMo uses Guided Sampling.

  • The Analogy: Imagine the Director is walking a dog on a leash. The dog (the AI's guess) might wander off a bit, but every few seconds, the Director pulls the leash back to check the map (the original video).
  • How it works: The AI constantly checks its own work against the original video. "Wait, I said the person is here, but the video shows their feet are actually there. Let me adjust my guess." This keeps the person grounded in reality, preventing them from drifting off into space.
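The "leash" can be sketched with a toy 1-D trajectory. This is not DuoMo's actual sampler: `denoise_step` below is a crude stand-in for one diffusion denoising step (it just smooths the trajectory), and the guidance term simply nudges the estimate back toward the observed evidence on every iteration. The contrast with an unguided run shows why the pull matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth: a person walking steadily. "observed" plays the role of
# the noisy evidence extracted from the video.
true_traj = np.linspace(0.0, 10.0, 50)
observed = true_traj + rng.normal(0.0, 0.1, size=50)

def denoise_step(estimate):
    # Toy stand-in for one denoising step: smooth the trajectory.
    # On its own, repeated smoothing drifts away from the evidence.
    return np.convolve(estimate, np.ones(5) / 5, mode="same")

def guided_sampling(observed, steps=30, guidance=0.5):
    estimate = rng.normal(0.0, 5.0, size=observed.shape)  # start from noise
    for _ in range(steps):
        estimate = denoise_step(estimate)
        # The leash: pull the estimate back toward what the video shows.
        estimate = estimate + guidance * (observed - estimate)
    return estimate

guided = guided_sampling(observed)

# Same denoiser, no leash: the estimate drifts far from the evidence.
unguided = rng.normal(0.0, 5.0, size=observed.shape)
for _ in range(30):
    unguided = denoise_step(unguided)

err_guided = np.mean(np.abs(guided - true_traj))
err_unguided = np.mean(np.abs(unguided - true_traj))
# err_guided is much smaller: the periodic pull toward the observations
# keeps the long trajectory from drifting off.
```

The design choice mirrors the article's analogy: the generative model is free to wander between checks, but the observation term keeps snapping it back to the video, so errors cannot accumulate over a long clip.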

Why This is a Big Deal

  1. No "Body Suit" Needed: Most AI tries to fit a pre-made 3D body model (like a digital mannequin) onto the video. DuoMo is different; it builds the 3D shape vertex by vertex (like sculpting clay). This means it can handle weird poses or body shapes that standard "mannequins" can't fit.
  2. It Handles the Wild: It works great on shaky, amateur videos from the real world, not just perfect studio recordings.
  3. It Fills in the Blanks: If a person is hidden for a long time, DuoMo can "hallucinate" (predict) their movement in a way that makes physical sense, rather than just freezing them.
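Point 1 can be illustrated with a toy comparison. The two-knob "mannequin" and the four-vertex mesh below are invented for illustration (real parametric bodies like SMPL have far more parameters, and DuoMo's mesh is much denser); the sketch only shows why a handful of shape knobs cannot reach every target, while free per-vertex offsets can.

```python
import numpy as np

# A tiny "mannequin": a parametric model with only 2 shape knobs
# (uniform scale and a height stretch) applied to a fixed template.
template = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0],
                     [0.0, 0.0, 1.0]])

def parametric_body(scale, stretch):
    out = template * scale
    out[:, 1] *= stretch  # the only two tricks this mannequin knows
    return out

def vertex_based_body(template, per_vertex_offsets):
    # "Sculpting clay": every vertex can move independently.
    return template + per_vertex_offsets

# A target shape the mannequin cannot express: one vertex is displaced
# on its own, the way an unusual pose or body shape might demand.
target = template.copy()
target[1] += np.array([0.5, 0.3, -0.2])

# Best the 2-knob mannequin can do over a grid of its parameters:
best_err = min(
    np.abs(parametric_body(s, h) - target).max()
    for s in np.linspace(0.5, 2.0, 31)
    for h in np.linspace(0.5, 2.0, 31)
)

# The vertex-based model matches the target exactly.
sculpted = vertex_based_body(template, target - template)
```

However the two knobs are set, the mannequin's second vertex keeps a y-coordinate of zero, so it can never reach the target; the per-vertex "clay" model fits it perfectly. That gap in expressiveness is the article's point about handling poses and body shapes standard mannequins can't fit.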

In a Nutshell

DuoMo is like a super-smart film editor who watches a shaky, chaotic video and reconstructs a perfect, stable 3D movie of what actually happened in the real world. It does this by first figuring out what the camera saw, and then using a second "brain" to correct the camera's mistakes and fill in the missing parts, ensuring the person stays grounded, realistic, and consistent throughout the whole scene.