MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

This paper proposes MVHOI, a two-stage framework that leverages a 3D Foundation Model to generate view-consistent object priors and a controllable video generation model to synthesize high-fidelity textures, thereby enabling realistic long-duration Human-Object Interaction video reenactment with complex 3D manipulations.

Jinguang Tong, Jinbo Wu, Kaisiyuan Wang, Zhelun Shen, Xuan Huang, Mochu Xiang, Xuesong Li, Yingying Li, Haocheng Feng, Chen Zhao, Hang Zhou, Wei He, Chuong Nguyen, Jingdong Wang, Hongdong Li

Published 2026-03-17

Imagine you are a director trying to film a movie scene where an actor picks up a specific, intricate object (like a vintage camera or a glowing crystal), spins it around, and hands it to someone else.

In the world of AI video generation, this is currently a nightmare. If you ask an AI to "make a video of a hand spinning a camera," it often gets the hand right but turns the camera into a melting blob of pixels. The camera might suddenly change color, lose its shape, or look like a different object entirely, because the AI only ever saw the front and is just guessing what the back looks like.

MVHOI is a new AI system designed to solve this exact problem. Think of it as a two-step "Magic Puppeteer" that uses a 3D blueprint to keep objects looking real, no matter how wildly they are moved.

Here is how it works, broken down into simple concepts:

The Problem: The "Flat Map" vs. The "3D Globe"

Most current AI video tools are like a cartographer trying to draw a globe using only a flat piece of paper. They can handle simple movements (like sliding a cup across a table), but as soon as you try to rotate the cup or hide it behind a hand, the AI gets confused. It doesn't know what the "back" of the cup looks like, so it hallucinates (guesses) a new, often wrong, texture.

The Solution: The Two-Stage Magic Trick

MVHOI solves this by splitting the job into two distinct phases, using a special "3D Foundation Model" (think of this as a super-smart architect who knows how 3D objects work).

Stage 1: The "Ghost Blueprint" (3D-Aware Object Reenactment)

Before the AI tries to make a pretty video, it first builds a Ghost Blueprint.

  • The Analogy: Imagine you want to move a heavy statue. Instead of trying to drag the statue directly, you first build a wireframe skeleton of it in the air. You move the skeleton exactly how you want the statue to move.
  • How it works: The system takes the video of the hand moving (the "driving video") and the photos of the object you want to use. It creates a "Unified Object Anchor"—a hidden, 3D digital twin of the object in the AI's brain. It doesn't worry about the pretty details yet; it just figures out the geometry.
  • The Result: It produces a blurry, low-quality video of the object moving. But here's the magic: the object stays perfectly shaped, doesn't melt, and rotates correctly in 3D space. It's the "skeleton" of the video (a toy sketch of this idea follows below).
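
To make the "Ghost Blueprint" concrete, here is a minimal Python sketch. It is not the paper's actual pipeline (MVHOI builds its Unified Object Anchor with a 3D foundation model); the random point cloud, the `rotation_y` pose, and the orthographic `render_coarse` splat are simplified stand-ins. The point they illustrate is why a fixed 3D anchor cannot drift: every frame re-renders the same geometry under a new pose.

```python
import numpy as np

def rotation_y(theta: float) -> np.ndarray:
    """3x3 rotation about the vertical axis; a stand-in for the per-frame object pose."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def render_coarse(points: np.ndarray, pose: np.ndarray, size: int = 64) -> np.ndarray:
    """Orthographic splat of the posed anchor into a coarse occupancy image.

    The real system renders a low-fidelity but geometry-consistent video;
    here we just mark which pixels the object covers.
    """
    posed = points @ pose.T                       # rigid transform: same shape every frame
    xy = posed[:, :2]                             # drop depth (orthographic projection)
    px = np.clip(((xy + 1.0) * 0.5 * (size - 1)).astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=np.float32)
    img[px[:, 1], px[:, 0]] = 1.0
    return img

# The "anchor": one fixed 3D point set shared by every frame, so the object
# can never melt into a different shape as it rotates.
rng = np.random.default_rng(0)
anchor = rng.uniform(-0.5, 0.5, size=(2000, 3))   # toy object geometry

# Driving signal: per-frame poses (here, a simple spin stands in for the driving video).
coarse_video = [render_coarse(anchor, rotation_y(t)) for t in np.linspace(0, np.pi, 8)]
print(len(coarse_video), coarse_video[0].shape)   # 8 geometry-consistent frames
```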

Stage 2: The "High-Definition Painter" (Multi-Reference Video Generation)

Now that the skeleton is moving perfectly, the second stage comes in to paint the skin on it.

  • The Analogy: Imagine a painter who has a rough sketch of a person running. The painter has a photo album of that person from every angle (front, back, side). The painter looks at the sketch, sees the person is turning their back, and immediately grabs the "back-view" photo from the album to paint that specific frame.
  • The Problem it Solves: Usually, AI painters get confused and might paint a "front view" face on a "back view" body.
  • The Fix: MVHOI uses the "Ghost Blueprint" from Stage 1 as a guide. It tells the painter: "Hey, right now the object is facing left, so look at the 'left-view' photo in the album, not the front one."
  • The Result: The AI grabs the correct texture from the correct angle, ensuring the object looks sharp, realistic, and consistent, even when it's spinning 360 degrees or being hidden behind a hand (the view-matching step is sketched below).
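
Here is a toy version of that view-matching step. The reference directions, the `pick_reference` helper, and the hard nearest-view selection are all illustrative assumptions; the real model conditions a video generator on multiple references rather than copying pixels from one photo. The sketch shows only the core intuition: use the Stage 1 blueprint's per-frame orientation to decide which reference view matters right now.

```python
import numpy as np

# Hypothetical multi-view reference album: each photo tagged with the direction
# the camera was looking from (unit vectors in object coordinates).
references = {
    "front": np.array([0.0, 0.0, 1.0]),
    "back":  np.array([0.0, 0.0, -1.0]),
    "left":  np.array([-1.0, 0.0, 0.0]),
    "right": np.array([1.0, 0.0, 0.0]),
}

def pick_reference(view_dir: np.ndarray) -> str:
    """Return the reference view whose camera direction best matches the
    current frame's viewpoint (largest cosine similarity)."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    return max(references, key=lambda name: float(references[name] @ view_dir))

# Per-frame viewing directions read off the Stage 1 "ghost blueprint":
for theta in np.linspace(0.0, np.pi, 5):
    frame_view = np.array([np.sin(theta), 0.0, np.cos(theta)])
    print(f"object rotated {np.degrees(theta):5.1f} deg -> "
          f"use the '{pick_reference(frame_view)}' photo")
```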

The Secret Sauce: The "Cross-Iterative Loop"

If you try to make a long video (like 10 seconds), AI usually gets tired and starts to drift (the object might slowly turn into a different shape).

MVHOI uses a Cross-Iterative Loop.

  • The Analogy: Imagine a relay race. Instead of one runner trying to run the whole marathon and getting exhausted, the team passes the baton every few seconds.
  • How it works: The system generates a short, perfect clip. Then, it takes the end of that perfect clip and uses it as the start for the next clip. By constantly refreshing the "perfect" state, it prevents the video from getting blurry or weird over time (see the sketch after this list).
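
A minimal sketch of that relay-race loop, assuming a hypothetical `generate_clip` function standing in for the video model (the clip length and overlap size below are made-up parameters, not the paper's):

```python
def generate_clip(condition_frames, length=16):
    """Produce `length` frames continuing from `condition_frames` (toy stub:
    frames are just integers so the hand-off is easy to trace)."""
    start = condition_frames[-1] if condition_frames else 0
    return [start + i + 1 for i in range(length)]

def generate_long_video(total_frames=64, clip_len=16, overlap=4):
    """Relay-race generation: each clip starts from the tail of the previous
    clip, so errors are reset at every hand-off instead of compounding."""
    video = generate_clip([], length=clip_len)
    while len(video) < total_frames:
        handoff = video[-overlap:]                # the "baton": last few frames
        video.extend(generate_clip(handoff, length=clip_len))
    return video[:total_frames]

print(len(generate_long_video()))                 # 64 frames stitched from short clips
```

The key design choice is that every new clip is conditioned on frames that were already finalized, so the generator never has to "run the whole marathon" in one pass.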

Why This Matters

Before this, if you wanted to swap the phone in a character's hand for a different phone in a video, the AI would likely make the new phone look like a melting toaster.

With MVHOI:

  1. It understands 3D space: It knows that an object has a back, a side, and a top, even if the camera never sees them.
  2. It remembers the object: The object stays the same object, with the same texture, throughout the whole video.
  3. It handles complex interactions: It works even when hands cover the object or the object spins wildly.

In short, MVHOI bridges the gap between "flat, 2D guessing" and "true, 3D understanding," allowing us to create digital videos where objects behave exactly like real physical things.
