MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

This paper proposes MVHOI, a two-stage framework that leverages a 3D Foundation Model to generate view-consistent object priors and a controllable video generation model to synthesize high-fidelity textures, thereby enabling realistic long-duration Human-Object Interaction video reenactment with complex 3D manipulations.

Jinguang Tong, Jinbo Wu, Kaisiyuan Wang, Zhelun Shen, Xuan Huang, Mochu Xiang, Xuesong Li, Yingying Li, Haocheng Feng, Chen Zhao, Hang Zhou, Wei He, Chuong Nguyen, Jingdong Wang, Hongdong Li

Published 2026-03-17

Imagine you are a director trying to film a movie scene where an actor picks up a specific, intricate object (like a vintage camera or a glowing crystal), spins it around, and hands it to someone else.

In the world of AI video generation, this is currently a nightmare. If you ask an AI to "make a video of a hand spinning a camera," it often gets the hand right but turns the camera into a melting blob of pixels. The camera might suddenly change color, lose its shape, or look like a different object entirely, because the AI only ever saw the front and is just guessing what the back looks like.

MVHOI is a new AI system designed to solve this exact problem. Think of it as a two-step "Magic Puppeteer" that uses a 3D blueprint to keep objects looking real, no matter how wildly they are moved.

Here is how it works, broken down into simple concepts:

The Problem: The "Flat Map" vs. The "3D Globe"

Most current AI video tools are like a cartographer trying to draw a globe using only a flat piece of paper. They can handle simple movements (like sliding a cup across a table), but as soon as you try to rotate the cup or hide it behind a hand, the AI gets confused. It doesn't know what the "back" of the cup looks like, so it hallucinates (guesses) a new, often wrong, texture.

The Solution: The Two-Stage Magic Trick

MVHOI solves this by splitting the job into two distinct phases, using a special "3D Foundation Model" (think of this as a super-smart architect who knows how 3D objects work).

Stage 1: The "Ghost Blueprint" (3D-Aware Object Reenactment)

Before the AI tries to make a pretty video, it first builds a Ghost Blueprint.

  • The Analogy: Imagine you want to move a heavy statue. Instead of trying to drag the statue directly, you first build a wireframe skeleton of it in the air. You move the skeleton exactly how you want the statue to move.
  • How it works: The system takes the video of the hand moving (the "driving video") and the photos of the object you want to use. It creates a "Unified Object Anchor"—a hidden, 3D digital twin of the object in the AI's brain. It doesn't worry about the pretty details yet; it just figures out the geometry.
  • The Result: It produces a blurry, low-quality video of the object moving. But here's the magic: the object stays perfectly shaped, doesn't melt, and rotates correctly in 3D space. It's the "skeleton" of the video (a toy sketch of this idea follows below).
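
To make the "Ghost Blueprint" concrete, here is a minimal Python sketch. It is not the paper's actual pipeline (MVHOI builds its Unified Object Anchor with a 3D foundation model); the random point cloud, the `rotation_y` pose, and the orthographic `render_coarse` splat are simplified stand-ins. The point they illustrate is why a fixed 3D anchor cannot drift: every frame re-renders the same geometry under a new pose.

```python
import numpy as np

def rotation_y(theta: float) -> np.ndarray:
    """3x3 rotation about the vertical axis; a stand-in for the per-frame object pose."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def render_coarse(points: np.ndarray, pose: np.ndarray, size: int = 64) -> np.ndarray:
    """Orthographic splat of the posed anchor into a coarse occupancy image.

    The real system renders a low-fidelity but geometry-consistent video;
    here we just mark which pixels the object covers.
    """
    posed = points @ pose.T                       # rigid transform: same shape every frame
    xy = posed[:, :2]                             # drop depth (orthographic projection)
    px = np.clip(((xy + 1.0) * 0.5 * (size - 1)).astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=np.float32)
    img[px[:, 1], px[:, 0]] = 1.0
    return img

# The "anchor": one fixed 3D point set shared by every frame, so the object
# can never melt into a different shape as it rotates.
rng = np.random.default_rng(0)
anchor = rng.uniform(-0.5, 0.5, size=(2000, 3))   # toy object geometry

# Driving signal: per-frame poses (here, a simple spin stands in for the driving video).
coarse_video = [render_coarse(anchor, rotation_y(t)) for t in np.linspace(0, np.pi, 8)]
print(len(coarse_video), coarse_video[0].shape)   # 8 geometry-consistent frames
```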

Stage 2: The "High-Definition Painter" (Multi-Reference Video Generation)

Now that the skeleton is moving perfectly, the second stage comes in to paint the skin on it.

  • The Analogy: Imagine a painter who has a rough sketch of a person running. The painter has a photo album of that person from every angle (front, back, side). The painter looks at the sketch, sees the person is turning their back, and immediately grabs the "back-view" photo from the album to paint that specific frame.
  • The Problem it Solves: Usually, AI painters get confused and might paint a "front view" face on a "back view" body.
  • The Fix: MVHOI uses the "Ghost Blueprint" from Stage 1 as a guide. It tells the painter: "Hey, right now the object is facing left, so look at the 'left-view' photo in the album, not the front one."
  • The Result: The AI grabs the correct texture from the correct angle, ensuring the object looks sharp, realistic, and consistent, even when it's spinning 360 degrees or being hidden behind a hand (the view-matching step is sketched below).
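
Here is a toy version of that view-matching step. The reference directions, the `pick_reference` helper, and the hard nearest-view selection are all illustrative assumptions; the real model conditions a video generator on multiple references rather than copying pixels from one photo. The sketch shows only the core intuition: use the Stage 1 blueprint's per-frame orientation to decide which reference view matters right now.

```python
import numpy as np

# Hypothetical multi-view reference album: each photo tagged with the direction
# the camera was looking from (unit vectors in object coordinates).
references = {
    "front": np.array([0.0, 0.0, 1.0]),
    "back":  np.array([0.0, 0.0, -1.0]),
    "left":  np.array([-1.0, 0.0, 0.0]),
    "right": np.array([1.0, 0.0, 0.0]),
}

def pick_reference(view_dir: np.ndarray) -> str:
    """Return the reference view whose camera direction best matches the
    current frame's viewpoint (largest cosine similarity)."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    return max(references, key=lambda name: float(references[name] @ view_dir))

# Per-frame viewing directions read off the Stage 1 "ghost blueprint":
for theta in np.linspace(0.0, np.pi, 5):
    frame_view = np.array([np.sin(theta), 0.0, np.cos(theta)])
    print(f"object rotated {np.degrees(theta):5.1f} deg -> "
          f"use the '{pick_reference(frame_view)}' photo")
```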

The Secret Sauce: The "Cross-Iterative Loop"

If you try to make a long video (like 10 seconds), AI usually gets tired and starts to drift (the object might slowly turn into a different shape).

MVHOI uses a Cross-Iterative Loop.

  • The Analogy: Imagine a relay race. Instead of one runner trying to run the whole marathon and getting exhausted, the team passes the baton every few seconds.
  • How it works: The system generates a short, perfect clip. Then, it takes the end of that perfect clip and uses it as the start for the next clip. By constantly refreshing the "perfect" state, it prevents the video from getting blurry or weird over time (see the sketch after this list).
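
A minimal sketch of that relay-race loop, assuming a hypothetical `generate_clip` function standing in for the video model (the clip length and overlap size below are made-up parameters, not the paper's):

```python
def generate_clip(condition_frames, length=16):
    """Produce `length` frames continuing from `condition_frames` (toy stub:
    frames are just integers so the hand-off is easy to trace)."""
    start = condition_frames[-1] if condition_frames else 0
    return [start + i + 1 for i in range(length)]

def generate_long_video(total_frames=64, clip_len=16, overlap=4):
    """Relay-race generation: each clip starts from the tail of the previous
    clip, so errors are reset at every hand-off instead of compounding."""
    video = generate_clip([], length=clip_len)
    while len(video) < total_frames:
        handoff = video[-overlap:]                # the "baton": last few frames
        video.extend(generate_clip(handoff, length=clip_len))
    return video[:total_frames]

print(len(generate_long_video()))                 # 64 frames stitched from short clips
```

The key design choice is that every new clip is conditioned on frames that were already finalized, so the generator never has to "run the whole marathon" in one pass.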

Why This Matters

Before this, if you wanted to swap the phone in a character's hand for a different phone in a video, the AI would likely make the new phone look like a melting toaster.

With MVHOI:

  1. It understands 3D space: It knows that an object has a back, a side, and a top, even if the camera never sees them.
  2. It remembers the object: The object stays the same object, with the same texture, throughout the whole video.
  3. It handles complex interactions: It works even when hands cover the object or the object spins wildly.

In short, MVHOI bridges the gap between "flat, 2D guessing" and "true, 3D understanding," allowing us to create digital videos where objects behave exactly like real physical things.
