ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Imagine you are watching a magic trick on a flat, 2D television screen. A person opens a fridge, grabs a soda, and closes the door. To your eyes, it looks real. But if you tried to build a physical model of that scene based only on that video, you'd run into a huge problem: The "Flat Screen Illusion."

On a 2D screen, you can't tell if the fridge door is actually swinging open on a hinge, or if the whole fridge is just sliding sideways. You can't tell if the person's hand is inside the fridge or just floating in front of it.

This is the challenge computer scientists face when trying to teach AI to create realistic 3D animations of people interacting with complex objects (like opening cabinets or drawers). Most current AI tools are like "rigid painters": they can paint a box moving, but they can't paint a box with a moving door.

Enter ArtHOI (Articulated Human-Object Interaction). Think of ArtHOI not as a painter, but as a 3D sculptor who works backward from a video.

Here is how it works, broken down into simple steps:

1. The Problem: The "Flat Screen" Confusion

Imagine you are trying to figure out how a door opens just by watching a security camera feed.

Old AI methods (like ZeroHSI) try to guess the 3D shape directly from the 2D video. They often get confused. They might think the whole cabinet is sliding across the room instead of the door swinging open. Or, they might make the person's hand pass through the door like a ghost.
The Goal: We want the AI to understand that the door has a hinge, moves in a specific arc, and that the human hand must stop at the door, not go through it.

2. The Solution: The "Two-Stage Sculpting" Process

Instead of guessing the whole scene at once (which is like trying to solve a giant puzzle while blindfolded), ArtHOI breaks the job into two distinct steps.

Stage 1: The "Moving Parts" Detective

First, the AI looks at the video and asks: "What is moving, and what is staying still?"

The Flow Map: It uses a tool called "Optical Flow" (think of it as a wind map for pixels) to see how every tiny dot in the video is moving.
The Segmentation: If a group of pixels is moving together along the same arc, the AI marks them as a "Door." If other pixels are staying still, it marks them as the "Cabinet Frame."
The Result: The AI builds a rigid skeleton of the object first. It figures out exactly where the hinges are and how the door swings, before worrying about the human. It's like building the furniture first, ensuring the door actually works, before inviting the person in.

Stage 2: The "Human Dancer"

Now that the AI knows exactly how the fridge door moves, it brings in the human.

The Anchor: The AI uses the 3D door it just built as a "hard rule." It tells the human animation: "Your hand must touch the door handle, and your hand cannot go inside the metal."
The Fix: It adjusts the human's movement so it fits perfectly with the door's motion. If the door swings open, the human's hand moves with it. If the door hits a limit, the human stops. This prevents the "ghost hand" problem where hands float through objects.

3. Why This is a Big Deal (The Analogy)

Imagine you are trying to choreograph a dance between a human and a complex machine.

The Old Way: You tell the human and the machine to dance together at the same time, hoping they don't crash. Often, they trip over each other, or the machine breaks because the human pushed it the wrong way.
The ArtHOI Way: You first program the machine to dance perfectly on its own. Once the machine's moves are locked in, you teach the human to dance around the machine, ensuring they hold hands at the right moment and never collide.

4. The Magic Ingredients

No 3D Training Data Needed: Teaching an AI how a person opens a door usually requires large collections of 3D motion recordings. ArtHOI is "Zero-Shot," meaning it needs none of that: it works from a single 2D video generated from a text prompt (like "Open the fridge") and recovers the 3D motion from that video alone.
Physical Reality: It cares about physical plausibility. It keeps the door swinging on its hinge instead of drifting away from the frame, and it makes sure that when a hand grabs a handle, the hand is actually touching it.

Summary

ArtHOI is a new AI framework that turns flat, 2D videos into realistic 3D scenes where people interact with complex, moving objects (like opening fridges or cabinets).

It does this by first figuring out how the object moves (like a door on a hinge) and then making the human move in a way that respects those rules. This prevents the AI from creating impossible physics, like hands passing through walls or doors sliding sideways instead of swinging. It's the difference between a cartoon that looks "okay" and a simulation that feels physically real.

1. Problem Statement

The paper addresses the challenge of synthesizing physically plausible articulated Human-Object Interactions (HOI) without relying on 3D or 4D ground truth supervision (zero-shot setting).

The Gap: Existing zero-shot methods (e.g., ZeroHSI) leverage video diffusion models but are limited to rigid object manipulation. They treat objects as single rigid bodies, failing to model complex part-wise articulation (e.g., opening a fridge door, sliding a drawer).
The Challenge: Jointly optimizing human motion and object articulation from monocular video is unstable due to monocular ambiguity. It is difficult to distinguish whether motion in the 2D video arises from human movement, object articulation, or a combination, leading to geometrically inconsistent and physically implausible results (e.g., object parts drifting apart, hands penetrating objects).

2. Methodology: ArtHOI Framework

ArtHOI formulates the synthesis task not as an end-to-end generation problem, but as a 4D reconstruction problem from monocular video priors. It employs a decoupled two-stage pipeline to resolve ambiguity and ensure physical consistency.

Stage I: Object Articulation Reconstruction

The goal is to recover the 4D dynamics of the articulated object before addressing human motion.

Flow-based Part Segmentation:
- Uses optical flow (computed from CoTracker point tracks) to distinguish between dynamic parts (e.g., a moving door) and static parts (e.g., a cabinet frame).
- SAM-guided Masking: Clusters dynamic/static points and uses Segment Anything Model (SAM) to generate dense, boundary-accurate masks for 3D assignment.
- 3D Assignment: Projects 2D masks to 3D Gaussian representations, assigning Gaussians to dynamic or static sets.
- Quasi-static Binding: Identifies "quasi-static" points at articulation boundaries (hinges) and links them to static neighbors to enforce rigid-body constraints.
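
As a rough illustration of the first step above, the sketch below labels tracked 2D points as dynamic or static by how far they travel from their first-frame position. It is a minimal stand-in, not the paper's procedure: the function name `split_dynamic_static`, the `(T, N, 2)` track layout, and the 2-pixel threshold are all assumptions, and the SAM refinement and 3D assignment steps are not shown.

```python
import numpy as np

def split_dynamic_static(tracks, motion_thresh=2.0):
    """Label tracked points as dynamic (True) or static (False).

    tracks: (T, N, 2) pixel positions of N points over T frames.
    motion_thresh: hypothetical displacement threshold in pixels.
    """
    tracks = np.asarray(tracks, dtype=float)
    # Distance of every point from its own first-frame position, per frame.
    disp = np.linalg.norm(tracks - tracks[0], axis=-1)  # shape (T, N)
    # A point is dynamic if it ever moves farther than the threshold.
    return disp.max(axis=0) > motion_thresh

# Toy scene: two points on a fixed frame, two points on a door swinging
# 60 degrees about a hinge located at pixel (50, 50).
T = 10
angles = np.linspace(0.0, np.pi / 3, T)
static = np.tile(np.array([[10.0, 10.0], [12.0, 30.0]]), (T, 1, 1))
hinge = np.array([50.0, 50.0])
radii = np.array([20.0, 40.0])
unit = np.stack([np.cos(angles), np.sin(angles)], axis=-1)[:, None, :]  # (T, 1, 2)
door = hinge + radii[None, :, None] * unit                              # (T, 2, 2)

tracks = np.concatenate([static, door], axis=1)  # (T, 4, 2)
is_dynamic = split_dynamic_static(tracks)
print(is_dynamic)
```

In this toy case the two frame points come out static and the two door points come out dynamic. A real pipeline would also need to handle tracker noise and camera motion, which a fixed threshold does not.
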
Optimization:
- Reconstructs articulated motion using SE(3) transformations for dynamic parts.
- Loss Functions:
  - Reconstruction Loss ($L_r$): Matches rendered output to video priors.
  - Articulation Regularization Loss ($L_a$): Enforces distance preservation between quasi-static binding pairs to maintain rigid-body structure.
  - Tracking Loss ($L_{tr}$): Aligns 2D projections of dynamic Gaussians with point tracker trajectories.
  - Smoothness Loss ($L_s$): Ensures temporal coherence of articulation.
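
To make the distance-preservation idea concrete, here is a small sketch under simplifying assumptions: the hinge axis is the z-axis through the origin, one door point sits on the hinge, and it is bound to one frame point. The helper names (`rotation_about_z`, `apply_se3`, `binding_loss`) are hypothetical and the squared-error form is a guess at a reasonable penalty, not the paper's exact $L_a$.

```python
import numpy as np

def rotation_about_z(theta):
    """3x3 rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def apply_se3(points, R, t):
    """Apply the rigid transform x -> R x + t to an (N, 3) array."""
    return points @ R.T + t

def binding_loss(dyn_pts, static_pts, pairs, rest_dist):
    """Mean squared change in distance over (dynamic, static) index pairs."""
    i, j = pairs[:, 0], pairs[:, 1]
    dist = np.linalg.norm(dyn_pts[i] - static_pts[j], axis=-1)
    return float(np.mean((dist - rest_dist) ** 2))

# Door points in the rest pose: one on the hinge, one on the free edge.
door_rest = np.array([[0.0, 0.0, 0.5],
                      [1.0, 0.0, 0.5]])
frame = np.array([[0.0, 0.0, 0.0]])   # static neighbor on the cabinet frame
pairs = np.array([[0, 0]])            # bind door point 0 to frame point 0
rest_dist = np.linalg.norm(door_rest[pairs[:, 0]] - frame[pairs[:, 1]], axis=-1)

# Candidate motion 1: the door swings 45 degrees about the hinge axis.
swung = apply_se3(door_rest, rotation_about_z(np.pi / 4), np.zeros(3))
# Candidate motion 2: the door slides sideways without rotating.
slid = apply_se3(door_rest, np.eye(3), np.array([0.3, 0.0, 0.0]))

loss_swing = binding_loss(swung, frame, pairs, rest_dist)
loss_slide = binding_loss(slid, frame, pairs, rest_dist)
print(loss_swing, loss_slide)
```

The swing leaves the hinge point where it was, so its binding distance is unchanged and the penalty is zero, while the slide stretches the binding and is penalized. This is one way such a term can favor "door on a hinge" over "door drifting away" when both would look similar in 2D.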

Stage II: Human Motion Refinement

With the 4D object scaffold fixed, the framework synthesizes human motion conditioned on the reconstructed object.

3D Contact Keypoint Derivation:
- Since 3D contact points are unobservable in monocular video, the method infers them by comparing three 2D regions: the human mask, the rendered silhouette of the reconstructed object, and the visible object mask.
- Pixels that fall inside both the human mask and the reconstructed object silhouette, but outside the visible object mask, are where the human occludes the object, and they are taken to indicate contact. These 2D regions are lifted to 3D using the depth of the nearest dynamic object Gaussians.
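
The mask logic above can be sketched in a few lines. This is an illustrative reading, not the authors' code: `contact_pixels` and `lift_to_3d` are hypothetical names, the camera is assumed to be a simple pinhole with intrinsics `K`, and "nearest" is taken to mean nearest in the image plane.

```python
import numpy as np

def contact_pixels(human_mask, object_silhouette, object_mask):
    """Pixels covered by the human and the rendered object, but where the
    object itself is not visible (i.e. the human is in front of it)."""
    return human_mask & object_silhouette & ~object_mask

def lift_to_3d(pixels_uv, gauss_uv, gauss_depth, K):
    """Back-project pixels using the depth of the nearest projected Gaussian.

    pixels_uv: (M, 2) pixel coordinates (u, v).
    gauss_uv: (G, 2) image-plane projections of dynamic Gaussians.
    gauss_depth: (G,) camera-space depth of those Gaussians.
    K: (3, 3) pinhole intrinsics (assumed camera model).
    """
    d2 = ((pixels_uv[:, None, :] - gauss_uv[None, :, :]) ** 2).sum(axis=-1)
    depth = gauss_depth[d2.argmin(axis=1)]                 # (M,)
    homog = np.hstack([pixels_uv, np.ones((len(pixels_uv), 1))])
    rays = homog @ np.linalg.inv(K).T                      # rays with z = 1
    return rays * depth[:, None]

# Toy 4x4 masks: the human covers columns 1-2, the rendered object covers
# columns 2-3, and the object is only visible in column 3.
human = np.zeros((4, 4), dtype=bool)
human[:, 1:3] = True
silhouette = np.zeros((4, 4), dtype=bool)
silhouette[:, 2:4] = True
visible_obj = np.zeros((4, 4), dtype=bool)
visible_obj[:, 3] = True

contact = contact_pixels(human, silhouette, visible_obj)
rows, cols = np.nonzero(contact)
uv = np.stack([cols, rows], axis=1).astype(float)

K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
gauss_uv = np.array([[3.0, 1.0]])
gauss_depth = np.array([2.0])
contact_3d = lift_to_3d(uv, gauss_uv, gauss_depth, K)
print(contact_3d)
```

Only column 2 survives the mask test here, and every contact pixel inherits the depth of the single door Gaussian.
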
Optimization:
- Optimizes SMPL-X parameters for the human.
- Loss Functions:
  - Kinematic Loss ($L_k$): Pulls hand joints toward the derived 3D contact keypoints.
  - Collision Loss ($L_c$): Penalizes penetration between the human mesh and the object.
  - Foot Sliding Loss ($L_{fs}$): Prevents unrealistic foot movement during ground contact.
  - Prior Loss ($L_p$): Regularizes motion toward the initial video diffusion model estimate to ensure naturalness.
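
The following sketch shows plausible forms for three of these terms. The exact formulations and weights are not given in this summary, so everything here is an assumption for illustration: the function names, the squared-error forms, the z-up convention, and the rule that a foot counts as planted only when it is in contact in two consecutive frames. The collision term is omitted because it needs a mesh or distance field.

```python
import numpy as np

def kinematic_loss(hand_joints, contact_targets):
    """Mean squared distance between hand joints and contact keypoints.
    Both arrays have shape (T, 3)."""
    return float(np.mean(np.sum((hand_joints - contact_targets) ** 2, axis=-1)))

def foot_sliding_loss(foot_pos, in_contact):
    """Mean horizontal foot speed over steps where the foot is planted.

    foot_pos: (T, 3) foot positions, z assumed to point up.
    in_contact: (T,) boolean ground-contact flags.
    """
    vel = foot_pos[1:] - foot_pos[:-1]
    planted = in_contact[1:] & in_contact[:-1]
    if not planted.any():
        return 0.0
    horizontal_speed = np.linalg.norm(vel[:, :2], axis=-1)
    return float(horizontal_speed[planted].mean())

def prior_loss(params, init_params):
    """Mean squared deviation from the initial pose estimate."""
    return float(np.mean((params - init_params) ** 2))

# Toy inputs: the hand hovers 0.1 above its contact target, and the foot
# slides 0.1 per frame while planted for the first two frames.
hands = np.zeros((3, 3))
targets = np.tile(np.array([0.0, 0.0, 0.1]), (3, 1))
foot = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.05]])
planted_flags = np.array([True, True, False])

l_k = kinematic_loss(hands, targets)
l_fs = foot_sliding_loss(foot, planted_flags)
l_p = prior_loss(np.ones(4), np.ones(4))
print(l_k, l_fs, l_p)
```

In an optimizer these terms would be combined as a weighted sum over the SMPL-X parameters, with the object held fixed from Stage I.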

3. Key Contributions

First Zero-Shot Articulated HOI Framework: ArtHOI is the first method to synthesize interactions with articulated objects (doors, cabinets, etc.) without 3D supervision, extending zero-shot capabilities beyond rigid manipulation.
Reconstruction-Informed Synthesis: Instead of end-to-end generation, it treats synthesis as an inverse rendering problem, explicitly modeling part-wise articulation and contact geometry to resolve monocular ambiguity.
Decoupled Two-Stage Pipeline: By separating object articulation recovery from human motion synthesis, the method avoids the instability of joint optimization, ensuring geometric consistency and physical plausibility.
Flow-based Part Segmentation: Introduces a novel geometric cue using optical flow to disentangle dynamic and static regions in monocular video without category-specific templates.

4. Experimental Results

The authors evaluated ArtHOI against state-of-the-art baselines (TRUMANS, LINGO, CHOIS, ZeroHSI, D3D-HOI, 3DADN) on metrics including contact accuracy, penetration reduction, and articulation fidelity.

Quantitative Performance:
- Contact Accuracy: Achieved 75.64% contact consistency, significantly outperforming ZeroHSI (61.95%) and rigid-object baselines.
- Physical Plausibility: Achieved the lowest Penetration% (0.08) and Foot Sliding (0.31), demonstrating superior physical grounding.
- Articulation Accuracy: Reduced mean rotation error to 6.71°, a ~73% improvement over specialized articulated object reconstruction methods (D3D-HOI: 25.13°).
- Semantic Alignment: Achieved the highest X-CLIP score (0.244), indicating better alignment between text prompts and synthesized motion.
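
The headline percentages can be checked directly from the numbers quoted above. The helper name `relative_reduction` is only for illustration.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline

# Mean rotation error in degrees: D3D-HOI vs. ArtHOI.
rotation_gain = relative_reduction(25.13, 6.71)
# Contact consistency gap in percentage points: ArtHOI vs. ZeroHSI.
contact_gap = 75.64 - 61.95

print(f"rotation error reduction: {rotation_gain:.1%}")
print(f"contact gap: {contact_gap:.2f} points")
```

The rotation error reduction works out to roughly 73%, matching the figure quoted above, and the contact gap is 13.69 percentage points.
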
Qualitative & User Study:
- In a user study with 51 participants, ArtHOI was preferred over all baselines in Realism, Contact Quality, and Motion Smoothness.
- Specifically, it achieved a 98.04% overall preference rate against TRUMANS and 89.42% against ZeroHSI.
Ablation Studies: Confirmed that the two-stage decoupling and specific loss terms (Kinematic loss $L_k$ and Articulation regularization $L_a$) are critical; removing them led to significant drops in contact accuracy and increases in articulation errors.

5. Significance and Impact

Bridging Generation and Geometry: ArtHOI successfully bridges the gap between generative AI (video diffusion) and geometric reasoning (4D reconstruction), producing interactions that are both semantically aligned and physically grounded.
Applications:
- Robotics: Generates scalable training data for manipulation policies involving articulated objects without expensive motion capture.
- VR/AR & Gaming: Enables the creation of realistic human-object interactions for virtual environments without hand-crafted animations.
- Embodied AI: Facilitates the rapid generation of diverse, physically plausible 4D datasets for action recognition and scene understanding research.
Efficiency: The method is efficient, taking approximately 30 minutes per scene on a single NVIDIA A6000 GPU, making it suitable for rapid prototyping.

In summary, ArtHOI extends zero-shot HOI synthesis from rigid-body approximations to a geometry-aware framework that handles articulated structures in 3D space.