GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Imagine you are a director trying to film a scene where an actor picks up a specific product, say a unique, limited-edition coffee mug, and holds it while talking.

In the real world, this is easy. But in the digital world of AI video generation, it's a nightmare. If you ask a standard AI to "make a video of a person holding a mug," it usually does one of two things:

The "Glitchy" Approach: The mug looks great in the first second, but by the third second, it starts melting, changing color, or turning into a different mug entirely.
The "Sticker" Approach: The AI pastes a picture of the mug onto the hand, but the hand just floats above it. There's no real interaction; the mug doesn't look like it's actually being held.

GenHOI is a new tool designed to fix exactly this problem. Think of it as a "smart editor" that plugs into existing video AI models to teach them how to handle objects correctly.

Here is how it works, using some simple analogies:

1. The Problem: The "Fading Memory"

Standard AI models have a short attention span. When they look at a reference image (the "mug" you want to use) at the start of the video, they remember it well for the first few frames. But as the video plays, that memory fades. By the end, the AI forgets what the mug looked like and starts hallucinating a new one.

2. The Solution: GenHOI's Two Superpowers

GenHOI adds two special "brain upgrades" to the AI to solve this.

Upgrade A: The "Sliding Spotlight" (Head-Sliding RoPE)

The Analogy: Imagine you are trying to remember a song. If you only listen to the first note once, you'll forget it by the end of the song. But if you have a chorus that repeats and shifts slightly throughout the song, you remember it perfectly.
How it works: Usually, the AI looks at the reference object (the mug) as if it only exists at "Time Zero." GenHOI changes this. It takes the information about the mug and "slides" it across the timeline of the video, spreading it out evenly.
The Result: Instead of the AI forgetting the mug after 2 seconds, the "memory" of the mug is refreshed in every single frame. The mug stays the same color, shape, and logo from start to finish, even in a long video.

Upgrade B: The "Smart Mask" (Spatial Attention Gate)

The Analogy: Imagine you are painting a picture of a person holding a cup. You want to be very careful and detailed when painting the cup and the hand holding it. But when you paint the background (the wall or the table), you don't want to accidentally smear the cup's design onto the wall.
How it works: GenHOI puts up a digital "fence."
- Inside the fence (The Hand & Object): The AI is allowed to look at the reference photo and copy the mug's details perfectly.
- Outside the fence (The Background): The AI is strictly told, "Do not look at the mug photo here." It must rely on the original video for the background.
The Result: The hand and mug look realistic and consistent, and the background does not pick up stray artifacts or color shifts from the reference photo.

3. Why This Matters

Before GenHOI, if you wanted to make a video for an e-commerce site showing a model holding your new shoe, you'd have to film it in a studio with perfect lighting.

With GenHOI, you can take a video of a person walking down the street (even if the lighting is messy) and edit it so that their empty hand is holding your specific shoe. The shoe will look real and stay consistent, and the hand will grip it naturally instead of floating above it.

Summary

Old Way: The AI forgets the object halfway through the video or pastes it on like a sticker.
GenHOI Way: It uses a "Sliding Spotlight" to keep the object's memory fresh throughout the whole video, and a "Smart Mask" to ensure the object only affects the hand, not the background.

It's like giving the AI a pair of glasses that keeps the object it needs to hold in sharp focus while ignoring everything else, so the object looks the same from the first frame to the last.

1. Problem Statement

Hand-Object Interaction (HOI) is a critical challenge in digital human video synthesis, particularly for applications like e-commerce and online education. The core difficulty lies in generating videos where a human hand interacts with an object while satisfying two conflicting requirements:

Physical Plausibility: The interaction must look natural, with correct contact and occlusion.
Object Identity Consistency: The object's visual attributes (color, texture, logo) must remain consistent across all frames.

Current Limitations:

Specialized HOI Models: Existing methods (e.g., HOI-Swap, Re-HOLD) are often trained on in-domain data and fail to generalize to "in-the-wild" scenarios with complex backgrounds and diverse object shapes.
All-in-One Video Editors: General video editing models (e.g., VACE) trained on massive internet data show better generalization but struggle with HOI-specific tasks. They frequently suffer from temporal object decay (the object changes appearance or disappears over time) and spatial inconsistency (the object blends poorly with the background or fails to interact naturally).

2. Methodology

The authors propose GenHOI, a lightweight augmentation module designed to plug into pre-trained video generation models (specifically DiT-based architectures like Wan-14B). Instead of retraining the entire model, GenHOI injects reference-object information in a temporally balanced and spatially selective manner.

The framework consists of three main components:

A. HOI Condition Unit (HCU)

This module adapts the input stream for the HOI task without adding new network branches.

Input Construction: It takes the source video, a binary mask defining the interaction region, and the reference object image.
Inpainting Formulation: It reformulates the task as a video inpainting problem. The reference video ($V_r$) is constructed by masking out the interaction region in the source video and filling it with a constant value (or the reference image in specific baselines), while the mask guides the model on where to generate.
Latent Concatenation: The noisy target latent, the reference video latent, and the processed mask are concatenated along the channel dimension to form the input for the Diffusion Transformer (DiT).
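
The input construction described above can be sketched on toy data. This is a minimal illustration, not the authors' code: the helper names `mask_out_region` and `concat_channels`, the nested-list tensor layout, and the one-channel shapes are all assumptions, and the VAE encoding that would normally turn the video and mask into latents is skipped.

```python
from typing import List

# Toy tensors as nested lists, channel-first: [C][T][H][W].
Plane = List[List[float]]      # H x W
Channel = List[Plane]          # T x H x W
Tensor = List[Channel]         # C x T x H x W


def mask_out_region(video: Tensor, mask: Channel, fill: float = 0.0) -> Tensor:
    """Build V_r: keep pixels where mask == 0, write `fill` where mask == 1."""
    return [
        [
            [[fill if m else v for v, m in zip(v_row, m_row)]
             for v_row, m_row in zip(v_plane, m_plane)]
            for v_plane, m_plane in zip(channel, mask)
        ]
        for channel in video
    ]


def concat_channels(*tensors: Tensor) -> Tensor:
    """Concatenate channel-first tensors along the channel axis."""
    out: Tensor = []
    for t in tensors:
        out.extend(t)
    return out


# One channel, one frame, 2x2 pixels; the top-right pixel is the interaction region.
source = [[[[1.0, 2.0], [3.0, 4.0]]]]
mask = [[[0.0, 1.0], [0.0, 0.0]]]
noisy_target = [[[[0.5, 0.5], [0.5, 0.5]]]]   # placeholder for the noisy latent

v_r = mask_out_region(source, mask)                     # hypothetical stand-in for the V_r latent
dit_input = concat_channels(noisy_target, v_r, [mask])  # 1 + 1 + 1 = 3 channels
```

Because the three inputs are stacked along channels rather than fed through a separate branch, the transformer's token count is unchanged; only its input projection has to accept the wider channel dimension.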

B. Head-Sliding RoPE (Temporal Balancing)

Standard Rotary Positional Embeddings (RoPE) assign a fixed temporal offset to reference tokens (often frame -1). This causes temporal decay, where the reference object's influence is strong in early frames but vanishes in later frames.

Mechanism: GenHOI introduces Head-Sliding RoPE. Instead of assigning a single fixed frame index to all reference tokens, it assigns head-specific temporal offsets.
Effect: The frame index for reference tokens "slides" across different attention heads. This distributes the influence of the reference object evenly across the entire video duration, mitigating the temporal decay of 3D RoPE and ensuring long-range object consistency.
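
To make the idea of head-specific offsets concrete, here is a small sketch. The even spacing of offsets across the clip is an assumption for illustration (the summary does not state the exact schedule), and the function names, the head count of 5, and the 21 latent frames are hypothetical choices.

```python
from typing import List


def fixed_ref_indices(num_heads: int, index: float = -1.0) -> List[float]:
    """Baseline: every head places the reference tokens at the same frame index."""
    return [index] * num_heads


def head_sliding_ref_indices(num_heads: int, num_frames: int) -> List[float]:
    """Assumed schedule: reference frame indices evenly spaced over the clip, one per head."""
    if num_heads == 1:
        return [(num_frames - 1) / 2]
    step = (num_frames - 1) / (num_heads - 1)
    return [h * step for h in range(num_heads)]


def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE rotation angles for one position: position * base^(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def nearest_ref_distance(frame: int, ref_indices: List[float]) -> float:
    """Temporal distance from a video frame to the closest head-specific reference index."""
    return min(abs(frame - r) for r in ref_indices)


num_heads, num_frames = 5, 21          # illustrative sizes only
sliding = head_sliding_ref_indices(num_heads, num_frames)
fixed = fixed_ref_indices(num_heads)

last = num_frames - 1
far_fixed = nearest_ref_distance(last, fixed)       # every head sees the reference 21 frames away
far_sliding = nearest_ref_distance(last, sliding)   # at least one head sees it at distance 0
angles_per_head = [rope_angles(p, dim=8) for p in sliding]
```

With a fixed index, the temporal gap between the reference and the last frame grows with clip length in every head. With sliding indices, some head always places the reference near any given frame, which is the balancing effect described above.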

C. Spatial Attention Gate (Spatial Selectivity)

To prevent the reference object from corrupting the background or failing to interact with the hand, a two-level gate is applied:

Hard Mask Gate (HMG): A binary mask enforces strict information flow rules.
- HOI-region queries can attend to reference tokens.
- Background queries are blocked from attending to reference tokens (preventing background artifacts).
- Reference tokens do not attend back to video keys (preventing self-contamination).
Soft Flow Gate (SFG): A learnable, token-wise gating mechanism (using LayerNorm, FC, and Sigmoid) that adaptively scales the strength of the injected information. It amplifies informative regions and suppresses redundant responses, ensuring the background remains realistic while the interaction is enhanced.
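
The two gate levels can be sketched as follows. This is a hedged illustration: the token ordering (video tokens first, then reference tokens), the choice to allow reference-to-reference attention, the scalar-per-token gate, and the names `hard_mask_gate` and `soft_flow_gate` are assumptions, and the weights are made-up numbers rather than learned values.

```python
import math
from typing import List, Sequence


def hard_mask_gate(is_hoi: Sequence[bool], num_ref: int) -> List[List[bool]]:
    """allowed[q][k] is True when query q may attend to key k.

    Token order: video tokens first, then reference tokens.
    """
    n_vid = len(is_hoi)
    n = n_vid + num_ref
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_ref, k_is_ref = q >= n_vid, k >= n_vid
            if q_is_ref:
                # Reference queries never read video keys; ref-to-ref is assumed allowed.
                allowed[q][k] = k_is_ref
            elif k_is_ref:
                # Only HOI-region video queries may read reference keys.
                allowed[q][k] = bool(is_hoi[q])
            else:
                # Video-to-video attention is left untouched.
                allowed[q][k] = True
    return allowed


def soft_flow_gate(token: Sequence[float], weight: Sequence[float],
                   bias: float = 0.0, eps: float = 1e-5) -> float:
    """Scalar gate in (0, 1): Sigmoid(FC(LayerNorm(token)))."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    normed = [(x - mean) / math.sqrt(var + eps) for x in token]
    logit = sum(w * x for w, x in zip(weight, normed)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two video tokens (first inside the HOI region, second in the background) plus one reference token.
mask = hard_mask_gate(is_hoi=[True, False], num_ref=1)
gate = soft_flow_gate([0.2, -1.0, 0.8], weight=[0.5, -0.3, 0.1])
```

In an attention layer, entries where `allowed` is False would typically be set to negative infinity before the softmax, and the gate value would multiply the reference-derived update for that token before it is added back to the video stream.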

3. Key Contributions

Head-Sliding RoPE: A novel positional embedding strategy that balances the influence of reference tokens across time, mitigating the temporal decay that degrades object consistency in long video generation.
Two-Level Spatial Attention Gate: A mechanism that combines hard masking and soft scaling to concentrate object conditioning strictly on interaction regions, preserving background fidelity.
Lightweight Augmentation: The method adds only ~157M parameters (approx. 0.95% of the base 16.5B model), allowing it to leverage the generalization capabilities of large pre-trained models while specializing in HOI.
State-of-the-Art Performance: Demonstrated superior generalization on unseen, in-the-wild datasets compared to both specialized HOI reenactment models and general all-in-one video editors.

4. Experimental Results

The method was evaluated on the AnchorCrafter HOI dataset (50 self-reenactment and 50 cross-reenactment videos) using both quantitative metrics and user studies.

Quantitative Performance:
- Short Videos (81 frames): GenHOI achieved a PSNR of 31.71 (vs. 28.60 for VACE) and an Object CLIP score of 0.937, significantly outperforming VACE, UniAnimate-DiT, and HOI-Swap.
- Long Videos (401 frames): The performance gap widened, with GenHOI achieving a PSNR of 30.69 and an FVD of 42.17, compared to VACE's 26.32 PSNR and 195.9 FVD.
User Study:
- GenHOI scored 4.636/5 on Reference Fidelity and 4.487/5 on Video Quality, well above the next best competitor (VACE: 2.79 and 3.94, respectively).
Qualitative Analysis:
- The model successfully handles deformable objects, dynamic physics, complex occlusions, and object rotation.
- It demonstrates robustness to mask precision variations and does not strictly require a perfectly edited first frame, relying instead on the reference image for texture.
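
As an aid to reading the PSNR figures in the quantitative results, the sketch below shows the standard PSNR definition and what a gap in decibels means in terms of mean squared error. The peak value of 255 and the helper names are assumptions; the summary does not state which pixel range the evaluation used.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_gap(gap_db: float) -> float:
    """How many times larger the lower-PSNR method's MSE is, for a given dB gap."""
    return 10.0 ** (gap_db / 10.0)


# Short-video PSNR gap reported above: 31.71 dB (GenHOI) vs 28.60 dB (VACE).
short_gap = 31.71 - 28.60
short_ratio = mse_ratio_from_gap(short_gap)   # roughly a 2x difference in MSE
```

Because PSNR is logarithmic, a gap of about 3 dB corresponds to roughly half the mean squared error, so the reported differences are larger than the raw numbers may suggest.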

5. Significance

GenHOI represents a significant step forward in controllable video generation. By injecting object identity in a temporally balanced and spatially selective way, it addresses the "consistency vs. realism" trade-off that affects current video editing models.

Practical Impact: It enables cost-effective production of high-quality commercial videos (e.g., e-commerce product demos) where specific products must be interacted with by digital humans in diverse, real-world-like settings.
Technical Insight: The paper provides a blueprint for how to modify large-scale foundation models for specific, fine-grained tasks (like HOI) without sacrificing their broad generalization capabilities, using lightweight, targeted architectural interventions.