GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

GenHOI is a lightweight augmentation for pretrained video generation models that enhances object-consistent hand-object interaction in in-the-wild scenarios by employing Head-Sliding RoPE for temporally balanced reference injection and a two-level spatial attention gate for selective focus on interaction regions.

Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang

Published 2026-03-09

Imagine you are a director trying to film a scene where an actor picks up a specific product—say, a unique, limited-edition coffee mug—and holds it while talking.

In the real world, this is easy. But in the digital world of AI video generation, it's a nightmare. If you ask a standard AI to "make a video of a person holding a mug," it usually does one of two things:

  1. The "Glitchy" Approach: The mug looks great in the first second, but by the third second, it starts melting, changing color, or turning into a different mug entirely.
  2. The "Sticker" Approach: The AI pastes a picture of the mug onto the hand, but the hand just floats above it. There's no real interaction; the mug doesn't look like it's actually being held.

GenHOI is a new tool designed to fix exactly this problem. Think of it as a "smart editor" that plugs into existing video AI models to teach them how to handle objects correctly.

Here is how it works, using some simple analogies:

1. The Problem: The "Fading Memory"

Standard AI models have a short attention span. When they look at a reference image (the "mug" you want to use) at the start of the video, they remember it well for the first few frames. But as the video plays, that memory fades. By the end, the AI forgets what the mug looked like and starts hallucinating a new one.

2. The Solution: GenHOI's Two Superpowers

GenHOI adds two special "brain upgrades" to the AI to solve this.

Upgrade A: The "Sliding Spotlight" (Head-Sliding RoPE)

  • The Analogy: Imagine you are trying to remember a song. If you only listen to the first note once, you'll forget it by the end of the song. But if you have a chorus that repeats and shifts slightly throughout the song, you remember it perfectly.
  • How it works: Usually, the AI looks at the reference object (the mug) as if it only exists at "Time Zero." GenHOI changes this. It takes the information about the mug and "slides" it across the timeline of the video, spreading it out evenly.
  • The Result: Instead of the AI forgetting the mug after 2 seconds, the "memory" of the mug is refreshed in every single frame. The mug stays the same color, shape, and logo from start to finish, even in a long video.
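
To make the "sliding" idea concrete, here is a small illustrative sketch. It is not the authors' code. It assumes that each attention head places the reference tokens at a different temporal position and that those positions are spaced evenly across the video. The helper names, the spacing rule, and the head and frame counts are all made up for the example.

```python
from typing import List


def head_sliding_positions(num_heads: int, num_frames: int) -> List[int]:
    """Give each attention head its own temporal position for the reference
    tokens, spaced evenly over frame indices 0 .. num_frames - 1.
    (Hypothetical helper; the even-spacing rule is an assumption.)"""
    if num_heads < 1 or num_frames < 1:
        raise ValueError("num_heads and num_frames must be positive")
    if num_heads == 1:
        return [0]
    step = (num_frames - 1) / (num_heads - 1)
    return [round(h * step) for h in range(num_heads)]


def worst_case_distance(ref_positions: List[int], num_frames: int) -> int:
    """Largest gap, over all frames, between a frame and the nearest
    temporal position at which some head sees the reference."""
    return max(
        min(abs(t - p) for p in ref_positions) for t in range(num_frames)
    )


NUM_HEADS, NUM_FRAMES = 5, 81  # illustrative sizes, not from the paper

baseline = [0] * NUM_HEADS  # every head pins the reference at "Time Zero"
sliding = head_sliding_positions(NUM_HEADS, NUM_FRAMES)

print(sliding)                                    # [0, 20, 40, 60, 80]
print(worst_case_distance(baseline, NUM_FRAMES))  # 80
print(worst_case_distance(sliding, NUM_FRAMES))   # 10
```

Rotary position embeddings make attention depend on relative position, so in this toy setup the last frame is 80 steps away from the reference when every head sits at Time Zero, but at most 10 steps away from some head once the positions are spread out.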

Upgrade B: The "Smart Mask" (Spatial Attention Gate)

  • The Analogy: Imagine you are painting a picture of a person holding a cup. You want to be very careful and detailed when painting the cup and the hand holding it. But when you paint the background (the wall or the table), you don't want to accidentally smear the cup's design onto the wall.
  • How it works: GenHOI puts up a digital "fence."
    • Inside the fence (The Hand & Object): The AI is allowed to look at the reference photo and copy the mug's details perfectly.
    • Outside the fence (The Background): The AI is strictly told, "Do not look at the mug photo here." It must rely on the original video for the background.
  • The Result: The hand and mug look realistic and stay consistent, while the background stays free of the artifacts and color shifts that would appear if the mug's details leaked into it.
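
The "fence" can be pictured as a per-token gate that multiplies whatever the reference image contributes. The sketch below is a simplified, single-level illustration, not the paper's two-level gate. It assumes the interaction region is given as a bounding box on a small token grid, and the function names, grid size, and feature values are all hypothetical.

```python
from typing import List, Tuple


def box_gate(height: int, width: int,
             box: Tuple[int, int, int, int]) -> List[float]:
    """Flattened 0/1 gate over a height x width token grid.
    box = (top, left, bottom, right), with bottom/right exclusive."""
    top, left, bottom, right = box
    return [
        1.0 if top <= r < bottom and left <= c < right else 0.0
        for r in range(height)
        for c in range(width)
    ]


def gated_injection(base: List[List[float]],
                    ref: List[List[float]],
                    gate: List[float]) -> List[List[float]]:
    """Per token: output = base + gate * ref.
    Where gate is 0, the reference contributes nothing."""
    return [
        [b + g * r for b, r in zip(base_tok, ref_tok)]
        for base_tok, ref_tok, g in zip(base, ref, gate)
    ]


HEIGHT, WIDTH = 2, 3  # toy token grid
gate = box_gate(HEIGHT, WIDTH, box=(0, 1, 2, 3))  # hand/object in columns 1-2

base = [[1.0, 1.0] for _ in range(HEIGHT * WIDTH)]  # features from the video
ref = [[0.5, -0.5] for _ in range(HEIGHT * WIDTH)]  # reference contribution

out = gated_injection(base, ref, gate)
print(gate)    # [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
print(out[0])  # [1.0, 1.0]  background token: untouched
print(out[1])  # [1.5, 0.5]  token inside the fence: reference injected
```

A real gate would more plausibly be soft (values between 0 and 1) and applied inside the attention layers, but the effect is the same in spirit: background tokens never receive the mug's features.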

3. Why This Matters

Before GenHOI, if you wanted to make a video for an e-commerce site showing a model holding your new shoe, you'd have to film it in a studio with perfect lighting.

With GenHOI, you can take a video of a person walking down the street (even if the lighting is messy) and generate a version in which they hold your specific shoe. The shoe will look real and stay consistent, and the hand will grip it naturally instead of floating above it.

Summary

  • Old Way: The AI forgets the object halfway through the video or pastes it on like a sticker.
  • GenHOI Way: It uses a "Sliding Spotlight" to keep the object's memory fresh throughout the whole video, and a "Smart Mask" to ensure the object only affects the hand, not the background.

It's like giving the AI a pair of glasses that keeps it focused on the object in the hand while ignoring everything else, so the object looks the same from the first frame to the last.