DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction videos. It combines sparse motion guidance (wrist coordinates and object bounding boxes), an object-stressed attention mechanism, and a multi-task auxiliary training strategy to overcome limitations in flexibility, generalization, and data scarcity.

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang

Published Wed, 11 Ma

Imagine you are a movie director. You have a script (a text prompt) and you want to film a scene where an actor picks up a specific prop, like a coffee mug, and takes a sip.

In the past, asking a computer to do this was like giving a vague instruction to a chaotic intern: "Make a video of a guy drinking coffee." The computer might give you a guy drinking a giant soda, or a guy floating in space, or a guy whose hand passes right through the mug like a ghost. It was hard to control exactly what happened.

Other methods tried to fix this by giving the computer a "template" video (a reference film) and asking it to copy the movements. But this was like trying to paint a new picture by only being allowed to trace over an old one. You couldn't easily swap the coffee mug for a toaster, or change the actor's movements without breaking the whole video.

Enter DISPLAY.

The paper introduces DISPLAY, a new AI system that acts like a super-smart, obedient director's assistant. It solves the problem of making videos where humans interact with objects in a way that looks real, feels physical, and follows your exact instructions.

Here is how it works, broken down into simple concepts:

1. The "Sparse Motion Guidance" (The Minimalist Sketch)

Most previous systems tried to control the video by drawing a complex skeleton for the whole body and a 3D map for the object. It was like trying to direct a play by writing a 50-page script for every single muscle twitch.

DISPLAY takes a different approach. It uses Sparse Motion Guidance.

  • The Analogy: Imagine you are directing a puppet show. Instead of telling the puppeteer exactly how to move every finger, you just give them two simple instructions:
    1. Where the wrist goes: You draw a simple line on the screen showing where the actor's hand should move (start here, end there).
    2. Where the object is: You draw a simple box around where the object (like the mug) should be.
  • Why it's cool: This is "shape-agnostic." It doesn't matter if the object is a round mug, a flat iPad, or a weirdly shaped banana. The AI just knows, "The hand goes to the box." It fills in the rest of the details (how the fingers curl, how the light hits the object) on its own. This gives you total freedom to swap objects without retraining the AI.
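To make the "minimalist sketch" idea concrete, here is a small sketch of how sparse controls like these could be rasterized into per-frame guidance maps for a video model. This is an illustration under assumptions, not the paper's implementation: the function name `render_guidance_maps` and the two-channel layout (wrist point, object box) are hypothetical.

```python
import numpy as np

def render_guidance_maps(wrist_xy, obj_boxes, height, width):
    """Rasterize sparse per-frame controls into guidance maps.

    wrist_xy:  (T, 2) array of wrist pixel coordinates per frame
    obj_boxes: (T, 4) array of object boxes (x1, y1, x2, y2) per frame
    Returns a (T, 2, H, W) array: channel 0 marks the wrist point,
    channel 1 fills the object box. Note nothing here depends on the
    object's shape -- only on where its box is (shape-agnostic).
    """
    T = len(wrist_xy)
    maps = np.zeros((T, 2, height, width), dtype=np.float32)
    for t in range(T):
        x, y = wrist_xy[t].astype(int)
        if 0 <= y < height and 0 <= x < width:
            maps[t, 0, y, x] = 1.0  # wrist keypoint channel
        x1, y1, x2, y2 = obj_boxes[t].astype(int)
        maps[t, 1, max(y1, 0):min(y2, height), max(x1, 0):min(x2, width)] = 1.0
    return maps

# Two frames: the hand moves toward a box that stays put.
wrist = np.array([[10.0, 20.0], [30.0, 40.0]])
boxes = np.array([[25, 35, 45, 55], [25, 35, 45, 55]])
g = render_guidance_maps(wrist, boxes, height=64, width=64)
print(g.shape)  # (2, 2, 64, 64)
```

Because the object is only ever described by a box, swapping a mug for a toaster changes the reference image but not the control signal.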

2. The "Object-Stressed Attention" (The Object's Bodyguard)

When you ask an AI to generate a video, it often gets distracted. It focuses so much on the human's face or the background that the object they are holding looks weird, melts, or disappears.

DISPLAY introduces a mechanism called Object-Stressed Attention.

  • The Analogy: Imagine the AI is a student taking a test. Usually, the student ignores the difficult questions about the "object" and just guesses. Object-Stressed Attention is like a strict teacher who taps the student on the shoulder and says, "Hey, look at the object! Make sure it stays solid, looks like the reference photo, and doesn't pass through the hand."
  • The Result: The AI pays extra attention to the object, ensuring it looks real and interacts physically correctly with the human hand.
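One simple way to picture "stressing" the object is as an additive bias on attention logits, so every query token spends more of its attention budget on tokens inside the object region. The sketch below shows that generic idea with plain NumPy; the function name, the `stress` parameter, and the single-head layout are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def object_stressed_attention(q, k, v, object_mask, stress=2.0):
    """Scaled dot-product attention with an additive bias on object tokens.

    q, k, v:     (N, d) token matrices
    object_mask: (N,) boolean, True where a key token falls inside the
                 object's bounding box
    stress:      bias added to the logits of object keys, pulling every
                 query to attend more to the object region
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)             # (N, N) attention logits
    logits = logits + stress * object_mask    # boost columns of object tokens
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
mask = np.array([False, False, True, True])  # last two tokens are "object"
out = object_stressed_attention(q, k, v, mask)
print(out.shape)  # (4, 8)
```

With `stress=0` this reduces to ordinary attention; raising it shifts attention mass toward the object tokens, which is the "tap on the shoulder" in the analogy.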

3. The "Multi-Task Auxiliary Training" (The Intern's Internship)

The biggest problem with teaching AI to do Human-Object Interaction (HOI) is that there aren't enough high-quality videos of people holding specific things. It's like trying to teach a chef to make a specific rare dish, but you only have 10 examples of that dish in the whole world.

DISPLAY solves this with Multi-Task Auxiliary Training.

  • The Analogy: Instead of only studying the 10 rare dishes, the AI goes to a massive culinary school where it learns to cook everything. It learns to chop vegetables, boil water, and plate food (general human motion).
  • The Strategy: The system trains on a mix of:
    1. The rare, perfect videos of people holding objects (the specific task).
    2. Thousands of videos of people just moving around, without specific objects (the general task).
  • The Result: By learning the general rules of movement from the huge dataset, the AI becomes much better at the specific task of holding objects, even when it hasn't seen that exact object before.
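The mixing strategy above can be sketched as a batch sampler that draws from both pools at a fixed ratio. This is a minimal illustration, assuming a hypothetical `mixed_batch` helper and a `hoi_ratio` knob; the paper's actual sampling scheme and loss weighting are not shown here.

```python
import random

def mixed_batch(hoi_data, general_data, batch_size=8, hoi_ratio=0.25):
    """Sample a training batch mixing scarce HOI clips with
    abundant general-motion clips.

    hoi_ratio controls what fraction of the batch comes from the
    small, fully annotated HOI set; the rest comes from the large
    general dataset, which only supervises the shared motion task.
    """
    n_hoi = max(1, int(batch_size * hoi_ratio))
    batch = [(clip, "hoi") for clip in random.choices(hoi_data, k=n_hoi)]
    batch += [(clip, "general")
              for clip in random.choices(general_data, k=batch_size - n_hoi)]
    random.shuffle(batch)  # avoid the model seeing tasks in a fixed order
    return batch

hoi = [f"hoi_clip_{i}" for i in range(10)]          # the "10 rare dishes"
general = [f"motion_clip_{i}" for i in range(1000)]  # the culinary school
batch = mixed_batch(hoi, general)
print(len(batch))  # 8
```

Each batch then carries a task tag, so the auxiliary (general-motion) loss can be applied to most samples while the full HOI objective is reserved for the rare annotated clips.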

What Can You Actually Do With It?

The paper shows three main superpowers:

  1. Object Replacement: You have a video of a guy holding an iPad. You want him to hold a toaster instead. You upload a picture of a toaster, and DISPLAY swaps the iPad for the toaster, making the hands grip it naturally.
  2. Object Insertion: You have a video of a guy sitting at an empty table. You want him to pick up a mug that wasn't there before. You draw a box where the mug should be and a line for his hand to reach it. The AI invents the mug and the interaction from scratch.
  3. Environmental Interaction: You have a video of a guy walking past a chair. You want him to stop and sit down. You draw a path for his hand to touch the chair, and the AI animates the whole sitting motion.

The Bottom Line

DISPLAY is like giving a director a magic wand. Instead of needing a full script, a 3D model, and a template video, you just need a simple sketch of where the hand goes and a picture of the object. The AI fills in the physics, the lighting, and the realistic details, allowing you to create high-quality, controllable videos of people interacting with the world around them.