Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

This paper presents a unified, physics-based framework that pairs Vision-Language Models with a novel Relative Movement Dynamics (RMD) representation to automatically generate reward functions. The result: scalable, long-horizon human-object interactions across diverse object types, with no manual reward engineering.

Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang

Published 2026-03-05

Imagine you are trying to teach a robot how to do chores, like picking up a laundry basket, walking to the washing machine, and then sitting down to rest.

In the past, teaching robots this stuff was like trying to teach a dog a complex trick by physically moving its legs for every single step. You needed hours of expensive video footage of humans doing the task (motion capture), or you had to act as a very strict, tired coach who manually wrote down a thousand rules like "if the hand touches the basket, move forward 2 inches." If the robot encountered a slightly different basket or a different room, it would get confused and crash.

This paper introduces a new way to teach robots, which they call VLM-Guided Motion Policy. Here is how it works, using some simple analogies:

1. The "Smart Director" (The VLM)

Instead of a human coach writing rules, the team uses a Vision-Language Model (VLM)—think of it as a super-smart movie director who has watched millions of movies and read every book on physics.

  • How it works: You show the robot a picture of the room and say, "Pick up the basket and put it by the washer."
  • The Magic: The "Director" doesn't just say "Go." It imagines the whole scene in its head. It breaks the task down into a storyboard: First, walk to the basket. Then, bend down. Then, grab the handle. Then, stand up while keeping the basket steady. Finally, walk to the washer.
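The storyboard idea above can be sketched as a small data structure. This is a hypothetical illustration of the kind of structured plan a VLM might emit for the laundry task; the field names and schema are my assumptions, not the paper's actual output format.

```python
# Hypothetical "storyboard" a VLM might produce when asked to decompose
# "Pick up the basket and put it by the washer." Schema is illustrative.
storyboard = [
    {"step": 1, "action": "walk_to",   "target": "basket"},
    {"step": 2, "action": "bend_down", "target": "basket"},
    {"step": 3, "action": "grasp",     "target": "basket_handle"},
    {"step": 4, "action": "stand_up",  "constraint": "keep basket level"},
    {"step": 5, "action": "walk_to",   "target": "washer", "carrying": "basket"},
]

# Each step becomes a subtask the motion policy is trained to complete in order.
for step in storyboard:
    print(step["step"], step["action"])
```

The key point is that the plan is a sequence of subtasks, not a single monolithic goal, which is what makes long-horizon tasks tractable.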

2. The "Dance Map" (Relative Movement Dynamics - RMD)

This is the paper's biggest innovation. In the past, robots were told to move their "hand" to a specific "spot." But holding a basket is tricky; your hand, your arm, your hips, and your feet all have to move in a specific relationship to the basket.

The authors created a system called RMD (Relative Movement Dynamics).

  • The Analogy: Imagine a dance floor. Instead of telling the robot "Move your left foot to coordinate X," the Director draws a map of relationships.
    • Rule 1: "Your hands must stay glued to the basket (Distance = 0)."
    • Rule 2: "Your hips must move closer to the basket as you bend (Distance = Getting Smaller)."
    • Rule 3: "Your feet must stay on the floor and not slide (Distance = Stable)."
  • Why it's cool: This map is flexible. If the basket is heavy, the robot knows its legs need to push harder to keep that "glued" relationship. If the basket is light, it can move faster. The robot learns the dance, not just the steps.
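A minimal sketch of how relationship rules like those three could be expressed in code. The body-part names, distance thresholds, and check structure here are illustrative assumptions, not the paper's actual RMD formulation:

```python
import math

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rmd_checks(pose, basket_pos, prev_hip_dist, floor_z=0.0):
    """Evaluate toy RMD-style rules: each relates a body part to the
    object (or floor), never to a fixed world coordinate."""
    return {
        # Rule 1: hands "glued" to the basket (near-zero distance).
        "hands_on_basket": dist(pose["hand"], basket_pos) < 0.05,
        # Rule 2: hips moving closer to the basket than last timestep.
        "hips_approaching": dist(pose["hip"], basket_pos) < prev_hip_dist,
        # Rule 3: feet stay flat on the floor (height ~ 0).
        "feet_on_floor": abs(pose["foot"][2] - floor_z) < 0.02,
    }

pose = {"hand": (1.0, 0.5, 0.8), "hip": (1.1, 0.4, 0.9), "foot": (1.0, 0.3, 0.0)}
basket = (1.0, 0.52, 0.81)
print(rmd_checks(pose, basket, prev_hip_dist=2.0))
```

Because every rule is phrased relative to the object, the same checks keep working when the basket moves, which is exactly why the representation transfers across scenes.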

3. The "Auto-Referee" (Automatic Rewards)

In robot training, the robot learns by getting points (rewards) for doing things right. Usually, humans have to decide what "right" looks like.

  • The Old Way: A human writes a rule: "If the robot drops the basket, -10 points."
  • The New Way: The "Director" (VLM) looks at the "Dance Map" (RMD) and automatically creates the scoring system. It tells the robot: "You get points for keeping your hand close to the basket while your feet move forward."
  • Result: The robot learns to move naturally because the "referee" understands the intent of the movement, not just the final position.
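To make the scoring idea concrete, here is a toy shaped reward in the spirit of the description above: a dense score derived from relative distances rather than a single pass/fail rule. The exponential shape and the weights are my assumptions, not the rewards the VLM actually generates.

```python
import math

def reward(hand_to_basket, forward_progress):
    """Toy RMD-derived reward: peaks when the hand stays near the
    basket AND the body keeps moving toward the goal."""
    # exp(-k * d) rewards small distances smoothly; it is maximal
    # (1.0) when the hand is exactly on the basket ("glued").
    contact_term = math.exp(-10.0 * hand_to_basket)
    # Reward fraction of the way walked toward the washer, clipped to [0, 1].
    progress_term = max(0.0, min(1.0, forward_progress))
    return 0.5 * contact_term + 0.5 * progress_term

print(reward(0.0, 1.0))   # hand glued, goal reached
print(reward(0.5, 0.0))   # hand far away, no progress
```

A dense, relationship-based score like this gives the policy a useful gradient at every timestep, instead of only a penalty at the moment the basket is dropped.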

4. The "New Playbook" (The Interplay Dataset)

To train this system, the authors created a massive new library of tasks called Interplay.

  • Think of this as a library of thousands of "chore scenarios." Some involve static objects (a chair you sit on), some involve moving objects (a box you push), and some involve complex machines (a washing machine with a door that opens).
  • This ensures the robot doesn't just learn to sit on one specific chair, but learns the concept of "sitting" on any chair in any room.
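The three object categories mentioned above could be organized roughly like this. The schema and example entries are hypothetical, meant only to show the kind of task variety involved, not the actual Interplay format:

```python
# Hypothetical Interplay-style task entries grouped by object category.
tasks = [
    {"object": "chair",  "type": "static",      "goal": "sit down"},
    {"object": "box",    "type": "dynamic",     "goal": "push to target"},
    {"object": "washer", "type": "articulated", "goal": "open the door"},
]

# Group objects by category to see the coverage the dataset aims for.
by_type = {}
for t in tasks:
    by_type.setdefault(t["type"], []).append(t["object"])
print(by_type)
```

Training across all three categories is what pushes the policy toward the general concept ("sitting") rather than one memorized instance ("this chair").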

The Result: A Natural Dancer

When they tested this system:

  • Old Robots: Often looked stiff, jittery, or fell over when they tried to stand up after sitting. They were like a puppet with tangled strings.
  • This New Robot: Moves smoothly. It grabs the basket, walks without dropping it, puts it down, and then stands up to walk away—all in one fluid motion. It looks like a human doing a chore, not a machine executing code.

Summary

This paper is about giving robots a creative imagination and a flexible understanding of relationships rather than just a rigid list of instructions. By using an AI "Director" to draw a "Dance Map" of how body parts relate to objects, the robot can learn to perform complex, long tasks (like doing laundry) naturally and without needing a human to micromanage every single movement.