ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

The paper proposes ParTY, a framework for text-to-motion synthesis built on three components: part-guided networks, part-aware text grounding, and holistic-part fusion. Together they address a weakness of existing methods: aligning specific body-part actions with the text while keeping the full-body motion coherent.

KunHo Heo, SuYeon Kim, Yonghyun Gwon, Youngbin Kim, MyeongAh Cho

Published Wed, 11 Ma

Imagine you are a director trying to tell a computer to act out a scene. You say, "The actor walks forward, kicks a ball with their right foot, and waves with their left hand."

In the past, computers trying to do this had two main problems:

  1. The "Blurry Photo" Problem (Holistic Methods): Some computers treated the human body like a single, giant blob. They were great at making the whole person move smoothly (like a dance), but if you asked for a specific kick, they often just made the whole person stumble or forgot which foot to use. They couldn't focus on the details.
  2. The "Frankenstein" Problem (Part-Wise Methods): Other computers tried to fix this by building the body piece by piece. They generated the arm movement, then the leg movement, and then the torso movement separately. The problem? When they stitched these pieces together, the result looked weird. The arms might be waving while the legs were running in the opposite direction, or the neck might twist at an impossible angle. The parts didn't talk to each other.

Enter ParTY: The "Conductor" of the Body.

The paper introduces a new system called ParTY (Part-Guidance for Expressive Text-to-Motion Synthesis). Think of ParTY not as a builder, but as a conductor leading an orchestra.

Here is how ParTY works, using simple analogies:

1. The "Translator" (Part-aware Text Grounding)

Imagine you have a complex instruction: "Kick with the right foot."

  • Old computers heard "Kick" and made the whole body jump.
  • ParTY's Translator is like a smart assistant who breaks that sentence down. It says, "Okay, the Right Leg needs to swing hard. The Left Leg needs to stand still. The Arms need to balance."
  • It takes your text and creates specific "scripts" for each body part, ensuring the right foot knows exactly what to do while the left foot knows to stay put.
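
As a rough illustration of the idea only, the toy Python sketch below splits an instruction into per-part "scripts" using a hand-written keyword table. This is not the paper's method: a learned model would do this grounding, and the part names, the PART_KEYWORDS table, and the ground_text function are all hypothetical.

```python
import re

# Hypothetical keyword table: which phrases hint at which body part.
PART_KEYWORDS = {
    "right_leg": ["right foot", "right leg"],
    "left_leg": ["left foot", "left leg"],
    "right_arm": ["right hand", "right arm"],
    "left_arm": ["left hand", "left arm"],
}

def ground_text(instruction):
    """Assign each clause of the instruction to the body parts it mentions."""
    # Split on commas and the standalone word "and" to get rough clauses.
    pieces = re.split(r",|\band\b", instruction.lower())
    clauses = [p.strip() for p in pieces if p.strip()]
    scripts = {part: [] for part in PART_KEYWORDS}
    for clause in clauses:
        for part, keywords in PART_KEYWORDS.items():
            if any(k in clause for k in keywords):
                scripts[part].append(clause)
    return scripts

scripts = ground_text(
    "The actor walks forward, kicks a ball with their right foot, "
    "and waves with their left hand."
)
```

Here the right leg receives the kicking clause and the left arm receives the waving clause, while the parts that are never mentioned get an empty script. A keyword table cannot infer implied behavior (such as the arms balancing during a kick), which is one reason a learned grounding step is needed in practice.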

2. The "Rehearsal" (Part-Guided Network)

This is the secret sauce.

  • Older methods either built the whole body at once (the "Blurry Photo" approach) or built the parts in isolation and glued them together (the "Frankenstein" approach).
  • ParTY does a rehearsal. First, it quickly acts out just the parts (the right leg kicking, the left arm waving) for a few seconds. It doesn't show this to the audience yet; it's just a "guide."
  • Then, it uses this rehearsal as a map to generate the full-body motion. Because the conductor (the system) has already seen how the leg should move, it can guide the rest of the body to move in sync with it. It's like a dance instructor doing a move slowly first, then having the whole class follow along perfectly.
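
The two-stage flow described above can be sketched in a few lines of Python. This is a toy stand-in under loud assumptions: real systems use neural networks for both stages, and the function names, the sine-wave "activity" curves, and the averaged torso value are all invented here purely to show the order of operations (parts first, full body second).

```python
import math

def generate_part_guidance(scripts, num_frames):
    """Stage 1 (toy): one guidance value per part per frame.

    A part with a script gets a sine-shaped activity curve;
    a part with no script stays at 0.
    """
    guidance = {}
    for part, script in scripts.items():
        if script:
            guidance[part] = [
                math.sin(math.pi * t / (num_frames - 1)) for t in range(num_frames)
            ]
        else:
            guidance[part] = [0.0] * num_frames
    return guidance

def generate_full_body(guidance, num_frames):
    """Stage 2 (toy): build each full-body frame conditioned on the guidance."""
    frames = []
    for t in range(num_frames):
        frame = {part: curve[t] for part, curve in guidance.items()}
        # The torso reacts to the limbs rather than being generated in isolation.
        frame["torso"] = sum(frame.values()) / len(guidance)
        frames.append(frame)
    return frames

# Hypothetical per-part scripts for "kick with the right foot, wave with the left hand".
scripts = {"right_leg": ["kick"], "left_arm": ["wave"], "left_leg": [], "right_arm": []}
guidance = generate_part_guidance(scripts, num_frames=5)
motion = generate_full_body(guidance, num_frames=5)
```

The point of the sketch is the data flow: the guidance is produced first and never shown on its own, and the full-body stage reads it at every frame, so the torso value depends on what the limbs are doing.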

3. The "Glue" (Holistic-Part Fusion)

Even with a rehearsal, you need to make sure the dancers stay connected.

  • ParTY has a special "glue" mechanism. As it generates the full-body motion, it constantly checks: "Hey, the arm is moving here, so the torso needs to lean this way to balance it."
  • It fuses the specific part actions with the overall flow, ensuring that when the right foot kicks, the left arm swings naturally to counterbalance it. No more twisted necks or floating limbs.
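
This summary does not describe how the fusion is implemented. One generic way to blend a whole-body feature with per-part features is attention-style weighting, sketched below in plain Python. Treat it as an assumption for illustration, not as the paper's module: the fuse function, the two-dimensional feature vectors, and the residual sum are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(holistic, part_features):
    """Blend a whole-body feature with per-part features via attention weights."""
    names = list(part_features)
    # Parts whose features agree more with the whole-body feature get more weight.
    weights = softmax([dot(holistic, part_features[n]) for n in names])
    fused = list(holistic)
    for w, n in zip(weights, names):
        fused = [f + w * p for f, p in zip(fused, part_features[n])]
    return fused, dict(zip(names, weights))

# Made-up feature vectors for illustration only.
holistic = [1.0, 0.0]
parts = {"right_leg": [2.0, 0.0], "left_arm": [0.0, 2.0]}
fused, weights = fuse(holistic, parts)
```

In this toy case the right leg's feature points the same way as the whole-body feature, so it receives the larger weight, and the fused vector keeps the whole-body signal while absorbing detail from both parts.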

The Result?

ParTY solves the trade-off.

  • Before: You had to choose between a smooth, coherent dance (but with wrong moves) OR a specific, detailed move (but with a broken, disjointed body).
  • Now: You get both. The actor kicks the ball with the exact right foot, waves with the exact left hand, and the whole body moves as a single, natural, coordinated human.

Why is this a big deal?

The authors also introduce new ways to measure success. Earlier evaluations could only measure whether the whole motion looked good. The new metrics also ask:

  • Did the right foot actually kick? (Part-Text Alignment)
  • Did the body look like a real human, or a glitchy robot? (Coherence)
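
The summary does not say how these scores are computed. A common way to score how well a motion matches a piece of text is cosine similarity between embeddings of the two, and the sketch below assumes that approach for the part-level question only. The three-dimensional vectors are made up, and a real evaluation would use embeddings from trained text and motion encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Made-up embeddings for illustration only.
text_right_leg = [1.0, 0.0, 0.0]   # text: "kick with the right foot"
motion_kick = [0.9, 0.1, 0.0]      # right-leg motion that does kick
motion_idle = [0.0, 0.0, 1.0]      # right-leg motion that stands still

score_kick = cosine_similarity(text_right_leg, motion_kick)
score_idle = cosine_similarity(text_right_leg, motion_idle)
```

A per-part score like this rewards a motion in which the right leg actually kicks and penalizes one in which it stays idle, which a single whole-body score could easily miss.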

The paper reports that ParTY comes out ahead on both. It's like finally teaching a computer to understand that a human body is a team of parts working together, rather than a single blob or a pile of disconnected Lego bricks.