ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

ExpGest is a novel diffusion-based framework that generates expressive, controllable full-body gestures from synchronized audio and text guidance. A specialized noise emotion classifier steers the generation, overcoming the stiff, upper-body-only movements that existing methods often produce.

Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Gaoge Han, Jifeng Ning, Wei Liu

Published Tue, 10 Ma

Imagine you are directing a movie, but instead of hiring a real actor, you are asking a computer to create a virtual character who can talk, walk, and emote. The problem is, most computer programs that do this are like robotic puppets. They might move their hands when the voice gets loud, but they look stiff, they don't understand what is being said, and they can't really "act" out a scene where the character walks in a circle while talking.

This paper introduces ExpGest, a new system that teaches computers to be much better actors. Think of it as upgrading from a wind-up toy to a full-blown method actor.

Here is how it works, broken down into simple concepts:

1. The "Two-Brain" Approach (Audio + Text)

Most old systems only listen to the music of the voice (the melody, the pitch, the speed). If you say "I'm angry!" in a calm, flat voice, the old robot might just shrug because the voice sounds calm.

ExpGest has two brains working together:

  • Brain A (The Musician): Listens to the sound of the voice (is it loud? is it fast?).
  • Brain B (The Reader): Reads the words being spoken (is the person saying "I'm furious" or "I'm happy"?).

By combining these, the computer understands that even if the voice is quiet, the words might be screaming with anger. This allows the character to make the right gestures, like clenching a fist, even if the voice isn't shouting.
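The two-brain idea can be sketched as combining an audio feature vector with a text feature vector into one conditioning signal. This is a toy illustration, not the paper's actual encoders: the function names, dimensions, and the simple weighted fusion are assumptions standing in for trained networks (the real model would use learned encoders and cross-attention).

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_audio_features(waveform, dim=64):
    # Stand-in for a real audio encoder (pitch, energy, rhythm):
    # here we just project the raw signal to a fixed-size vector.
    proj = rng.standard_normal((len(waveform), dim))
    return waveform @ proj

def extract_text_features(token_ids, dim=64):
    # Stand-in for a text encoder: average of token embeddings.
    table = rng.standard_normal((1000, dim))
    return table[token_ids].mean(axis=0)

def fuse(audio_vec, text_vec, w_audio=0.5, w_text=0.5):
    # Simple weighted fusion; a real system learns how to mix the two.
    return w_audio * audio_vec + w_text * text_vec

waveform = rng.standard_normal(16000)   # one second of placeholder audio
tokens = np.array([12, 87, 340])        # placeholder token ids for the words
cond = fuse(extract_audio_features(waveform), extract_text_features(tokens))
print(cond.shape)  # (64,) — one conditioning vector for the gesture model
```

The key point is that the downstream gesture generator sees a single signal carrying both the sound of the voice and the meaning of the words.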

2. The "Magic Paintbrush" (The Diffusion Model)

How does the computer actually move the character? It uses something called a Diffusion Model.

Imagine you have a clear, beautiful painting of a person dancing. Now, imagine slowly adding static noise (like TV snow) to it until the painting is completely unrecognizable white noise.

  • The Old Way: The computer tries to guess the next step of the dance based on a rigid rulebook.
  • The ExpGest Way: The computer starts with a blank canvas full of "TV snow" (random noise). It then slowly "denoises" that canvas, peeling away the static layer by layer, guided by your audio and text instructions, until a fluid motion sequence emerges from the chaos. It's like sculpting a statue out of fog.
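The denoising loop above can be sketched in a few lines. This is a generic DDIM-style sampler with a placeholder network, not ExpGest's trained model; the joint counts, step schedule, and `predict_noise` stub are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

N_JOINTS, N_FRAMES, STEPS = 24, 120, 50
betas = np.linspace(1e-4, 0.02, STEPS)     # noise schedule
alphas = np.cumprod(1.0 - betas)           # cumulative signal retention

def predict_noise(x_t, t, cond):
    # Placeholder for the trained denoiser epsilon_theta(x_t, t, cond);
    # the real network is conditioned on the audio/text features.
    return 0.1 * x_t

def sample(cond):
    # Start from pure "TV snow" and peel the noise away step by step.
    x = rng.standard_normal((N_FRAMES, N_JOINTS * 3))
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t, cond)
        # Estimate the clean motion hidden under the current noise level,
        # then step deterministically toward the next (less noisy) level.
        x0_hat = (x - np.sqrt(1 - alphas[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = np.sqrt(alphas[t - 1]) * x0_hat + np.sqrt(1 - alphas[t - 1]) * eps
        else:
            x = x0_hat
    return x  # a (frames, joint-coordinates) motion sequence

motion = sample(cond=None)
print(motion.shape)  # (120, 72)
```

Each pass through the loop removes one "layer of static," and the conditioning signal steers what the final motion looks like.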

3. The "Finger vs. Arm" Trick

The researchers noticed something interesting about how humans move:

  • Fingers are like sensitive birds; they react to the meaning of the words (e.g., counting "one, two, three" with fingers).
  • Arms and Legs are like heavy anchors; they react to the emotion and volume of the voice (e.g., waving arms wildly when shouting).

ExpGest separates these. It tells the computer: "Use the words to move the fingers, but use the tone of voice to move the arms." This makes the movement feel incredibly natural, rather than a robot moving every joint at the same time.
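The finger-vs-arm split amounts to routing different conditioning signals to different joint groups. The joint layout and index ranges below are hypothetical, chosen only to make the routing concrete; the paper's actual skeleton and decoders differ.

```python
import numpy as np

# Hypothetical joint layout: indices 0-21 are body joints (arms, legs,
# spine), indices 22-51 are finger joints. The split is illustrative.
BODY = slice(0, 22)
FINGERS = slice(22, 52)

def decode_pose(audio_feat, text_feat, n_joints=52):
    pose = np.zeros((n_joints, 3))
    # Body joints follow the rhythm and energy of the audio ...
    pose[BODY] = audio_feat[: 22 * 3].reshape(22, 3)
    # ... while finger joints follow the semantics of the words.
    pose[FINGERS] = text_feat[: 30 * 3].reshape(30, 3)
    return pose

rng = np.random.default_rng(2)
pose = decode_pose(rng.standard_normal(66), rng.standard_normal(90))
print(pose.shape)  # (52, 3)
```

Because each joint group listens to the signal it is most sensitive to, no single signal has to drive every joint at once.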

4. The "Emotion Coach" (Noise Classifier)

Usually, if you want a character to be "angry," you just tell the computer "Set emotion to 1." This is like giving a robot a single button to press. It often results in a fake, stiff anger.

ExpGest uses a clever trick called a Noise Emotion Classifier.

  • Imagine the computer is trying to draw a picture, but it's stuck in a foggy room.
  • The "Emotion Coach" is a guide standing outside the fog. Every time the computer makes a move, the coach whispers, "You're getting too calm, lean a little more toward 'anger'!"
  • The computer adjusts its "noise" (its random guesses) to follow the coach's direction. This allows the character to transition smoothly from happy to sad, or calm to furious, without looking glitchy.
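The "coach's whisper" is classifier guidance: at each denoising step, the noisy motion is nudged along the gradient that raises the target emotion's score. The sketch below uses a fixed linear probe as the classifier, so its gradient is just a weight column; a real system would backpropagate through a trained noise-robust emotion classifier. All names and sizes here are assumptions.

```python
import numpy as np

N_EMOTIONS = 4  # e.g. happy, angry, sad, calm (ordering is illustrative)

def guidance_gradient(x, target):
    # For this toy linear probe, logits = x.ravel() @ W, so the gradient
    # of the target emotion's logit w.r.t. x is simply W[:, target].
    # A real system backpropagates through the trained classifier instead.
    rng = np.random.default_rng(3)
    W = rng.standard_normal((x.size, N_EMOTIONS))
    return W[:, target].reshape(x.shape)

# Nudge a noisy motion sample toward the target emotion (class 1) —
# this correction would be applied at every denoising step.
x = np.random.default_rng(4).standard_normal((6, 5))
x_guided = x + 0.1 * guidance_gradient(x, target=1)
print(x_guided.shape)  # (6, 5)
```

Because the nudge is a small gradient step rather than a hard label swap, the emotion can drift gradually from one state to another instead of snapping.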

5. The "Walking and Talking" Combo

Finally, most systems can only make a character stand still and wave their hands. ExpGest is the first to let the character locomote (walk, run, sit down) while talking.

Think of it like a scene where a character is pacing back and forth while giving a speech. ExpGest can take a text instruction like "The character walks in a circle, then sits down" and combine it with the audio of the speech. It creates a full-body performance where the walking and the talking happen at the same time naturally.

The Bottom Line

ExpGest is like giving a virtual actor a script, a director, and a full understanding of human psychology. Instead of just moving to the beat of the music, the character now understands the story, feels the emotions, and moves their whole body (from their walking feet to their expressive fingers) in a way that feels truly human.

The result? Virtual speakers that don't just look like they are talking; they look like they are living the conversation.