Imagine you are talking to a digital assistant, like a very advanced Siri or Alexa. Right now, these assistants are like talking heads or text bubbles. They can understand your words and reply with words, but they have no body. They don't nod when they agree, they don't shrug when they are confused, and they don't wave when they say hello. They are stuck in a "text-only" world.
The paper introduces MIBURI, a new system designed to give these digital assistants a full body that moves naturally while they talk. Think of MIBURI as the "Body Language Coach" for AI.
Here is how it works, broken down into simple concepts:
1. The Problem: The "Scriptwriter" vs. The "Improviser"
Most current systems that make digital characters move are like scriptwriters. They wait until the whole sentence is finished, read the entire script, and then decide what the character should do.
- The Issue: In real life, humans don't wait for the whole sentence to finish to start moving. We gesture while we speak. If a system waits for the whole sentence, the movement feels robotic, delayed, or out of sync.
- The Old Way: "I will listen to your whole story, then I will calculate the perfect dance moves." (Too slow, feels fake).
2. The Solution: The "Live Improviser"
MIBURI is different. It is an online, causal framework: "online" means it works in real time, and "causal" means it relies only on the speech it has already heard. In plain English, it is an improviser.
- It listens to your voice as it happens.
- It decides on a hand wave or a head nod immediately based on the sound it just heard.
- It doesn't need to know what you are going to say in the future to know what to do right now.
The Analogy: Imagine a jazz musician.
- Old Systems: They wait for the whole song to be written down before they play a single note.
- MIBURI: It listens to the rhythm of the music as it's being played and instantly improvises a melody that fits perfectly.
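The jazz-musician idea can be sketched in a few lines of code. This is an illustrative toy, not the paper's implementation: `stream_gestures` and `ToyModel` are hypothetical names, and the "model" here just tracks loudness. The point is the shape of the loop: one gesture frame is emitted per audio chunk, using only past audio.

```python
from collections import deque

def stream_gestures(audio_chunks, model):
    """Online, causal loop: consume each audio chunk as it arrives and
    emit a gesture frame immediately -- no peeking at future audio."""
    history = deque(maxlen=64)              # only the PAST is kept
    for chunk in audio_chunks:
        history.append(chunk)
        # The model sees past context only, so latency is one chunk,
        # not one whole sentence.
        yield model.predict_frame(list(history))

class ToyModel:
    """Toy stand-in: gesture energy follows the loudness of the most
    recent chunk. A real model would be a learned network."""
    def predict_frame(self, past):
        return {"gesture_energy": abs(past[-1])}

frames = list(stream_gestures([0.1, -0.5, 0.9], ToyModel()))
# One frame per chunk, produced as each chunk arrives.
```

Contrast this with the "scriptwriter" approach, which would buffer all chunks, wait for the end of the utterance, and only then compute motion.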
3. How It Does It: The "Secret Decoder Ring"
The magic behind MIBURI is how it connects the voice to the body.
- The Brain: It uses a powerful AI model called Moshi (which is great at understanding speech and text).
- The Secret Sauce: Instead of waiting for the AI to finish speaking and then translating those words into gestures, MIBURI taps directly into the internal "thoughts" (tokens) of the AI as it is generating them.
- The Metaphor: Imagine a puppeteer.
- Old way: The puppeteer reads the script, then pulls the strings.
- MIBURI: The puppeteer has a direct wire from the actor's brain to the puppet's strings. As soon as the actor thinks "I'm excited," the puppet's arms jump up instantly.
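The "direct wire" can be pictured as a loop where the gesture decoder consumes the speech model's tokens the moment each one is produced, rather than waiting for a finished transcript. All names below (`talk_and_move`, `ToySpeech`, `ToyDecoder`) are hypothetical stand-ins for illustration; the real system conditions a motion model on Moshi's token stream.

```python
def talk_and_move(speech_model, gesture_decoder, prompt):
    """Feed each generated token straight into the gesture decoder:
    the pose updates before the utterance is complete."""
    pose = gesture_decoder.initial_pose()
    for token in speech_model.generate_tokens(prompt):   # streamed
        pose = gesture_decoder.step(token, pose)          # direct wire
        yield token, pose

class ToySpeech:
    """Toy 'speech model': yields one token (word) at a time."""
    def generate_tokens(self, prompt):
        yield from prompt.split()

class ToyDecoder:
    """Toy 'gesture decoder': the pose is just a number that grows
    with each token, standing in for a real skeletal pose."""
    def initial_pose(self):
        return 0
    def step(self, token, pose):
        return pose + len(token)

stream = list(talk_and_move(ToySpeech(), ToyDecoder(), "hi there"))
```

The key property is in the control flow: there is no point where the full sentence exists before any motion is produced.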
4. The "Body Parts" Strategy
One of the paper's clever tricks is how it handles the body. It doesn't try to move the whole body as one giant blob. It breaks the body into three teams:
- The Face: For expressions (smiles, frowns).
- The Upper Body: For hand gestures and arm movements.
- The Lower Body: For walking or shifting weight.
It treats these like three different musicians in a band. They all listen to the same "speech rhythm" but play their own specific parts. This allows for complex, natural movements (like a head nod while the hands are still) without the system getting confused.
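The band analogy maps naturally onto code: one shared speech signal drives three independent decoders whose outputs are merged into a full-body frame. This is a structural sketch only (the class and function names are invented, and a real system would use learned networks per part, not a gain knob):

```python
class PartDecoder:
    """One 'musician': interprets the shared speech feature its own way."""
    def __init__(self, name, gain):
        self.name, self.gain = name, gain
    def decode(self, speech_feature):
        return {self.name: speech_feature * self.gain}

def full_body_frame(speech_feature, decoders):
    """Merge the three parts' outputs into one full-body frame."""
    frame = {}
    for d in decoders:
        frame.update(d.decode(speech_feature))
    return frame

band = [PartDecoder("face", 1.0),
        PartDecoder("upper_body", 0.5),
        PartDecoder("lower_body", 0.1)]
frame = full_body_frame(2.0, band)
# frame now maps 'face' -> 2.0, 'upper_body' -> 1.0, 'lower_body' -> 0.2
```

Because each part decodes independently, the face can be highly active while the lower body barely moves, which is exactly the "head nod while the hands are still" behavior the paper is after.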
5. Why It Matters: The "Uncanny Valley"
When digital characters move poorly, it feels creepy (the "Uncanny Valley"). They look like zombies.
- MIBURI is designed to be expressive. It doesn't just move; it conveys emotion.
- It prevents the character from freezing into a statue (a common problem with AI) by using special math tricks to ensure the movements are diverse and lively.
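One common way to avoid the "statue" failure mode is to sample from the model's distribution over next poses instead of always taking the single most likely one. The sketch below shows that generic idea with temperature-scaled sampling; it is a stand-in for the paper's actual diversity techniques, not a description of them.

```python
import random

def pick_pose(pose_probs, temperature=1.0, rng=random):
    """Sample the next pose instead of taking the argmax.
    temperature -> 0 approaches argmax (frozen, repetitive motion);
    temperature > 1 flattens the distribution (more varied motion)."""
    poses = list(pose_probs)
    weights = [p ** (1.0 / temperature) for p in pose_probs.values()]
    return rng.choices(poses, weights=weights, k=1)[0]

probs = {"idle": 0.9, "wave": 0.05, "nod": 0.05}
# Near-zero temperature: the character almost always stays "idle".
# Higher temperature: "wave" and "nod" appear far more often.
```

Always picking the most likely pose is precisely what produces a character that defaults to standing still; injecting controlled randomness keeps the motion lively.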
Summary
MIBURI is a breakthrough because it finally allows digital assistants to have real-time body language.
- Before: You talk to a robot that stands still, then suddenly waves its hand after you finish speaking.
- With MIBURI: You talk to a robot that nods, smiles, and gestures in the exact moment you are speaking, just like a human friend would.
It bridges the gap between "smart computer" and "human-like companion," making our future conversations with AI feel much more natural and less like talking to a calculator.