Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

The paper introduces TIMAR, a causal turn-level framework that models interleaved audio-visual contexts to generate expressive and temporally coherent 3D conversational head dynamics, significantly outperforming existing methods on the DualTalk benchmark.

Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang

Published 2026-02-27

Imagine you are having a coffee chat with a friend. It's not just about what you say; it's about the whole dance of the conversation. When you speak, your friend nods, raises an eyebrow, or leans in. When they speak, you do the same. You are constantly reacting to each other in real-time.

Now, imagine trying to teach a computer to be that friend. That is the challenge this paper, TIMAR, solves.

Here is the story of how they did it, using some simple analogies.

The Problem: The "Scripted" Robot vs. The "Real" Friend

Before TIMAR, most computer avatars were like actors reading from a script.

  • The "Talking" Robot: If you told a robot to speak, it would move its mouth perfectly. But if you stopped talking and just listened, the robot would freeze or stare blankly. It didn't know how to listen.
  • The "Listening" Robot: If you told a robot to listen, it might nod. But it couldn't seamlessly switch to talking when it was its turn.
  • The Old Way: Researchers tried to fix this by feeding the computer the entire conversation at once (like reading a whole book before writing a sentence). This works for movies, but it's terrible for a live chat. If you are on a video call, you can't wait for the other person to finish their whole story before you start reacting. You need to react now.

The Solution: TIMAR (The "Turn-Taking" Maestro)

The authors created a system called TIMAR. Think of TIMAR not as a robot, but as a conversational conductor who understands the rhythm of a duet.

Here is how it works, broken down into three simple steps:

1. The "Turn-Based" Game (Causal Modeling)

Imagine a game of tennis. You hit the ball (your turn), then your opponent hits it back (their turn). You can't hit the ball before they hit it back.

  • Old AI: Tried to see the whole match before making a move.
  • TIMAR: Plays strictly by the rules of the game. It only looks at what has happened so far. It treats the conversation as a series of "turns." When it's the agent's turn to react, it looks at everything that happened in the previous turns to decide what to do next. This ensures the robot never "cheats" by peeking at the future.
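The "no peeking at the future" rule can be sketched as a turn-level causal attention mask. This is an illustrative toy, not the paper's code: `turn_ids` and `turn_causal_mask` are made-up names, and a real model would apply this mask inside a transformer's attention layer.

```python
import numpy as np

def turn_causal_mask(turn_ids):
    """Build a mask where token i may attend to token j only if
    token j belongs to the same turn as i, or to an earlier turn."""
    t = np.asarray(turn_ids)
    # mask[i, j] is True when attention from i to j is allowed
    return t[:, None] >= t[None, :]

# Four tokens spread over three conversational turns
mask = turn_causal_mask([0, 0, 1, 2])
print(mask.astype(int))
# → [[1 1 0 0]
#    [1 1 0 0]
#    [1 1 1 0]
#    [1 1 1 1]]
```

Row 3 (the last turn) can see the whole past, but row 0 (the first turn) is blind to everything that comes later, which is exactly why the model can run live, turn by turn.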

2. The "Interleaved" Sandwich (Multimodal Fusion)

In a real conversation, speech and body language are layered together, like the ingredients of a sandwich.

  • Old AI: Often treated speech and head movements as separate ingredients (e.g., "First calculate the mouth, then calculate the head").
  • TIMAR: Makes a sandwich. It takes a slice of your speech, a slice of your head nod, a slice of the robot's speech, and a slice of the robot's head nod, and stacks them all together in one timeline.
  • Why? This allows the AI to see the connection: "Oh, the user nodded while saying 'yes', so the robot should probably smile and say 'great'." It understands the relationship between what is said and how it's said.
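The "sandwich" idea amounts to flattening both participants' speech and motion into one chronological token stream. The sketch below is a simplified illustration with invented names (`interleave_turns`, the dict keys), not the paper's actual tokenizer:

```python
def interleave_turns(turns):
    """turns: time-ordered list of dicts with 'speaker', 'audio', 'motion'.
    Returns one flat sequence where each turn contributes its speech
    and its head motion side by side."""
    sequence = []
    for turn in turns:
        # Speech and motion from the same turn sit next to each other,
        # so the model sees WHAT was said alongside HOW it was said.
        sequence.append((turn["speaker"], "audio", turn["audio"]))
        sequence.append((turn["speaker"], "motion", turn["motion"]))
    return sequence

timeline = interleave_turns([
    {"speaker": "user", "audio": "yes", "motion": "nod"},
    {"speaker": "agent", "audio": "great", "motion": "smile"},
])
print([f"{s}:{m}" for s, m, _ in timeline])
# → ['user:audio', 'user:motion', 'agent:audio', 'agent:motion']
```

Because the user's nod lands right next to the user's "yes" in the sequence, attention over this timeline can link the two, which is the correlation the bullet above describes.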

3. The "Diffusion" Artist (Lightweight Diffusion Head)

This is the magic part. How does the robot decide exactly how to move its head?

  • Old AI: Tried to guess the exact position of the head, like a robot trying to draw a straight line. This often looked stiff or "uncanny."
  • TIMAR: Uses a Diffusion process. Imagine a sketch artist who starts with a blurry, noisy cloud of ink. Slowly, they refine the cloud, removing the noise, until a clear, expressive face emerges.
  • The Magic: Because it starts with "noise," the robot doesn't just copy the exact same nod every time. It adds natural human variability. Sometimes it nods quickly, sometimes slowly. It captures the "soul" of the movement, not just the math.
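A toy sketch of the denoising idea, under loud assumptions: a real diffusion head uses a learned network and a proper noise schedule, whereas here we cheat by stepping straight toward a known target pose. The point is only the shape of the loop: start from noise, refine repeatedly, keep a little randomness so no two samples are identical.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(target_pose, steps=50):
    # Start from a "blurry cloud of ink": pure Gaussian noise
    x = rng.normal(size=target_pose.shape)
    for step in range(steps):
        frac = 1.0 - step / steps  # remaining noise level, shrinks each step
        # In a real model a network predicts the denoising direction;
        # here we step toward the known target as a stand-in.
        x = x + 0.2 * (target_pose - x)
        # Re-inject a little noise, less and less as we converge
        x = x + rng.normal(scale=0.05 * frac, size=x.shape)
    return x

target = np.array([0.10, -0.30, 0.05])  # e.g. pitch/yaw/roll of a nod
sample_a = denoise(target)
sample_b = denoise(target)
# Both samples land near the target pose, but never exactly the same:
# the leftover randomness is the "natural human variability".
```

Running `denoise` twice gives two nods that agree on the gesture but differ in the fine detail, which is the variability the bullet above credits to starting from noise.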

Why This Matters (The Results)

The paper tested TIMAR against the previous best system, DualTalk, on the benchmark of the same name.

  • Realism: TIMAR's robot looked 15–30% more natural. It didn't look like a puppet; it looked like a person.
  • Streaming: Because TIMAR plays by "turn" rules, it can work in real-time video calls. It doesn't need to wait for the whole conversation to finish.
  • Robustness: If the internet cuts out for a second (missing data), TIMAR is less likely to crash or look weird compared to older models. It's like a good conversationalist who can keep the flow going even if you stumble over your words.

The Big Picture

TIMAR is a breakthrough because it finally treats a conversation as a continuous, two-way dance rather than two separate monologues.

  • It respects the timing (it waits for its turn).
  • It respects the context (it remembers what you said 10 seconds ago).
  • It respects the human element (it moves with natural, slightly unpredictable flair).

In short, TIMAR is the first step toward digital avatars that don't just simulate a conversation, but actually feel like they are in one with you.
