Imagine you are directing a movie. If you could give your actor only one instruction, you might say, "Be angry!" and they would stay angry the whole time. But in real life, emotions are messy and fluid. You might start off furious, then slowly calm down as you explain your side of the story, or you might get scared in the middle of a sentence.
The Problem with Old Tech
Current "Talking Face" technology is like a robot actor who can only hold one pose. If you feed it a script and tell it to be "sad," the character will look sad from the very first second to the very last, even if the words they are saying suggest they are getting angry or happy. It's like a song where the volume never changes; it's flat and unnatural.
The New Solution: TIE-TFG
This paper introduces a new system called TIE-TFG (Temporal-Intensive Emotion Modulated Talking Face Generation). Think of this system as a super-smart director who doesn't just give the actor a single instruction, but a detailed script of emotional shifts.
Here is how it works, broken down into simple metaphors:
1. The Scriptwriter (Text-to-Speech)
Instead of just typing "Be angry," you can type a description like: "Start off very angry, but gradually calm down as you speak."
The system first uses a powerful voice synthesizer to create audio that matches this description. It's like a voice actor who knows exactly when to shout and when to whisper based on your text instructions.
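To make the idea concrete, here is a minimal sketch of this first stage: a script plus a free-text emotion description go in, and audio comes out. The function name, signature, placeholder body, and example text below are purely illustrative assumptions, not the paper's actual interface.

```python
def synthesize_speech(script: str, emotion_description: str) -> list[float]:
    """Placeholder TTS: would return audio samples whose volume, pace,
    and pitch follow the free-text emotion description."""
    return [0.0] * 16000  # one second of "silence" standing in for real speech


audio = synthesize_speech(
    script="I can't believe you did this... but I suppose it worked out.",
    emotion_description="Start off very angry, but gradually calm down as you speak.",
)
```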
2. The Emotion Translator (The "Fluctuation Predictor")
This is the brain of the operation. The system takes that audio and the text and asks: "Okay, at this exact second, is the character 80% angry and 20% sad? Or is it 100% calm?"
It creates a timeline of emotions, second by second. Imagine a music equalizer that doesn't just show volume, but shows the mood changing with every beat. This allows the system to know that the character should look furious at the start of the sentence but relaxed by the end.
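As a rough sketch of what such a timeline might look like in code, assume a small fixed set of emotions and 25 frames per second (both assumptions for illustration, not details from the paper). The toy version below simply fades from "angry" to "calm"; the real fluctuation predictor infers these mixtures from the audio and text rather than interpolating them.

```python
EMOTIONS = ["angry", "sad", "happy", "calm"]
FPS = 25  # assumed frame rate, for illustration only


def toy_timeline(duration_s: float) -> list[dict[str, float]]:
    """Per-frame emotion mixtures that fade from 'angry' to 'calm'."""
    n_frames = int(duration_s * FPS)
    timeline = []
    for i in range(n_frames):
        t = i / max(n_frames - 1, 1)            # 0.0 at the start, 1.0 at the end
        weights = {emotion: 0.0 for emotion in EMOTIONS}
        weights["angry"] = 1.0 - t              # furious at first...
        weights["calm"] = t                     # ...relaxed by the end
        timeline.append(weights)                # each frame gets a mixture of emotions
    return timeline


timeline = toy_timeline(duration_s=2.0)
print(timeline[0])   # mostly "angry"
print(timeline[-1])  # mostly "calm"
```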
3. The Animator (The Visual Generator)
Finally, the system takes a photo of a person (the "Reference Image") and starts animating them. But instead of just moving their lips to match the words, it uses that emotion timeline to tweak their face.
- The Lips: Move in sync with the spoken words.
- The Eyebrows: Furrow when the "anger" score is high.
- The Head: Nods or shakes when the "calm" score rises.
It's like having a puppet master who is pulling strings not just for the mouth, but for the entire face, changing the expression frame-by-frame to match the emotional story.
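Putting the pieces together, a simplified rendering loop might look like the sketch below. The helper functions and names are invented placeholders for the actual neural generator; the point is only that every frame is driven by both the speech and the current emotion mixture.

```python
def lip_features_for_frame(audio, frame_idx):
    """Placeholder: audio features that drive the mouth shape for one frame."""
    return {"mouth_open": 0.3}


def render_frame(reference_image, lip_features, emotion_weights):
    """Placeholder: a real generator would warp the reference face so the
    mouth matches the speech while eyebrows, eyes, and head pose follow
    the emotion mixture for this frame."""
    return {"lips": lip_features, "emotion": emotion_weights}


def animate(reference_image, audio, timeline):
    """Render one frame per entry in the emotion timeline."""
    frames = []
    for frame_idx, emotion_weights in enumerate(timeline):
        lips = lip_features_for_frame(audio, frame_idx)
        frames.append(render_frame(reference_image, lips, emotion_weights))
    return frames


video = animate(
    reference_image="actor_photo.png",
    audio=[0.0] * 16000,  # stand-in for the synthesized speech
    timeline=[{"angry": 1.0}, {"angry": 0.5, "calm": 0.5}, {"calm": 1.0}],
)
```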
Why is this a Big Deal?
- Realism: Humans are emotional chameleons. We don't stay in one mood for 30 seconds straight. This tech makes digital humans feel alive because their faces "breathe" with emotion.
- Control: You aren't stuck with a fixed emotion. You can tell the AI, "Be happy, then suddenly scared," and it will actually do it.
- The "Pseudo-Label" Trick: Since it's impossible to manually tag every single second of a video with an emotion (that would take forever), the researchers taught the AI to "guess" the emotions by watching thousands of real videos first. It's like teaching a student by showing them a thousand movies before asking them to direct their own.
The Result
The paper shows that this new method creates videos where the character's face changes naturally, just like a real person would. If the character is telling a joke that starts serious and ends funny, their face will actually shift from serious to a smile, rather than staying frozen in one expression.
In short, they moved from digital puppets that can only hold one pose to digital actors who can feel, change their minds, and express a full range of human emotions in real time.