Imagine you are at a party, chatting with a friend. You tell a sad story about losing your keys, and your friend immediately bursts into laughter. You'd probably feel confused, maybe even a little hurt. That's because their reaction didn't match the moment. In the world of Artificial Intelligence, computers often make this same mistake. They can generate a face that looks real, but the emotion on that face might be completely wrong for the conversation.
This paper introduces a new way to teach AI to be a better conversational partner. Here is the simple breakdown of how they did it:
The Problem: The "Robotic" Listener
Current AI systems are like actors who have memorized a script but don't understand the scene. If you say something angry, the AI might smile because it thinks, "I need to look friendly!" It doesn't understand social norms. It's like a waiter who brings you a birthday cake when you just asked for a glass of water because they think "cake is a good thing."
The researchers wanted to fix this so the AI listener reacts with the right emotion (like disgust when you say something gross, or sadness when you share bad news), not just a default "happy" face.
The Solution: A Two-Step Training Camp
The authors created a system that learns in two distinct phases, kind of like training a new employee.
Step 1: The "Shadowing" Phase (Supervised Fine-Tuning)
First, they taught the AI to be a good mimic. They showed it thousands of videos of real people having conversations. The AI learned to copy the listener's facial movements exactly, like a student shadowing a master.
- The Analogy: Imagine a dance student watching a video of a pro dancer and trying to copy every move perfectly. At this stage, the AI is good at moving its face, but it doesn't really know why it's moving that way. It's just following orders.
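The "shadowing" idea can be sketched in a few lines of toy code. This is a minimal illustration, not the paper's actual model: it assumes a made-up linear model (`sft_step`, `speaker_feature`, `listener_target` are all illustrative names) that learns to reproduce the real listener's expression coefficients by minimizing squared error, which is the essence of supervised mimicry.

```python
# Toy sketch of supervised "shadowing": a linear model is nudged so its
# predicted expression matches the real listener's expression exactly.
# All names and numbers here are illustrative, not from the paper.

def sft_step(weights, speaker_feature, listener_target, lr=0.1):
    """One supervised update: move the predicted expression coefficients
    toward the ground-truth listener's coefficients (squared-error loss)."""
    pred = [w * speaker_feature for w in weights]  # predicted expression
    grads = [2 * (p - t) * speaker_feature for p, t in zip(pred, listener_target)]
    return [w - lr * g for w, g in zip(weights, grads)]

# Toy data: one speaker feature, target expression coefficients [0.5, -0.2]
w = [0.0, 0.0]
for _ in range(100):
    w = sft_step(w, 1.0, [0.5, -0.2])
print([round(x, 2) for x in w])  # converges to [0.5, -0.2]
```

The key point the analogy makes shows up here too: the model ends up copying the target perfectly, but nothing in the loss tells it *why* that expression was the right one.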
Step 2: The "Human Critic" Phase (Reinforcement Learning)
This is the magic part. The researchers realized that copying isn't enough; the AI needs to learn what humans actually prefer.
- The Analogy: Imagine the AI is a comedian trying out new jokes.
  - The AI tells a joke (generates a facial expression).
  - A human judge (the "Critic") watches it.
  - If the joke lands well (the expression matches the emotion), the judge gives a thumbs up.
  - If the joke falls flat (the AI smiles during a sad story), the judge gives a thumbs down.
  - The AI learns from this feedback: "Okay, I shouldn't smile when the speaker is sad."
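The thumbs-up/thumbs-down loop can be sketched as a tiny bandit-style learner. This is a deliberately simplified stand-in for the paper's reinforcement learning setup: the `critic` function, the expression choices, and the update rule are all illustrative assumptions, not the authors' method.

```python
import random

# Toy sketch of the "human critic" loop: the learner picks an expression
# for a sad speaker and shifts toward whatever the critic rewards.
# The critic, expressions, and update rule are illustrative, not the paper's.

random.seed(0)
scores = {"smile": 0.0, "frown": 0.0}  # learned preference per expression

def critic(speaker_mood, expression):
    """Stand-in human judge: thumbs up (+1) if the reaction fits the mood."""
    good = {("sad", "frown"), ("happy", "smile")}
    return 1.0 if (speaker_mood, expression) in good else -1.0

for _ in range(200):
    # Epsilon-greedy: mostly pick the current best, sometimes explore.
    if random.random() < 0.2:
        expr = random.choice(list(scores))
    else:
        expr = max(scores, key=scores.get)
    reward = critic("sad", expr)
    scores[expr] += 0.1 * (reward - scores[expr])  # move score toward feedback

print(max(scores, key=scores.get))  # "frown" wins for a sad speaker
```

After enough rounds of feedback, "frown" accumulates a higher score than "smile" for a sad speaker: the learner has internalized "don't smile when the speaker is sad" without ever being shown the one correct answer.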
The Secret Sauce: "Identity-Free" Feedback
One tricky problem the paper solves is that humans often judge faces based on how "pretty" or "realistic" the person looks, rather than the emotion itself.
- The Analogy: If you ask people to judge a dance, and one dancer is wearing a sparkly costume while the other is in plain clothes, people might say the sparkly one is "better" just because of the costume.
- The Fix: The researchers put the AI's face into a "neutral mask" (a generic 3D model) before showing it to humans. This way, humans judge only the expression (the dance move), not the costume (the face shape). This ensures the AI learns to be socially appropriate, not just visually appealing.
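The "neutral mask" idea can be illustrated with a 3DMM-style split of a face into a generic template, identity offsets, and expression offsets. This is a simplified sketch under that assumption; the vertex values and function names are made up for illustration.

```python
# Simplified sketch of identity-free rendering: a face is modeled as a
# generic template plus identity offsets plus expression offsets, and the
# identity part is zeroed out before human judging. Toy 1-D "vertices".

MEAN_FACE = [1.0, 1.0, 1.0]  # generic neutral template

def render(identity_offsets, expression_offsets):
    """Compose a face from the template, an identity, and an expression."""
    return [m + i + e for m, i, e in zip(MEAN_FACE, identity_offsets, expression_offsets)]

def neutralize(identity_offsets, expression_offsets):
    """Drop the identity (the 'costume'), keep the expression (the 'dance move')."""
    zero_identity = [0.0] * len(identity_offsets)
    return render(zero_identity, expression_offsets)

# Two different people making the same smile...
smile = [0.2, -0.1, 0.0]
face_a = neutralize([0.5, 0.3, -0.2], smile)
face_b = neutralize([-0.4, 0.1, 0.6], smile)
print(face_a == face_b)  # True: judges now see only the expression
```

Because both renders collapse onto the same neutral template, human judges can no longer be swayed by who looks prettier; only the expression itself differs between candidates.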
The Result: A Socially Smart AI
By combining the "mimicry" of Step 1 with the "human feedback" of Step 2, the AI becomes a much better listener.
- Before: The AI might smile when you tell a sad story (Socially Awkward).
- After: The AI frowns or looks concerned when you tell a sad story (Socially Appropriate).
Why This Matters
This isn't just about making robots look cute. It's about making human-computer interaction feel natural. Whether it's a virtual therapist, a customer service bot, or a video game character, we want them to understand the vibe of the conversation. This paper gives them the emotional intelligence to know when to smile, when to frown, and when to just listen quietly.
In short: They taught the AI to stop guessing what to do and start listening to human feedback on what feels right, turning a robotic mimic into a socially aware conversational partner.