Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation

The paper proposes MMFA, a novel unsupervised method that decouples identity from motion information through self-supervised representation learning and a new keypoint computation strategy, enabling controllable and interpolatable face animation with realistic results.

Hong Li, Boyu Liu, Xuhui Liu, Baochang Zhang

Published 2026-03-05

Imagine you have a photo of a friend, and you want to make them talk, smile, or turn their head in a video, but you don't want to lose their unique look. This is the magic of Face Animation.

However, existing magic tricks have a flaw: when you try to make the friend turn their head, their face often stretches weirdly, or when they smile, their whole head changes size. It's like trying to swap drivers in a car, only to find the engine (the identity) is wired directly to the steering wheel (the motion).

This paper introduces a new method called MMFA (Motion Manipulation via unsupervised keypoint positioning in Face Animation). Think of it as a "Smart Puppeteer" that can control a face without breaking the character.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Tangled Yarn"

Imagine a ball of yarn where the color of the thread represents the person's face (their identity), and the direction the thread is pulled represents how they move (smiling, turning, looking up).

  • Old methods tried to pull the thread to make the face move, but they accidentally pulled the color thread too. So, when the person turned their head, their face shape changed, or they started looking like a different person.
  • The Goal: We need to untangle the yarn so we can pull the "motion" thread without messing up the "identity" thread.

2. The Solution: MMFA's Three Magic Tools

Tool A: The "Universal Skeleton" (Keypoint Decomposition)

Instead of guessing where the face is, MMFA builds a Universal Skeleton (called "canonical keypoints").

  • Analogy: Imagine every human face is built on the same invisible mannequin.
  • How it works: The system first finds this invisible mannequin for the source photo. Then, it calculates exactly how much to rotate, move, and scale (zoom in/out) that mannequin to match the driving video.
  • The Trick: It adds a special "zoom factor" to handle the fact that faces look bigger when they are close to the camera and smaller when far away. This ensures that when the person smiles, the system knows it's just a smile, not the face getting bigger.
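In code, this "mannequin matching" boils down to a similarity transform: take the canonical keypoints and apply a rotation, a translation, and the scalar zoom factor estimated from the driving frame. The sketch below is a minimal illustration of that idea, not the paper's exact formulation; the variable names and shapes are assumptions for clarity.

```python
import numpy as np

def transform_keypoints(canonical_kp, rotation, translation, scale):
    """Map canonical (identity-only) keypoints into the driving frame.

    canonical_kp: (N, 3) array of 3D keypoints on the "invisible mannequin".
    rotation:     (3, 3) rotation matrix estimated from the driving frame.
    translation:  (3,) head translation in the driving frame.
    scale:        scalar "zoom factor" compensating for camera distance.
    """
    return scale * canonical_kp @ rotation.T + translation

# Example: a scale of 1.5 moves the keypoints apart uniformly, the way a
# face grows when it nears the camera -- relative geometry (identity) is
# untouched, so the system never mistakes zoom for a changed face shape.
kp = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0]])
out = transform_keypoints(kp, np.eye(3), np.zeros(3), 1.5)
```

Because scale, rotation, and translation are separate parameters, each can be edited independently later, which is what makes the animation controllable.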

Tool B: The "Identity Guardian" (Self-Supervised Learning)

This is the system's way of saying, "I promise to keep the face looking like the original person."

  • Analogy: Imagine a strict bouncer at a club. No matter how much the person dances (moves their head) or changes their outfit (expression), the bouncer checks their ID card constantly to make sure it's still the same person.
  • How it works: The AI is trained to look at a face, twist it, zoom it, and then check: "Is this still the same person?" If the answer is "No," it learns to fix it. This separates the expression (the smile) from the pose (the head turn) so you can control them independently.
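A common way to implement this kind of "bouncer" is an identity-consistency loss: embed the source face and the animated result, then penalize any drift between the two embeddings. The paper's exact losses aren't spelled out here, so the snippet below is a generic hedged sketch using cosine distance; `embed_src` and `embed_driven` stand in for outputs of some identity encoder.

```python
import numpy as np

def identity_loss(embed_src, embed_driven):
    """Penalize identity drift: the animated face should keep the source
    person's identity. Computes cosine distance between L2-normalized
    identity embeddings (0 = same identity, up to 2 = opposite)."""
    a = embed_src / np.linalg.norm(embed_src)
    b = embed_driven / np.linalg.norm(embed_driven)
    return 1.0 - float(a @ b)

# If the embeddings match, the loss is ~0: the "ID check" passes no
# matter how the head was posed or the expression was changed.
e = np.array([0.6, 0.8])
loss_same = identity_loss(e, e)
```

During training, minimizing a term like this while the pose and expression codes vary freely is what forces the network to store identity and motion in separate places.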

Tool C: The "Smooth Slider" (The VAE)

This is the coolest part. It allows you to create new expressions that don't exist in the original video.

  • Analogy: Imagine a music equalizer with sliders for "Happy," "Sad," and "Surprised." Old methods could only play the songs they already had recorded. MMFA builds a Smooth Slider (a continuous space) where you can slide the dial from "Neutral" to "Big Grin" and get every tiny step in between.
  • How it works: The system uses a special math trick (a Variational Autoencoder) to turn facial expressions into a smooth, continuous map. You can pick any point on this map to generate a perfect, natural-looking smile, even if the original video never showed that exact smile.
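The "slider" itself is just linear interpolation between two points in the VAE's latent space: because the VAE is trained to make that space smooth, every intermediate point decodes to a plausible expression. A minimal sketch, with stand-in latent vectors in place of real encoder outputs:

```python
import numpy as np

def interpolate_expressions(z_neutral, z_grin, alpha):
    """Slide between two expression codes in a smooth latent space.
    alpha = 0.0 -> neutral, alpha = 1.0 -> big grin; values in between
    decode to intermediate expressions the video never showed."""
    return (1.0 - alpha) * z_neutral + alpha * z_grin

z_a = np.zeros(4)   # hypothetical latent code for "neutral"
z_b = np.ones(4)    # hypothetical latent code for "big grin"

# Sweep the slider: each step is a new, natural-looking expression.
frames = [interpolate_expressions(z_a, z_b, a)
          for a in np.linspace(0.0, 1.0, 5)]
```

In a full system, each interpolated code would be fed through the VAE's decoder to produce the keypoint motion for that frame; the interpolation step shown here is what makes the transition continuous rather than a jump between recorded poses.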

3. Why is this better than what we have now?

  • No "Melting" Faces: Because it separates the motion from the identity, the face doesn't stretch or warp weirdly when the person turns their head.
  • Total Control: You can make a person look left, right, up, or down, and smile or frown, all independently.
  • Better Quality: In tests, MMFA created faces that looked more real and kept the person's identity better than previous "state-of-the-art" methods. It's like the difference between a blurry, distorted photocopy and a high-definition 4K photo.

Summary

MMFA is like giving a puppeteer a set of independent strings.

  • One string controls the Head Turn.
  • One string controls the Zoom.
  • One string controls the Smile.

Before, pulling one string would accidentally tug on the others. Now, you can pull the "Smile" string as hard as you want, and the "Head Turn" string stays perfectly still, keeping the person's face looking exactly like themselves. This makes for incredibly realistic and controllable digital avatars for video calls, movies, and virtual reality.