Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

Imagine you have a single, static photo of a friend. Now, imagine you want to make that photo come alive, making your friend blink, smile, or talk, but you want to use the facial expressions of a different person (like a movie actor) as the guide.

This is the challenge that Export3D, the method proposed in this paper, tackles. It's like trying to put a new soul into an old body without accidentally swapping the body's features (like giving your friend the actor's nose or eye shape).

Here is a simple breakdown of how they did it, using some everyday analogies:

The Problem: The "Bad Copy-Paste"

Previous methods for animating faces were like using a "Copy and Paste" tool in a photo editor. They tried to stretch and warp the source photo to match the movement of the driver.

The Issue: When you stretch a photo too much, it gets blurry or weird. Worse, when you try to copy an expression from a different person, the software often accidentally copies their appearance too. It's like trying to put a smile on your face, but suddenly your face turns into your friend's face because the software got confused between "smiling" and "who you are."

The Solution: The "3D Blueprint" (Tri-Plane)

Instead of just warping a flat 2D photo, Export3D builds a 3D mental blueprint of the face.

The Analogy: Think of a flat photo as a piece of paper. Export3D turns that paper into a 3D clay sculpture.
How it works: They use a "tri-plane," which is like three transparent sheets of glass that intersect at right angles (roughly the front, side, and top views). Together, these sheets hold the 3D information of the face. Because it's a 3D model, the computer can work out how the face should look from other angles, not just the front.

The Secret Sauce: Cleaning the "Expression Signal"

The biggest hurdle is that the data used to describe facial expressions (called 3DMM parameters) is messy. It's like a radio signal that has both the music (the expression) and the DJ's voice (the person's identity) mixed together. If you just play the signal, you get the wrong DJ's voice.

1. The "Noise-Canceling" Headphones (Contrastive Pre-training)
The authors created a special training method called CLeBS.

The Analogy: Imagine you have a playlist of a singer singing different songs. You want to teach a robot to recognize just the melody (the expression) without caring about the singer's voice (the identity).
How it works: They showed the AI many videos, each of a single person making different faces. Since the person never changes within a video, the AI learned to ignore the person's unique features and focus only on the movement. It's like training a chef to recognize the taste of "spicy" regardless of whether the food is in a red bowl or a blue bowl. This creates a "pure" expression signal with far less identity contamination.

2. The "Smart Remote Control" (EAdaLN)
Once they have the "pure" expression signal, they need to apply it to the source photo.

The Analogy: Instead of trying to rebuild the whole clay sculpture from scratch, they use a "remote control" that tweaks the existing clay.
How it works: They use a special layer called EAdaLN. Think of this as a dial on a mixing board. You take the "pure" expression signal and turn the dials to adjust the source face. It tells the 3D blueprint: "Okay, keep the nose and eyes exactly as they are, but change the mouth shape to match this smile."

The Result: A Seamless Animation

Finally, the computer takes this modified 3D blueprint and "renders" it back into a 2D video.

The Magic: Because the system understands 3D space, you can rotate the camera, and the face will look natural from the side, top, or bottom.
No Identity Swap: Because they cleaned the expression signal first, your friend stays looking like your friend, even if they are making the actor's face.

Summary

Export3D is like a high-tech puppet master.

It builds a 3D clay model of your photo.
It uses noise-canceling headphones to isolate the "pure" movement from the "person" in the driving video.
It uses a smart remote to apply that movement to your clay model without changing the model's features.
The result is a video where your photo comes alive with a new expression, looking natural from any angle, without turning into someone else.

Here is a detailed technical summary of the paper "Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation" (Export3D).

1. Problem Statement

The paper addresses the challenge of one-shot 3D-aware portrait animation, specifically the ability to animate a source portrait image with the facial expressions and camera views of a driving video while preserving the source identity.

Key challenges identified in existing methods include:

Appearance-Expression Entanglement: In cross-identity transfer (animating Person A with Person B's expression), existing methods often inadvertently swap the source's appearance (e.g., changing eye shape or face contour) along with the expression. This is because facial expressions and identity features are highly entangled in image space and standard motion representations.
Limitations of 2D Warping: Most 2D-based methods rely on image warping or motion fields, which struggle to disentangle global head motion from local facial expressions, leading to temporal inconsistencies or artifacts.
Limitations of 3D Deformation: Existing 3D methods (e.g., using NeRFs or deformation fields) often suffer from video-level artifacts like flickering or fail to faithfully reconstruct the source identity when using latent codes.
3DMM Limitations: While 3D Morphable Models (3DMM) provide explicit expression parameters, these raw parameters are inherently entangled with identity/appearance information, causing "appearance swap" when transferred directly to a different person.

2. Methodology: Export3D

The authors propose Export3D, a framework that generates a 3D-aware tri-plane representation conditioned on the source image and driving expression parameters. The pipeline consists of three main components:

A. Contrastive Learned Basis Scaling (CLeBS)

To solve the appearance-expression entanglement, the authors introduce a pre-training framework to extract appearance-free expression representations.

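Contrastive Learning: They train an expression encoder ( $f_e$ ) on video datasets. Training frames are drawn from the same video, so identity stays fixed while expression varies. A frame paired with its own expression parameters forms a positive (aligned) pair; a frame paired with another frame's parameters forms a negative (misaligned) pair. Because every pair shares one identity, the encoder can only tell them apart by expression, which pushes it toward representations that are invariant to identity (appearance) but sensitive to expression.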
Orthogonal Structure (LeBS): The authors observe that 3DMM parameters can be mapped to an orthogonal basis. They introduce a Learned Basis Scaling (LeBS) module that projects high-dimensional 3DMM parameters ( $\beta \in \mathbb{R}^{64}$ ) into a low-dimensional space ( $\lambda \in \mathbb{R}^n$ ) and scales a learned orthonormal basis ( $V$ ).
- Formula: $\beta' = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i$, where $\mathbf{v}_i$ are orthogonal basis vectors.
- This ensures that different expression directions are mathematically orthogonal, preventing the mixing of identity features.
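The two ingredients of CLeBS described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear projection `W`, the output dimension `d`, the temperature value, and the CLIP-style symmetric cross-entropy are all hypothetical choices, and the basis is built with a QR decomposition here, whereas the summary describes it as learned during training.

```python
import numpy as np

def orthonormal_basis(d, n, rng):
    """Return a (d, n) matrix whose columns are orthonormal (requires d >= n)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, n)))
    return q

def lebs(beta, W, V):
    """Map 3DMM parameters beta (B, 64) to beta' (B, d) by scaling the basis V."""
    lam = beta @ W      # (B, n) coefficients; a plain linear map is an assumption
    return lam @ V.T    # beta' = sum_i lambda_i * v_i

def symmetric_contrastive_loss(img_feat, exp_feat, temperature=0.07):
    """CLIP-style loss (assumed form): row i of img_feat matches row i of exp_feat."""
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    exp = exp_feat / np.linalg.norm(exp_feat, axis=1, keepdims=True)
    logits = img @ exp.T / temperature          # (B, B) similarity matrix

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))      # aligned pairs sit on the diagonal

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
V = orthonormal_basis(d=128, n=16, rng=rng)
W = 0.1 * rng.standard_normal((64, 16))
beta = rng.standard_normal((8, 64))             # a batch of frames from one video
beta_prime = lebs(beta, W, V)                   # shape (8, 128)
```

Because the columns of `V` are orthonormal, multiplying `beta_prime` by `V` recovers the coefficients `lam` exactly, so each coefficient controls its own direction without leaking into the others.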

B. Hybrid Tri-plane Generator

Instead of predicting deformation fields (which cause flickering) or warping 2D images, Export3D directly generates a Tri-plane ( $T$ ) representing the 3D prior of the source identity with the driving expression.
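To make the tri-plane idea concrete, here is a rough sketch of how a 3D point can be turned into a feature vector by looking it up on three axis-aligned feature planes. It follows the common tri-plane formulation (project the point onto each plane, sample bilinearly, and sum the results); whether Export3D aggregates by summation, and the plane names `xy`, `xz`, `yz`, are assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at coords u, v in [-1, 1]."""
    C, R, _ = plane.shape
    # Map normalized coordinates to continuous pixel positions.
    x = np.clip((u + 1.0) * 0.5 * (R - 1), 0, R - 1)
    y = np.clip((v + 1.0) * 0.5 * (R - 1), 0, R - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, R - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, R - 2)
    wx, wy = x - x0, y - y0                       # interpolation weights
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x0 + 1] * wx
    bot = plane[:, y0 + 1, x0] * (1 - wx) + plane[:, y0 + 1, x0 + 1] * wx
    return (top * (1 - wy) + bot * wy).T          # (N, C)

def query_triplane(planes, points):
    """planes: dict with keys 'xy', 'xz', 'yz' -> (C, R, R); points: (N, 3)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))

# Toy usage: three random 32-channel planes at 64x64 resolution.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((32, 64, 64)) for k in ("xy", "xz", "yz")}
points = rng.uniform(-1.0, 1.0, size=(5, 3))
features = query_triplane(planes, points)         # shape (5, 32)
```

In a full pipeline, a small decoder would map each feature vector to a density and a color for rendering; that decoder is omitted here.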

Architecture: The generator combines a Vision Transformer (ViT) and Convolutional layers.
Expression Adaptive Layer Normalization (EAdaLN): This is the core mechanism for injecting expression. Instead of using cross-attention, the refined expression vector $\beta'$ (from CLeBS) modulates the visual tokens via adaptive layer normalization right before the Multi-Head Self-Attention (MSA) and Mix-FFN blocks.
- Formula: $\text{EAdaLN}(x, \beta') = \sigma(\beta') \cdot \text{LN}(x) + \mu(\beta')$.
Process: The source image is converted to visual tokens, processed through EAdaLN-ViT blocks conditioned on $\beta'$, and upsampled to generate the tri-plane $T_{\beta_D}(S)$.
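The EAdaLN formula can be written out directly. The sketch below assumes that $\sigma(\cdot)$ and $\mu(\cdot)$ are single linear layers applied to $\beta'$ and that the same scale and shift are broadcast over all tokens of an image; the summary does not specify either detail, so treat the weight names (`W_sigma`, `b_sigma`, `W_mu`, `b_mu`) as hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def eadaln(x, beta_prime, W_sigma, b_sigma, W_mu, b_mu):
    """EAdaLN(x, beta') = sigma(beta') * LN(x) + mu(beta').

    x:          (B, N, D) visual tokens
    beta_prime: (B, E) refined expression vectors
    """
    sigma = beta_prime @ W_sigma + b_sigma    # (B, D) per-channel scale
    mu = beta_prime @ W_mu + b_mu             # (B, D) per-channel shift
    return sigma[:, None, :] * layer_norm(x) + mu[:, None, :]

# Toy usage: 2 images, 5 tokens each, 8 channels, 4-dim expression vector.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
beta_prime = rng.standard_normal((2, 4))
W_sigma, W_mu = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
out = eadaln(x, beta_prime, W_sigma, np.ones(8), W_mu, np.zeros(8))
```

With zero weights, a bias of one for the scale, and a bias of zero for the shift, the layer reduces to plain layer normalization, so the expression vector acts purely as a modulation on top of the source tokens.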

C. Volume Rendering and Super-Resolution

Rendering: The generated tri-plane is rendered into a 2D RGB image using differentiable volume rendering (NeRF-style) based on the driving camera parameters ( $p_D$ ). This ensures 3D consistency across different views.
Super-Resolution: To handle computational costs, the model first renders a low-resolution image ( $H/4 \times W/4$ ) and then applies a super-resolution module (using convolutional blocks rather than style modulation) to reach the target resolution.
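The NeRF-style rendering step boils down to alpha compositing of samples along each camera ray. The following sketch shows the standard quadrature for a single ray; the sample counts, spacing, and the way densities and colors are obtained from the tri-plane are left out and would be assumptions in any case.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Standard NeRF-style compositing for one ray.

    densities: (S,) non-negative density at each sample
    colors:    (S, 3) RGB at each sample
    deltas:    (S,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)           # opacity of each segment
    # Transmittance: fraction of light that survives all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                             # contribution per sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy usage: 4 samples along one ray, evenly spaced.
densities = np.array([0.0, 0.5, 2.0, 4.0])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
rgb, weights = composite_ray(densities, colors, np.full(4, 0.1))
```

Rendering at $H/4 \times W/4$ means this loop runs over 16 times fewer rays than full resolution, which is why a separate super-resolution stage is attractive.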

3. Key Contributions

Export3D Framework: A one-shot 3D-aware portrait animation method that explicitly controls facial expression and camera view without requiring paired training data or complex optimization.
CLeBS (Contrastive Learned Basis Scaling): A novel pre-training framework that distills appearance-free expression representations from 3DMM parameters using contrastive learning and an orthogonal basis structure. This effectively eliminates the "appearance swap" problem in cross-identity transfer.
EAdaLN Mechanism: An expression conditioning method using adaptive layer normalization within a ViT-based generator, which outperforms standard cross-attention in preserving identity while transferring expressions.
Direct Tri-plane Generation: Unlike methods that deform existing 3D representations, this method directly generates the 3D prior (tri-plane) conditioned on expression, avoiding deformation-induced artifacts like flickering.

4. Experimental Results

The model was evaluated on the VFHQ and TalkingHead-1KH datasets, comparing against state-of-the-art 2D (DPE, StyleHEAT) and 3D (HiDe-NeRF, OTAvatar, ROME) methods.

Quantitative Performance:
- Cross-Identity Transfer: Export3D achieved a strong combination of CSIM (Identity Preservation, higher is better) and AED (Expression Distance, lower is better) relative to competitors. For example, on VFHQ cross-identity, it achieved a CSIM of 0.694 and an AED of 0.208. HiDe-NeRF reached a slightly higher CSIM (0.707) but a worse AED (0.255), while DPE had a lower CSIM (0.586).
- Same-Identity Transfer: It achieved the highest PSNR (23.555) and lowest AKD (3.453), indicating superior structural fidelity.
Qualitative Analysis:
- Appearance Swap: Visual results show that while competitors like DPE and HiDe-NeRF often change the source's eye shape or face contour when transferring expressions, Export3D preserves the source identity much more faithfully.
- Artifacts: The method produces videos without the flickering or temporal inconsistencies seen in deformation-field-based methods.
- Ablation Studies: Removing CLeBS (using raw 3DMM parameters) resulted in significant appearance changes. Replacing EAdaLN with Cross-Attention degraded expression control accuracy.
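For readers unfamiliar with the metrics quoted above, two of them are simple to state in code. PSNR is computed from the mean squared error between a generated frame and the ground truth, and CSIM is the cosine similarity between identity embeddings of two faces. The sketch assumes images scaled to [0, 1] and treats the embeddings as given; the face-recognition network that would produce them is not specified in this summary.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    mse = np.mean((np.asarray(pred, float) - np.asarray(target, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def csim(emb_a, emb_b):
    """Cosine similarity between two identity embeddings (assumed precomputed)."""
    a, b = np.asarray(emb_a, float), np.asarray(emb_b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(64, 64, 3))
pred = np.clip(target + 0.05 * rng.standard_normal(target.shape), 0.0, 1.0)
score_psnr = psnr(pred, target)
score_csim = csim(rng.standard_normal(512), rng.standard_normal(512))
```

PSNR applies only to same-identity reconstruction, where a ground-truth frame exists; in cross-identity transfer there is no ground truth, which is why identity and expression are scored separately there.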

5. Significance and Impact

Solving the Entanglement Problem: The paper makes a significant theoretical and practical contribution by demonstrating that 3DMM expression parameters can be disentangled from identity using contrastive learning and orthogonal basis scaling. This is a crucial step for reliable cross-identity animation.
Robust 3D Animation: By moving away from 2D warping and point-wise deformation fields toward direct tri-plane generation, the method achieves higher temporal stability and 3D consistency.
Applications: The technology is highly relevant for virtual avatars, video conferencing, and film dubbing, where maintaining the user's identity while adopting natural expressions from a driver is critical.
Limitations: The authors acknowledge that the method currently cannot separate foreground/background (rendering them as a whole) and cannot control non-facial body parts or eye gaze, as these are outside the scope of standard 3DMM parameters.

In summary, Export3D represents a state-of-the-art approach to portrait animation by effectively decoupling expression from identity through contrastive pre-training and leveraging a hybrid ViT-Conv generator for high-fidelity, 3D-consistent synthesis.