Imagine you have a home video of a friend talking to the camera. You love the video, but you wish you could move the camera around them—zoom in for a dramatic close-up, pan around to see their profile, or pull back to show their whole outfit—without actually having a second camera or a film crew.
That's exactly what FaceCam does. It's a new AI system that lets you take a single, ordinary video of a person and "re-film" it from any angle you want, as if you were holding a camera that can magically float around them.
Here is how it works, broken down with some simple analogies:
1. The Problem: The "Scale Ambiguity" Trap
Most previous AI tools that try to move a camera get confused by a simple question: "Is the person getting bigger because the camera moved closer, or because they actually grew?"
In the real world, a person looks bigger when you walk closer to them, and they would also look bigger if they physically grew. In a flat video, those two explanations can produce exactly the same pixels, so the computer has no way to separate how big someone is from how far away they are. This is called scale ambiguity.
- The Analogy: Imagine showing the computer a single flat photo of a person. If you tell it, "Move the camera 1 meter closer," it has no way of knowing whether the person is standing 1 meter away or 100 meters away, so it can't tell how dramatic that move should look. It might accidentally stretch their face like a rubber band or shrink the background into a tiny dot. This is what causes the "geometric distortions" and weird artifacts mentioned in the paper. The tiny numerical sketch below shows how badly the same instruction can misfire.
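To make that concrete, here is a toy Python sketch built on the standard pinhole-camera relationship (pixel size = focal length * object size / distance). It is a generic illustration of the problem, not code from the paper, and the focal length and sizes are made-up numbers:

```python
# Toy pinhole-camera demo of scale ambiguity (illustration only, not from the paper).
f = 1000.0  # focal length in pixels (made-up value)

def projected_height(object_height_m, distance_m, focal_px=f):
    """How tall the object appears in the image, in pixels."""
    return focal_px * object_height_m / distance_m

# Two very different scenes that produce the *same* picture:
near_small = projected_height(object_height_m=0.25, distance_m=2.0)    # normal head, 2 m away
far_large  = projected_height(object_height_m=25.0, distance_m=200.0)  # giant head, 200 m away
print(near_small, far_large)  # both 125.0 px -> indistinguishable in a flat video

# Now apply the same instruction, "move the camera 1 meter closer":
print(projected_height(0.25, 1.0))    # 250.0 px  (the face doubles in size)
print(projected_height(25.0, 199.0))  # ~125.6 px (barely changes at all)
# Same command, wildly different outcomes. A model that has to guess the distance
# will sometimes guess wrong, which is where the stretched-face artifacts come from.
```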
2. The Solution: The "Face Map" Trick
The authors of FaceCam realized that instead of giving the AI complex math coordinates (like "move 3 meters left"), they should give it something the AI can actually see: Facial Landmarks.
- The Analogy: Think of a puppeteer. Instead of telling the puppeteer, "Move your hand 5 inches to the right," you just show them a drawing of where the puppet's eyes and nose should be.
- How FaceCam does it:
- It looks at the first frame of your video and finds the "dots" on the face (eyes, nose, mouth corners).
- It asks you: "Where do you want the camera to be?"
- It draws a new set of dots on a blank screen to show what the face would look like from that new angle.
- It tells the AI: "Make the video look like the face matches these new dots."
Because the dots are drawn directly on the screen (in pixels), the AI doesn't have to guess about distance or size. It just knows, "Okay, the nose needs to be here." That sidesteps the scale ambiguity problem.
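Here is a minimal Python sketch of that idea. It assumes you already have rough 3D landmark positions from some face tracker; the function names, the projection math, and the canvas size are illustrative stand-ins, not the paper's actual pipeline:

```python
# Sketch of "draw the dots where the face should be from the new camera angle".
import numpy as np

def project_landmarks(landmarks_3d, K, R, t):
    """Project Nx3 world-space landmarks into a target camera.
    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation."""
    cam = (R @ landmarks_3d.T).T + t      # world coordinates -> camera coordinates
    uv = (K @ cam.T).T                    # camera coordinates -> image plane
    return uv[:, :2] / uv[:, 2:3]         # perspective divide -> pixel coordinates

def rasterize_dots(points_2d, height=512, width=512, radius=2):
    """Draw each landmark as a small white dot on a black canvas: the pixel-space
    face map the video model is asked to match."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    for x, y in points_2d:
        x, y = int(round(x)), int(round(y))
        if 0 <= x < width and 0 <= y < height:
            canvas[max(0, y - radius):y + radius + 1,
                   max(0, x - radius):x + radius + 1] = 255
    return canvas

# Toy usage: three made-up landmarks and a camera looking straight at the face.
landmarks_3d = np.array([[0.00,  0.00, 2.0],   # nose tip
                         [-0.03, -0.03, 2.0],  # left eye
                         [0.03,  -0.03, 2.0]]) # right eye
K = np.array([[800.0, 0.0, 256.0],
              [0.0, 800.0, 256.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
condition_map = rasterize_dots(project_landmarks(landmarks_3d, K, R, t))
```

The point of the sketch is that the conditioning signal lives entirely in pixel space: asking for a new camera angle just means redrawing the dots somewhere else on the canvas, never asking the model to reason about meters.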
3. The Training: Learning from a Studio and the Wild
To teach the AI how to do this, they needed a lot of practice data.
- The Studio Data: They used a dataset called NeRSemble, where people were filmed by 16 cameras at once in a studio. This is like having a "perfect" training set where the AI can see the same person from every angle simultaneously.
- The Problem: In the studio, the cameras never moved; they just sat there. But in real life, we want the camera to glide smoothly.
- The Hack: The researchers created a clever training trick called "Multi-shot Stitching."
- The Analogy: Imagine you have 100 photos of a person taken from different angles. You cut them up and tape them together into a single video. To the AI, it looks like the camera is jumping around wildly (a rough sketch of this stitching recipe follows this list).
- The Magic: Even though the AI was only trained on these "jumping" videos, when you ask it to make a smooth video later, it figures out how to connect the dots smoothly. It's like learning to drive by practicing on a bumpy dirt road; when you get on a smooth highway, you drive even better.
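For the curious, here is a rough Python sketch of what one stitching step might look like, assuming the studio capture is organized as frames[camera][time]. The shot length and sampling scheme are guesses at the general recipe, not the paper's exact procedure:

```python
# Rough sketch of multi-shot stitching: build "jumping camera" training clips
# out of a static multi-view capture (illustrative recipe, not the paper's code).
import random

def stitch_training_clip(frames, num_shots=4, shot_length=8, seed=None):
    """frames[camera][time] -> one pseudo-video that cuts to a new random camera
    every shot_length frames, plus the camera index used for each frame."""
    rng = random.Random(seed)
    num_cameras, num_frames = len(frames), len(frames[0])
    assert num_frames >= num_shots * shot_length, "capture too short for this clip"
    start = rng.randrange(num_frames - num_shots * shot_length + 1)
    clip, camera_ids = [], []
    for shot in range(num_shots):
        cam = rng.randrange(num_cameras)           # "teleport" to a new viewpoint
        for t in range(start + shot * shot_length,
                       start + (shot + 1) * shot_length):
            clip.append(frames[cam][t])            # time keeps flowing forward
            camera_ids.append(cam)                 # which view this frame came from
    return clip, camera_ids
```

The camera_ids list is what lets you build the matching "face map" condition for every frame, so the model learns to associate a jump in the dots with a jump in viewpoint.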
4. The Result: A Magic Director
When you use FaceCam, you get a video that:
- Keeps the person real: Their face doesn't melt, their hair still flows naturally, and their identity stays the same.
- Follows your instructions: If you say "Zoom in," it zooms in. If you say "Pan left," it pans left.
- Handles the background: If you move the camera to a new angle, the AI invents a realistic background and clothing details that weren't visible in the original video (like seeing the back of a jacket).
In a Nutshell
FaceCam is like giving you a virtual film crew. You upload a selfie video, and the AI acts as a director who knows exactly how to move the camera around your subject without ever losing the subject's face or making them look like a distorted cartoon. It does this by translating camera moves into something the AI can actually see, the simple, reliable "dots" of a human face, instead of feeding it raw 3D coordinates.