WildActor: Unconstrained Identity-Preserving Video Generation

This paper introduces WildActor, a framework for unconstrained identity-preserving human video generation that leverages the large-scale Actor-18M dataset and novel attention mechanisms to overcome existing limitations in maintaining consistent full-body identities across dynamic shots, viewpoints, and motions.

Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu

Published 2026-03-10

Imagine you are a movie director. You have a star actor, let's call him "Alex." You want to film a scene where Alex walks through a forest, climbs a mountain, and then turns around to face the camera.

In the real world, Alex is the same person throughout. His face, his clothes, and his body shape stay consistent, no matter how the camera moves.

But in the world of AI video generation, this is a nightmare. Current AI models often suffer from two main problems:

  1. The "Floating Head" Effect: The AI gets the face right, but the body looks like a hallucination or changes clothes randomly.
  2. The "Mannequin" Effect: The AI locks the character into a stiff pose. If you ask the character to turn around, the AI just pastes the same image over and over, or the character freezes because it's afraid to move.

Enter WildActor: a revolutionary new director paired with a massive library of reference photos, built to solve both of these problems.

Here is a simple breakdown of how it works, using everyday analogies:

1. The Massive Photo Library (Actor-18M)

To teach an AI to be a good actor, you need to show it thousands of examples of that actor from every possible angle.

  • The Problem: Most existing AI datasets are like a photo album where everyone is only taking selfies. The AI only knows what a person looks like from the front. When asked to show the back of the person, it guesses and gets it wrong.
  • The Solution: The creators built Actor-18M, a massive library containing 1.6 million videos and 18 million photos.
  • The Analogy: Imagine you are trying to teach a robot to draw your friend. Instead of just showing it one selfie, you give it a 360-degree photo booth session, plus photos of your friend in the rain, in the sun, wearing a hat, and climbing a ladder. This dataset includes "canonical" views (front, side, back) and "wild" views (any crazy angle). This teaches the AI that "Alex" is the same person whether he is facing left, right, or upside down.

2. The "One-Way Street" Attention (Asymmetric Identity-Preserving Attention)

The AI model needs to mix two things: The Story (what the character is doing) and The Identity (who the character is).

  • The Problem: Usually, AI models treat these two things equally. It's like trying to have a conversation where both people shout at the same time. The "Identity" (the static photo) gets so loud that it drowns out the "Story" (the movement), causing the "Mannequin Effect" where the character can't move.
  • The Solution: WildActor uses a special mechanism called Asymmetric Attention.
  • The Analogy: Imagine a director (the video generation) and a costume designer (the identity reference).
    • In old models, the costume designer would scream, "WEAR THIS HAT!" and the director would freeze, unable to move the actor.
    • In WildActor, the relationship is a one-way street. The costume designer whispers the details of the face and clothes into the director's ear. The director listens carefully to keep the look consistent, but the director is free to move the actor around the set. The "static" details don't block the "dynamic" action.
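
The one-way street can be written down as nothing more than an attention mask: video tokens are allowed to read from the identity reference, but the reverse direction is blocked. Below is a minimal NumPy sketch of that masking idea, not the paper's actual architecture (which lives inside a diffusion transformer); the function names and the toy Q = K = V setup are assumptions made for brevity.

```python
import numpy as np

def softmax(scores, axis=-1):
    e = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(video, ref):
    """Joint attention over video + reference tokens where video tokens may
    attend to reference tokens, but reference tokens attend only among
    themselves (the 'one-way street')."""
    x = np.concatenate([video, ref], axis=0)     # (Tv + Tr, d)
    Tv = len(video)
    scores = x @ x.T / np.sqrt(x.shape[1])       # toy setup: Q = K = V = x
    scores[Tv:, :Tv] = -np.inf                   # block the ref -> video direction
    out = softmax(scores) @ x
    return out[:Tv], out[Tv:]                    # updated video, updated ref
```

Because the reference rows can never see the video columns, the reference tokens' output is identical no matter what the video is doing: the costume designer whispers to the director, never the other way around.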

3. The "Smart Camera Crew" (Viewpoint-Adaptive Sampling)

When training the AI, you have to pick which photos to show it.

  • The Problem: If you randomly pick photos, you might accidentally pick 10 photos of the person's face from the front and zero photos from the back. The AI learns that "front" is the only way a person exists.
  • The Solution: They use a Viewpoint-Adaptive Monte Carlo Sampling strategy.
  • The Analogy: Imagine a camera crew filming a training session. If the camera keeps filming the actor from the front, the actor gets bored and the crew gets lazy.
    • WildActor's "Smart Camera Crew" has a rule: "If we just filmed the actor from the front, we must move to the side or back for the next shot."
    • It actively avoids taking too many similar photos. It forces the AI to learn how the character looks from every angle, ensuring the character doesn't look weird when they turn around.
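
The "don't repeat the last angle" rule can be sketched as a weighted draw in which each viewpoint bin's probability shrinks every time it gets picked. This is only an illustration of the balancing intuition, not the paper's actual Monte Carlo procedure; the three bin names, the front-heavy pool, and the 1/(1 + count) weighting are all assumptions.

```python
import random
from collections import Counter

def viewpoint_adaptive_sample(pool, counts, rng):
    """Draw the next (viewpoint, frame) pair, down-weighting viewpoints
    that have already been sampled often."""
    views = list(pool)
    weights = [1.0 / (1 + counts[v]) for v in views]
    view = rng.choices(views, weights=weights, k=1)[0]
    counts[view] += 1
    return view, rng.choice(pool[view])

# A pool that is heavily front-biased, like a dataset full of selfies.
pool = {
    "front": [f"front_{i}" for i in range(1000)],
    "side": [f"side_{i}" for i in range(100)],
    "back": [f"back_{i}" for i in range(100)],
}
rng = random.Random(0)
counts = Counter()
for _ in range(3000):
    viewpoint_adaptive_sample(pool, counts, rng)
```

Uniform sampling over individual frames would pick "front" about 83% of the time here; the adaptive weights instead keep the three viewpoints near parity (roughly 1000 draws each), which is exactly the "smart camera crew" behavior.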

4. The "Name Tags" (I-RoPE)

Inside the AI's brain, everything is broken down into tiny pieces called "tokens."

  • The Problem: The AI gets confused between "moving video parts" and "static reference parts." It's like a librarian who shelves every book without labels, so no one can tell which is a movie script and which is a photo album.
  • The Solution: They use I-RoPE (Identity-Aware 3D Rotary Positional Encoding).
  • The Analogy: This is like giving the "Identity" photos a special Name Tag or a different colored shelf. The AI now knows: "Oh, this token is the static face (don't move it), and this token is the walking leg (move it!)." This prevents the AI from mixing up the character's face with their motion.
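
The name-tag trick can be illustrated with ordinary rotary positional encoding (RoPE): each video token is rotated by an angle derived from its frame index, while identity tokens get a reserved index far outside the clip's timeline, so their "tag" can never collide with a real frame. This is a simplified 1D sketch; the paper's I-RoPE is a 3D (time, height, width) variant, and the reserved index 1000 here is an arbitrary assumption.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary positional encoding: rotate feature pairs by position-dependent
    angles. x has shape (tokens, d) with even d."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(positions)[:, None] * freqs[None, :]   # (tokens, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

T = 16  # frames in the clip
tokens = np.random.default_rng(0).normal(size=(T + 4, 8))
video = rope(tokens[:T], np.arange(T))        # frames tagged 0..15
ident = rope(tokens[T:], np.full(4, 1000))    # identity tokens share one reserved slot
encoded = np.concatenate([video, ident])
```

Downstream attention can then tell "frame 7 of the clip" apart from "the static reference" purely from the rotation each token carries, with no extra label channel needed.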

The Result

When you put all these pieces together, WildActor can generate videos where:

  • A character can walk, run, and turn around.
  • The camera can zoom in, fly around, or switch angles.
  • The character's face, clothes, and body shape remain perfectly consistent throughout the entire video.

It's like having a digital actor who never forgets their lines, never loses their costume, and can perform any stunt you ask, no matter how crazy the camera moves. This is a huge step toward making AI-generated movies that feel truly real and professional.