WildActor: Unconstrained Identity-Preserving Video Generation

This paper introduces WildActor, a framework for unconstrained identity-preserving human video generation that leverages the large-scale Actor-18M dataset and novel attention mechanisms to overcome existing limitations in maintaining consistent full-body identities across dynamic shots, viewpoints, and motions.

Qin Guo, Tianyu Yang, Xuanhua He, Fei Shen, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Dan Xu

Published 2026-03-10

Imagine you are a movie director. You have a star actor, let's call him "Alex." You want to film a scene where Alex walks through a forest, climbs a mountain, and then turns around to face the camera.

In the real world, Alex is the same person throughout. His face, his clothes, and his body shape stay consistent, no matter how the camera moves.

But in the world of AI video generation, this is a nightmare. Current AI models often suffer from two main problems:

  1. The "Floating Head" Effect: The AI gets the face right, but the body looks like a hallucination or changes clothes randomly.
  2. The "Mannequin" Effect: The AI locks the character into a stiff pose. If you ask the character to turn around, the AI just pastes the same image over and over, or the character freezes because it's afraid to move.

Enter WildActor: a revolutionary new director paired with a massive library of reference photos, built to solve both of these problems.

Here is a simple breakdown of how it works, using everyday analogies:

1. The Massive Photo Library (Actor-18M)

To teach an AI to be a good actor, you need to show it thousands of examples of that actor from every possible angle.

  • The Problem: Most existing AI datasets are like a photo album where everyone is only taking selfies. The AI only knows what a person looks like from the front. When asked to show the back of the person, it guesses and gets it wrong.
  • The Solution: The creators built Actor-18M, a massive library containing 1.6 million videos and 18 million photos.
  • The Analogy: Imagine you are trying to teach a robot to draw your friend. Instead of just showing it one selfie, you give it a 360-degree photo booth session, plus photos of your friend in the rain, in the sun, wearing a hat, and climbing a ladder. This dataset includes "canonical" views (front, side, back) and "wild" views (any crazy angle). This teaches the AI that "Alex" is the same person whether he is facing left, right, or upside down.

2. The "One-Way Street" Attention (Asymmetric Identity-Preserving Attention)

The AI model needs to mix two things: The Story (what the character is doing) and The Identity (who the character is).

  • The Problem: Usually, AI models treat these two things equally. It's like trying to have a conversation where both people shout at the same time. The "Identity" (the static photo) gets so loud that it drowns out the "Story" (the movement), causing the "Mannequin Effect" where the character can't move.
  • The Solution: WildActor uses a special mechanism called Asymmetric Attention.
  • The Analogy: Imagine a director (the video generation) and a costume designer (the identity reference).
    • In old models, the costume designer would scream, "WEAR THIS HAT!" and the director would freeze, unable to move the actor.
    • In WildActor, the relationship is a one-way street. The costume designer whispers the details of the face and clothes into the director's ear. The director listens carefully to keep the look consistent, but the director is free to move the actor around the set. The "static" details don't block the "dynamic" action.
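
The one-way street can be written down as nothing more than an attention mask: video tokens are allowed to read from the identity reference, but the reverse direction is blocked. Below is a minimal NumPy sketch of that masking idea, not the paper's actual architecture (which lives inside a diffusion transformer); the function names and the toy Q = K = V setup are assumptions made for brevity.

```python
import numpy as np

def softmax(scores, axis=-1):
    e = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(video, ref):
    """Joint attention over video + reference tokens where video tokens may
    attend to reference tokens, but reference tokens attend only among
    themselves (the 'one-way street')."""
    x = np.concatenate([video, ref], axis=0)     # (Tv + Tr, d)
    Tv = len(video)
    scores = x @ x.T / np.sqrt(x.shape[1])       # toy setup: Q = K = V = x
    scores[Tv:, :Tv] = -np.inf                   # block the ref -> video direction
    out = softmax(scores) @ x
    return out[:Tv], out[Tv:]                    # updated video, updated ref
```

Because the reference rows can never see the video columns, the reference tokens' output is identical no matter what the video is doing: the costume designer whispers to the director, never the other way around.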

3. The "Smart Camera Crew" (Viewpoint-Adaptive Sampling)

When training the AI, you have to pick which photos to show it.

  • The Problem: If you randomly pick photos, you might accidentally pick 10 photos of the person's face from the front and zero photos from the back. The AI learns that "front" is the only way a person exists.
  • The Solution: They use a Viewpoint-Adaptive Monte Carlo Sampling strategy.
  • The Analogy: Imagine a camera crew filming a training session. If the camera keeps filming the actor from the front, the actor gets bored and the crew gets lazy.
    • WildActor's "Smart Camera Crew" has a rule: "If we just filmed the actor from the front, we must move to the side or back for the next shot."
    • It actively avoids taking too many similar photos. It forces the AI to learn how the character looks from every angle, ensuring the character doesn't look weird when they turn around.
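
The "don't repeat the last angle" rule can be sketched as a weighted draw in which each viewpoint bin's probability shrinks every time it gets picked. This is only an illustration of the balancing intuition, not the paper's actual Monte Carlo procedure; the three bin names, the front-heavy pool, and the 1/(1 + count) weighting are all assumptions.

```python
import random
from collections import Counter

def viewpoint_adaptive_sample(pool, counts, rng):
    """Draw the next (viewpoint, frame) pair, down-weighting viewpoints
    that have already been sampled often."""
    views = list(pool)
    weights = [1.0 / (1 + counts[v]) for v in views]
    view = rng.choices(views, weights=weights, k=1)[0]
    counts[view] += 1
    return view, rng.choice(pool[view])

# A pool that is heavily front-biased, like a dataset full of selfies.
pool = {
    "front": [f"front_{i}" for i in range(1000)],
    "side": [f"side_{i}" for i in range(100)],
    "back": [f"back_{i}" for i in range(100)],
}
rng = random.Random(0)
counts = Counter()
for _ in range(3000):
    viewpoint_adaptive_sample(pool, counts, rng)
```

Uniform sampling over individual frames would pick "front" about 83% of the time here; the adaptive weights instead keep the three viewpoints near parity (roughly 1000 draws each), which is exactly the "smart camera crew" behavior.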

4. The "Name Tags" (I-RoPE)

Inside the AI's brain, everything is broken down into tiny pieces called "tokens."

  • The Problem: The AI gets confused between "moving video parts" and "static reference parts." It's like a librarian who shelves every book without labels, so no one can tell which is a movie script and which is a photo album.
  • The Solution: They use I-RoPE (Identity-Aware 3D Rotary Positional Encoding).
  • The Analogy: This is like giving the "Identity" photos a special Name Tag or a different colored shelf. The AI now knows: "Oh, this token is the static face (don't move it), and this token is the walking leg (move it!)." This prevents the AI from mixing up the character's face with their motion.
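
The name-tag trick can be illustrated with ordinary rotary positional encoding (RoPE): each video token is rotated by an angle derived from its frame index, while identity tokens get a reserved index far outside the clip's timeline, so their "tag" can never collide with a real frame. This is a simplified 1D sketch; the paper's I-RoPE is a 3D (time, height, width) variant, and the reserved index 1000 here is an arbitrary assumption.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary positional encoding: rotate feature pairs by position-dependent
    angles. x has shape (tokens, d) with even d."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(positions)[:, None] * freqs[None, :]   # (tokens, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

T = 16  # frames in the clip
tokens = np.random.default_rng(0).normal(size=(T + 4, 8))
video = rope(tokens[:T], np.arange(T))        # frames tagged 0..15
ident = rope(tokens[T:], np.full(4, 1000))    # identity tokens share one reserved slot
encoded = np.concatenate([video, ident])
```

Downstream attention can then tell "frame 7 of the clip" apart from "the static reference" purely from the rotation each token carries, with no extra label channel needed.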

The Result

When you put all these pieces together, WildActor can generate videos where:

  • A character can walk, run, and turn around.
  • The camera can zoom in, fly around, or switch angles.
  • The character's face, clothes, and body shape remain perfectly consistent throughout the entire video.

It's like having a digital actor who never forgets their lines, never loses their costume, and can perform any stunt you ask, no matter how crazy the camera moves. This is a huge step toward making AI-generated movies that feel truly real and professional.