MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

MoSA is a framework that decouples human video generation into separate structure and appearance components: a 3D structure transformer first plans the motion, and specialized constraints then guide appearance synthesis, achieving superior motion coherence and more realistic human-environment interactions than existing models.

Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang

Published 2026-02-25

Imagine you are trying to teach a robot to draw a video of a person running. If you just tell the robot, "Draw a person running," it might get the clothes and the face perfect, but the legs might twist backward, the arms might disappear, or the person might float through a wall. This is because current AI video models are like artists who are great at painting textures but terrible at understanding physics. They focus on making things look pretty (appearance) but forget how the human body actually moves (structure).

The paper introduces MoSA, a new system that fixes this by splitting the job into two specialized teams, much like a movie production crew.

The Core Idea: The "Skeleton" and the "Skin"

MoSA uses a strategy called Structure-Appearance Decoupling. Think of it like building a house:

  1. The Frame (Structure): First, you build the wooden skeleton of the house. It doesn't have paint or windows yet, but it ensures the house stands up straight and the rooms are in the right place.
  2. The Paint (Appearance): Once the frame is solid, you paint the walls, add the windows, and put up the curtains.

Most AI tries to do both at once, which leads to wobbly, impossible houses. MoSA does them separately to ensure the "house" (the human) is physically possible before adding the "paint."

How MoSA Works: The Three Magic Tools

1. The 3D Architect (Structure Generation)

Instead of guessing how a person moves, MoSA first asks a specialized AI architect to build a 3D skeleton based on your text prompt (e.g., "A girl running up stairs").

  • The Analogy: Imagine a puppeteer building a wireframe puppet in 3D space before the show starts.
  • Why 3D? If you only draw a 2D stick figure, it's hard to know if an arm is in front of or behind a body. By building it in 3D first, the AI understands depth. If a leg is hidden behind a tree, the 3D architect knows it's still there, just occluded, preventing the AI from "erasing" the leg or making it pass through the tree.
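The depth reasoning above can be sketched with a toy pinhole projection. This is an illustrative stand-in, not MoSA's actual code: the function name, camera parameters, and joint layout are all assumptions. The key point is that after projecting 3D joints to 2D, each joint keeps its depth, so an occluded joint is "behind something" rather than gone.

```python
import numpy as np

def project_joints(joints_3d, focal=1000.0, center=(256.0, 256.0)):
    """Project 3D skeleton joints (N, 3) into 2D pixel coordinates with a
    simple pinhole camera. Hypothetical sketch, not MoSA's implementation.
    Depth (z) is returned so the generator can order joints front-to-back."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1), z  # 2D positions + per-joint depth

# Two joints on the same camera ray: they land at different 2D positions
# only because of depth, and the farther one is known to be occluded.
joints = np.array([[0.1, -0.2, 2.0],    # nearer joint (visible)
                   [0.1, -0.2, 3.5]])   # farther joint (occluded)
uv, depth = project_joints(joints)
assert depth[1] > depth[0]  # second joint is behind the first
```

Because the occluded joint still exists in the projected representation (just with larger depth), a downstream generator has no excuse to erase it or let it pass through the occluder.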

2. The Dynamic Spotlight (Human-Aware Control)

Once the 3D skeleton is ready, it's projected onto a 2D video. But a stick figure is too simple to guide a realistic video; it's like trying to direct a movie using only a rough sketch.

  • The Analogy: Imagine a stage director with a smart spotlight. The director doesn't just shine a light on the whole stage; they use a "dynamic control" system to shine a bright, focused beam exactly where the actor's hands and feet are moving, telling the video generator, "Pay attention here, this is where the action is."
  • The Result: This ensures the AI pays extra attention to the moving body parts, making the motion smooth and detailed, rather than just blurring them out.
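The "smart spotlight" idea can be illustrated as a per-pixel weight map that peaks around fast-moving joints. This is a hypothetical sketch of the general technique, not MoSA's actual control module; the function name, Gaussian shape, and velocity weighting are all assumptions made for illustration.

```python
import numpy as np

def motion_weight_map(joints_2d, velocities, h=64, w=64, sigma=4.0):
    """Build a per-pixel weight map that is highest around fast-moving
    joints -- an illustrative stand-in for human-aware dynamic control,
    not MoSA's implementation. `joints_2d` is (N, 2) pixel coordinates;
    `velocities` is (N,) joint speeds."""
    ys, xs = np.mgrid[0:h, 0:w]
    weights = np.ones((h, w))  # baseline attention everywhere
    for (jx, jy), speed in zip(joints_2d, velocities):
        # Gaussian "spotlight" centered on the joint, scaled by its speed
        bump = np.exp(-((xs - jx) ** 2 + (ys - jy) ** 2) / (2 * sigma ** 2))
        weights += speed * bump
    return weights

wmap = motion_weight_map(np.array([[16.0, 16.0], [48.0, 48.0]]),
                         velocities=np.array([0.2, 3.0]))
# The fast-moving joint at (48, 48) gets a much stronger spotlight
# than the nearly static joint at (16, 16).
assert wmap[48, 48] > wmap[16, 16]
```

A map like this could then scale attention or reconstruction loss, so the generator spends its capacity on the hands and feet in motion instead of smearing them.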

3. The Gravity & Contact Check (Contact Constraint)

One of the biggest problems with AI videos is "ghosting"—where a person walks through a wall or their foot sinks into the floor.

  • The Analogy: MoSA adds a physics teacher to the team. Before the video is finalized, this teacher checks: "Is the foot touching the ground? Is the hand hitting the ball?" If the AI tries to make a person walk through a wall, the teacher slams the brakes and says, "No, that's impossible!"
  • The Result: The person interacts with the environment realistically, like feet pressing into grass or hands grabbing a railing.
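A contact constraint of this kind can be sketched as a simple penalty term. This is a toy version for illustration only: the function name, the contact labels, and the exact penalty form are assumptions, not MoSA's published loss. It penalizes (a) a foot labeled "in contact" that hovers off the ground, and (b) any foot sinking below the floor.

```python
import numpy as np

def contact_loss(foot_heights, contact_labels, ground=0.0):
    """Toy foot-ground contact penalty (illustrative, not MoSA's exact
    loss). `foot_heights` is the vertical position of a foot joint per
    frame; `contact_labels` is 1.0 when the foot should be planted."""
    hover = contact_labels * np.abs(foot_heights - ground)  # planted foot must touch
    sink = np.maximum(ground - foot_heights, 0.0)           # nothing goes underground
    return float(np.mean(hover + sink))

# A foot labeled "in contact" but floating at height 0.3 is penalized;
# a swinging foot at the same height with no contact label is not.
planted = contact_loss(np.array([0.3]), np.array([1.0]))
swinging = contact_loss(np.array([0.3]), np.array([0.0]))
assert planted > swinging == 0.0
```

During training, minimizing a term like this pushes the model toward frames where planted feet actually meet the ground and nothing pokes through it.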

The New "Gym" for AI (The MoVid Dataset)

To train this system, the researchers realized existing video datasets were like a gym with only treadmills. They mostly had videos of people standing still, talking, or doing simple dance moves. They lacked videos of people running, jumping, or doing complex sports.

  • The Solution: They built MoVid, a massive new dataset with 30,000 videos of complex, whole-body movements. It's like upgrading the gym to include a full obstacle course, a climbing wall, and a trampoline. This allows the AI to learn how humans actually move in the real world.

The Final Result

When you put it all together, MoSA is like a director who hires a structural engineer, a lighting specialist, and a physics consultant before the cameras start rolling.

  • Old AI: "Here is a video of a person running. Oh, look, their legs are melting into the ground, but the shirt looks great!"
  • MoSA: "Here is a video of a person running. The legs are moving correctly, the feet hit the ground, the arms swing naturally, and the shirt looks great."

In short, MoSA stops the AI from just "guessing" what a human looks like and starts teaching it how a human actually works, resulting in videos that are not just pretty, but physically believable.
