MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

MoSA is a framework that decouples human video generation into separate structure and appearance components: a 3D structure transformer first plans the motion, and specialized constraints then guide appearance synthesis, achieving superior motion coherence and more realistic human-environment interactions than existing models.

Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang

Published 2026-02-25

Imagine you are trying to teach a robot to draw a video of a person running. If you just tell the robot, "Draw a person running," it might get the clothes and the face perfect, but the legs might twist backward, the arms might disappear, or the person might float through a wall. This is because current AI video models are like artists who are great at painting textures but terrible at understanding physics. They focus on making things look pretty (appearance) but forget how the human body actually moves (structure).

The paper introduces MoSA, a new system that fixes this by splitting the job into two specialized teams, much like a movie production crew.

The Core Idea: The "Skeleton" and the "Skin"

MoSA uses a strategy called Structure-Appearance Decoupling. Think of it like building a house:

  1. The Frame (Structure): First, you build the wooden skeleton of the house. It doesn't have paint or windows yet, but it ensures the house stands up straight and the rooms are in the right place.
  2. The Paint (Appearance): Once the frame is solid, you paint the walls, add the windows, and put up the curtains.

Most AI tries to do both at once, which leads to wobbly, impossible houses. MoSA does them separately to ensure the "house" (the human) is physically possible before adding the "paint."

How MoSA Works: The Three Magic Tools

1. The 3D Architect (Structure Generation)

Instead of guessing how a person moves, MoSA first asks a specialized AI architect to build a 3D skeleton based on your text prompt (e.g., "A girl running up stairs").

  • The Analogy: Imagine a puppeteer building a wireframe puppet in 3D space before the show starts.
  • Why 3D? If you only draw a 2D stick figure, it's hard to know if an arm is in front of or behind a body. By building it in 3D first, the AI understands depth. If a leg is hidden behind a tree, the 3D architect knows it's still there, just occluded, preventing the AI from "erasing" the leg or making it pass through the tree.
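The depth reasoning above can be sketched with a toy pinhole projection. This is an illustrative stand-in, not MoSA's actual code: the function name, camera parameters, and joint layout are all assumptions. The key point is that after projecting 3D joints to 2D, each joint keeps its depth, so an occluded joint is "behind something" rather than gone.

```python
import numpy as np

def project_joints(joints_3d, focal=1000.0, center=(256.0, 256.0)):
    """Project 3D skeleton joints (N, 3) into 2D pixel coordinates with a
    simple pinhole camera. Hypothetical sketch, not MoSA's implementation.
    Depth (z) is returned so the generator can order joints front-to-back."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1), z  # 2D positions + per-joint depth

# Two joints on the same camera ray: they land at different 2D positions
# only because of depth, and the farther one is known to be occluded.
joints = np.array([[0.1, -0.2, 2.0],    # nearer joint (visible)
                   [0.1, -0.2, 3.5]])   # farther joint (occluded)
uv, depth = project_joints(joints)
assert depth[1] > depth[0]  # second joint is behind the first
```

Because the occluded joint still exists in the projected representation (just with larger depth), a downstream generator has no excuse to erase it or let it pass through the occluder.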

2. The Dynamic Spotlight (Human-Aware Control)

Once the 3D skeleton is ready, it's projected onto a 2D video. But a stick figure is too simple to guide a realistic video; it's like trying to direct a movie using only a rough sketch.

  • The Analogy: Imagine a stage director with a smart spotlight. The director doesn't just shine a light on the whole stage; they use a "dynamic control" system to shine a bright, focused beam exactly where the actor's hands and feet are moving, telling the video generator, "Pay attention here, this is where the action is."
  • The Result: This ensures the AI pays extra attention to the moving body parts, making the motion smooth and detailed, rather than just blurring them out.
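The "smart spotlight" idea can be illustrated as a per-pixel weight map that peaks around fast-moving joints. This is a hypothetical sketch of the general technique, not MoSA's actual control module; the function name, Gaussian shape, and velocity weighting are all assumptions made for illustration.

```python
import numpy as np

def motion_weight_map(joints_2d, velocities, h=64, w=64, sigma=4.0):
    """Build a per-pixel weight map that is highest around fast-moving
    joints -- an illustrative stand-in for human-aware dynamic control,
    not MoSA's implementation. `joints_2d` is (N, 2) pixel coordinates;
    `velocities` is (N,) joint speeds."""
    ys, xs = np.mgrid[0:h, 0:w]
    weights = np.ones((h, w))  # baseline attention everywhere
    for (jx, jy), speed in zip(joints_2d, velocities):
        # Gaussian "spotlight" centered on the joint, scaled by its speed
        bump = np.exp(-((xs - jx) ** 2 + (ys - jy) ** 2) / (2 * sigma ** 2))
        weights += speed * bump
    return weights

wmap = motion_weight_map(np.array([[16.0, 16.0], [48.0, 48.0]]),
                         velocities=np.array([0.2, 3.0]))
# The fast-moving joint at (48, 48) gets a much stronger spotlight
# than the nearly static joint at (16, 16).
assert wmap[48, 48] > wmap[16, 16]
```

A map like this could then scale attention or reconstruction loss, so the generator spends its capacity on the hands and feet in motion instead of smearing them.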

3. The Gravity & Contact Check (Contact Constraint)

One of the biggest problems with AI videos is "ghosting"—where a person walks through a wall or their foot sinks into the floor.

  • The Analogy: MoSA adds a physics teacher to the team. Before the video is finalized, this teacher checks: "Is the foot touching the ground? Is the hand hitting the ball?" If the AI tries to make a person walk through a wall, the teacher slams the brakes and says, "No, that's impossible!"
  • The Result: The person interacts with the environment realistically, like feet pressing into grass or hands grabbing a railing.
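A contact constraint of this kind can be sketched as a simple penalty term. This is a toy version for illustration only: the function name, the contact labels, and the exact penalty form are assumptions, not MoSA's published loss. It penalizes (a) a foot labeled "in contact" that hovers off the ground, and (b) any foot sinking below the floor.

```python
import numpy as np

def contact_loss(foot_heights, contact_labels, ground=0.0):
    """Toy foot-ground contact penalty (illustrative, not MoSA's exact
    loss). `foot_heights` is the vertical position of a foot joint per
    frame; `contact_labels` is 1.0 when the foot should be planted."""
    hover = contact_labels * np.abs(foot_heights - ground)  # planted foot must touch
    sink = np.maximum(ground - foot_heights, 0.0)           # nothing goes underground
    return float(np.mean(hover + sink))

# A foot labeled "in contact" but floating at height 0.3 is penalized;
# a swinging foot at the same height with no contact label is not.
planted = contact_loss(np.array([0.3]), np.array([1.0]))
swinging = contact_loss(np.array([0.3]), np.array([0.0]))
assert planted > swinging == 0.0
```

During training, minimizing a term like this pushes the model toward frames where planted feet actually meet the ground and nothing pokes through it.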

The New "Gym" for AI (The MoVid Dataset)

To train this system, the researchers realized existing video datasets were like a gym with only treadmills. They mostly had videos of people standing still, talking, or doing simple dance moves. They lacked videos of people running, jumping, or doing complex sports.

  • The Solution: They built MoVid, a massive new dataset with 30,000 videos of complex, whole-body movements. It's like upgrading the gym to include a full obstacle course, a climbing wall, and a trampoline. This allows the AI to learn how humans actually move in the real world.

The Final Result

When you put it all together, MoSA is like a director who hires a structural engineer, a lighting specialist, and a physics consultant before the cameras start rolling.

  • Old AI: "Here is a video of a person running. Oh, look, their legs are melting into the ground, but the shirt looks great!"
  • MoSA: "Here is a video of a person running. The legs are moving correctly, the feet hit the ground, the arms swing naturally, and the shirt looks great."

In short, MoSA stops the AI from just "guessing" what a human looks like and starts teaching it how a human actually works, resulting in videos that are not just pretty, but physically believable.
