This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you want to create a movie scene where a person jumps over a puddle.
In the past, you had two separate problems:
- The Choreographer: You needed a 3D computer model of a person to figure out exactly how their joints should move (the "physics" of the jump). But these models often looked stiff, glitchy, or didn't understand the story you wanted to tell.
- The Director: You needed a video camera to film it. But if you just asked a video AI to "make a person jump," it often made the person's legs twist into impossible shapes or their body melt like wax because it didn't understand human anatomy.
Usually, you had to do these one after the other: first make the 3D dance, then try to turn it into a video (which often failed), or make a video first and then try to reverse-engineer the 3D dance (which often looked broken).
CoMoVi is like hiring a Super-Director who does both jobs at the exact same time, in perfect sync.
The Big Idea: "The Twin Dance"
The authors realized that 3D movement and 2D video are two sides of the same coin. You can't have a realistic video without realistic 3D movement, and you can't generate generalizable 3D movement without the "common sense" that video models have learned from watching millions of real videos.
So, they built a system that generates both the 3D skeleton and the 2D video simultaneously, like a twin dance where they hold hands and never let go.
How It Works (The Magic Tricks)
1. The "Universal Translator" (The 2D Motion Map)
The biggest hurdle is that 3D data (math coordinates) and 2D video (pixels) speak different languages.
- The Problem: If you just turn a 3D skeleton into a flat picture, you lose depth. If you just look at a flat picture, you don't know which way the arm is facing (is it the left hand or the right?).
- The Solution: The team created a special "Universal Translator" image. Imagine taking a 3D model of a person and painting it with a special code:
- Blue and Green pixels tell you the angle of the skin (like a topographic map).
- Red pixels tell you what body part it is (e.g., "this is a knee," "this is an elbow").
- The Result: This single image looks like a weird, colorful painting, but it contains all the 3D geometry and body part logic hidden inside the colors. This allows the video AI to "see" the 3D structure directly.
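To make the idea concrete, here is a minimal sketch of how such a map could be packed into an ordinary RGB image. The exact channel layout is an assumption for illustration (the paper's encoding may differ); the point is that surface orientation and body-part identity both fit inside normal image channels, so a video model can consume them directly.

```python
import numpy as np

def encode_motion_map(normals, part_ids, num_parts=24):
    """Pack 3D geometry into an ordinary RGB image.

    normals:  (H, W, 3) per-pixel surface normals in [-1, 1]
              (as rendered from the posed 3D body model)
    part_ids: (H, W) integer body-part label per pixel (0 = background)

    Channel layout below is illustrative, not the paper's exact scheme.
    """
    h, w, _ = normals.shape
    img = np.zeros((h, w, 3), dtype=np.uint8)
    # Red channel: which body part each pixel belongs to.
    img[..., 0] = (part_ids.astype(np.float32) / num_parts * 255).astype(np.uint8)
    # Green/Blue channels: two components of the surface normal,
    # rescaled from [-1, 1] to [0, 255] (like a normal map in graphics).
    img[..., 1] = ((normals[..., 0] * 0.5 + 0.5) * 255).astype(np.uint8)
    img[..., 2] = ((normals[..., 1] * 0.5 + 0.5) * 255).astype(np.uint8)
    return img
```

Decoding runs the same mapping in reverse, which is why no 3D information is lost: the "weird, colorful painting" is a lossless-enough container for geometry and part labels.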
2. The "Twin Engines" (Dual-Branch Diffusion)
Most AI video generators are like a single engine trying to do everything. CoMoVi uses two engines working together:
- Engine A (The Video): Generates the realistic movie pixels.
- Engine B (The Motion): Generates the colorful "Universal Translator" map.
- The Connection: They are connected by a "telepathic link" (called Cross-Attention).
- If Engine A starts to make the person's leg look like a noodle, Engine B shouts, "Hey! That's a knee, not a noodle! Fix it!"
- If Engine B makes a movement that looks physically impossible, Engine A says, "That doesn't look like a real human jumping; let's smooth that out."
- They constantly whisper to each other, ensuring the video looks real and the movement makes sense.
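The "telepathic link" above is cross-attention: tokens from one branch query tokens from the other and pull back a weighted summary. Here is a toy single-head sketch of that exchange (no learned projections, illustrative names; the paper's actual layers are assumed to be more elaborate):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Tokens from one branch (queries, shape (n, d)) attend to tokens
    from the other branch (keys_values, shape (m, d))."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # (n, m) weights
    return attn @ keys_values                              # (n, d) summary

def dual_branch_exchange(video_tokens, motion_tokens):
    """Each branch updates its features with information pulled from the
    other (residual connections assumed), so neither can drift alone."""
    video_out = video_tokens + cross_attention(video_tokens, motion_tokens)
    motion_out = motion_tokens + cross_attention(motion_tokens, video_tokens)
    return video_out, motion_out
```

Because the exchange runs in both directions at every denoising step, a mistake in one branch (a "noodle leg" in the pixels, an impossible pose in the motion) is visible to the other branch before it hardens into the final output.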
3. The "Training Gym" (CoMoVi-Dataset)
To teach these twins how to dance, you need a massive gym with millions of examples.
- Existing gyms were either full of low-quality videos or just 3D data without real-world context.
- The authors built a new, massive gym (CoMoVi-Dataset) with 50,000 high-quality videos of real people, complete with text descriptions ("a man running") and perfect 3D motion data. This is the "textbook" the AI studied to learn how humans actually move.
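Each "textbook page" pairs three things: a caption, a video clip, and the matching 3D motion. A sketch of what one training record might look like (field names and paths are invented for illustration, not the dataset's real schema):

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One entry of a paired text/video/motion dataset.

    Field names are illustrative assumptions, not the paper's schema.
    """
    caption: str        # text description, e.g. "a man running"
    video_path: str     # path to the RGB video clip
    motion_path: str    # path to per-frame 3D body pose parameters
    num_frames: int     # clip length in frames

sample = TrainingSample(
    caption="a man running",
    video_path="clips/run_0001.mp4",
    motion_path="motion/run_0001.npz",
    num_frames=120,
)
```

The key property is the pairing itself: every frame of video comes with the 3D pose that produced it, so the twin engines learn the 2D and 3D views of the same movement together.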
Why Is This a Big Deal?
- No More "Uncanny Valley": Because the 3D structure guides the video, the people in the generated videos don't have melting faces or extra fingers. Their bodies stay solid and anatomically correct.
- No More "Choreographers": You don't need to hire a human to make a 3D animation first. You just type a prompt (e.g., "A woman doing a backflip"), and the system creates the 3D motion and the video instantly.
- Better Storytelling: Because the system learned from real videos, it understands the feeling of movement, not just the math. The resulting videos look cinematic and natural.
The Analogy Summary
Think of making a movie with CoMoVi like building a house:
- Old Way: You hire an architect to draw the blueprints (3D motion), then a builder tries to build the house based on those drawings. If the drawings are slightly off, the house looks weird. Or, you hire a builder to build a house, then an architect tries to draw the blueprints from the finished house, and the drawings are messy.
- CoMoVi Way: You have a Super-Builder who holds the blueprint in one hand and the bricks in the other. As they lay a brick (video pixel), they check the blueprint (3D motion) instantly. If the brick doesn't fit the blueprint, they adjust it immediately. The result is a house that is structurally perfect and looks exactly like the drawing.
In short, CoMoVi is the first system to successfully marry the "math" of 3D movement with the "art" of video generation, creating realistic human videos without needing any pre-made reference clips.