Human Video Generation from a Single Image with 3D Pose and View Control

This paper presents HVG, a latent video diffusion model that generates high-quality, multi-view, and spatiotemporally coherent human videos from a single image by leveraging articulated pose modulation, view-temporal alignment, and progressive spatio-temporal sampling to achieve precise 3D pose and view control.

Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

Published 2026-02-25

Imagine you have a single, perfect photograph of a friend standing still. Now, imagine you want to turn that photo into a full movie where your friend dances, spins, and walks around, all while the camera flies around them to show every angle.

That is the magic trick this paper, HVG (Human Video Generation in 4D), is trying to perform.

Here is the story of how they did it, explained without the heavy technical jargon.

The Problem: The "Flat" vs. The "Rigid"

Before HVG, other AI methods tried to do this, but they had two main flaws:

  1. The "Stick Figure" Problem: Some methods used 2D skeletons (like a stick figure drawing) to tell the AI how to move. This works okay if the person just turns their head. But if they spin around, the AI gets confused. It might twist an arm backward like a pretzel or make a leg disappear because it doesn't understand that arms have volume and can't pass through bodies. It's like trying to direct a play using only a shadow puppet; the AI doesn't know where the "real" body is in 3D space.
  2. The "Mannequin" Problem: Other methods used 3D digital mannequins (called SMPL) that have a fixed shape. The problem is, real people wear clothes! If your friend is wearing a big, fluffy coat, a rigid mannequin can't show the coat flapping in the wind. It treats the clothes like part of the skin, leading to weird "shape leaking" where the coat looks like it's melting into the legs.

The Solution: HVG's Three Secret Weapons

The authors built a new system called HVG that solves these problems using three clever tricks.

1. The "3D Bone Map" (The Invisible Skeleton)

Instead of using a flat stick figure or a rigid mannequin, HVG creates a 3D "Bone Map."

  • The Analogy: Imagine your friend's skeleton isn't just thin lines, but is made of soft, 3D sausages (ellipsoids) connecting the joints.
  • Why it works: These "sausages" have thickness. When the AI sees the arm cross in front of the body, the "sausage" knows it's blocking the view. It knows exactly how much space the arm takes up. This prevents the AI from making impossible moves (like a hip dislocating) and keeps the clothes looking real, because the AI knows the clothes are draped over these 3D shapes, not stuck to a flat surface.
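The bone-map idea can be sketched in a few lines of numpy. This is a hypothetical, simplified rasterizer (the paper's actual renderer is more sophisticated): each bone becomes a 3D capsule of points, each sample is projected through a pinhole camera, and nearer bones overwrite farther ones, so occlusion falls out of the depth test automatically. The function name, the 16-sample discretization, and the per-bone radii are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def render_bone_map(joints3d, bones, radii, K, hw=(64, 64)):
    """Rasterize bones as projected 3D capsules ("sausages").

    joints3d: (J, 3) camera-space joint positions (hypothetical input)
    bones:    list of (parent, child) joint-index pairs
    radii:    per-bone capsule radius in metres (illustrative)
    K:        3x3 pinhole camera intrinsics
    Returns an (H, W) depth map where nearer bones occlude farther ones.
    """
    H, W = hw
    depth = np.full((H, W), np.inf)
    for (a, b), r in zip(bones, radii):
        # Sample points along the bone axis, project each one.
        for t in np.linspace(0.0, 1.0, 16):
            p = (1 - t) * joints3d[a] + t * joints3d[b]
            uvw = K @ p
            u, v, z = uvw[0] / uvw[2], uvw[1] / uvw[2], p[2]
            # Projected capsule radius in pixels (pinhole approximation).
            rp = int(np.ceil(K[0, 0] * r / z))
            u0, v0 = int(round(u)), int(round(v))
            for dv in range(-rp, rp + 1):
                for du in range(-rp, rp + 1):
                    if du * du + dv * dv > rp * rp:
                        continue
                    x, y = u0 + du, v0 + dv
                    if 0 <= x < W and 0 <= y < H and z < depth[y, x]:
                        depth[y, x] = z  # nearer bone wins: occlusion for free
    return depth
```

Because each pixel stores a depth, an arm crossing in front of the torso naturally masks it, which is exactly the information a flat 2D skeleton throws away.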

2. The "Centering Trick" (Keeping the Camera Calm)

When a camera circles around a person, the person drifts around the frame from shot to shot. This confuses the AI because it has to constantly relearn where the person is.

  • The Analogy: Imagine a stagehand who constantly moves the actor so they are always standing in the exact center of the stage, no matter which way they turn.
  • Why it works: HVG uses a "View Alignment" strategy. It mathematically shifts the person so they stay centered in the "AI's mind" for every camera angle. This makes it much easier for the AI to learn that "this is the same person" from every angle, resulting in a video that doesn't flicker or glitch when the camera moves.
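The core of the centering trick is just a translation. Here is a minimal numpy sketch, under the simplifying assumption that "centering" means shifting the subject's root joint onto the camera's optical axis so it always projects to the image's principal point; the paper's view-alignment strategy is more involved, and the function names here are hypothetical.

```python
import numpy as np

def center_subject(joints3d, root_idx=0):
    """Shift camera-space joints so the root joint sits on the optical axis.

    After this translation the root projects to the principal point for
    any pinhole camera, so the subject stays image-centred in every view.
    (Illustrative helper, not the paper's exact alignment.)
    """
    root = joints3d[root_idx]
    # Keep the depth, zero out the lateral offset: root -> (0, 0, z).
    shift = np.array([-root[0], -root[1], 0.0])
    return joints3d + shift

def project(K, p):
    """Project one camera-space point with 3x3 intrinsics K."""
    uvw = K @ p
    return uvw[:2] / uvw[2]
```

With the subject pinned to the frame center like this, the model never has to track "where did the person go?" across views and can spend its capacity on appearance consistency instead.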

3. The "Puzzle Piece" Strategy (Building the Movie)

Making a long video with many camera angles is like trying to solve a giant puzzle all at once. It demands more memory than the computer has, and the edges of separately generated pieces often don't match up.

  • The Analogy: Instead of trying to paint the whole mural in one go, HVG paints it in small, overlapping tiles. It paints a few seconds of time, then a few camera angles, then overlaps them slightly to blend the edges perfectly.
  • Why it works: This "Progressive Spatio-Temporal Sampling" allows the AI to generate long, smooth videos without running out of memory or creating choppy transitions. It ensures the video flows like butter, even when the camera is spinning wildly.
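The "overlapping tiles" idea can be illustrated with a tiny blending routine. This sketch assumes a simplified setting where each chunk is a 1-D window of frame features generated separately, and overlapping frames are merged with linear cross-fade weights; the paper's progressive spatio-temporal sampling operates on diffusion latents over both views and time, so treat this as an analogy in code, not the method itself.

```python
import numpy as np

def blend_windows(windows, window_len, stride, total_len):
    """Merge overlapping windows of frames with linear ramp weights.

    windows: list of (window_len, D) arrays, one per generated chunk
             (stand-ins for the per-chunk frames or latents).
    Overlapping frames are weighted-averaged with a triangular ramp,
    so neighbouring chunks cross-fade at the seams instead of cutting.
    """
    out = np.zeros((total_len,) + windows[0].shape[1:])
    wsum = np.zeros(total_len)
    # Triangular weights: low at a window's edges, high in its middle.
    ramp = np.minimum(np.arange(1, window_len + 1),
                      np.arange(window_len, 0, -1)).astype(float)
    for i, w in enumerate(windows):
        s = i * stride
        out[s:s + window_len] += ramp[:, None] * w
        wsum[s:s + window_len] += ramp
    return out / wsum[:, None]
```

The ramp weights are the key design choice: each chunk's contribution fades out exactly where its neighbour's fades in, which is why the stitched result has no visible seam even though no chunk ever saw the whole video.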

The Result: A Digital Twin That Breathes

When you put these three tricks together, HVG can take a single photo and generate a 4D video (3D space + time).

  • Clothes look real: You can see wrinkles in a shirt as the person twists.
  • No weird glitches: Limbs don't twist backward, and clothes don't melt into skin.
  • Smooth camera moves: You can watch the person dance from the front, the side, and the back, and it looks like a real camera crew filmed it.

The One Flaw

The paper admits one small weakness: The Face.
Because the AI is so focused on getting the big body movements and the clothes right, the face sometimes gets a little blurry or distorted (like a nose looking slightly off). It's a trade-off: the AI is great at the "big picture" but sometimes misses the tiny details of the face. The authors suggest that in the future, they might use a special "face-only" AI to fix this, like adding a high-definition filter just for the head.

In a Nutshell

HVG is like a super-smart digital puppeteer. Instead of using flat strings (2D skeletons) or stiff mannequins, it uses soft, 3D "sausage bones" to guide the movement, keeps the actor centered so the camera doesn't get dizzy, and builds the movie piece-by-piece to ensure it looks smooth and realistic. It's a huge step toward creating virtual humans that look and move just like the real thing.
