Consistency-Preserving Diverse Video Generation

This paper proposes a joint-sampling framework for flow-matching video generators. It enhances batch diversity while preserving temporal consistency: diversity-driven updates push the videos in a batch apart, and the components of those updates that would harm consistency are selectively removed. Everything is computed in latent space, avoiding costly video decoding.

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

Published 2026-02-18

Imagine you are a director trying to film a scene. You have a very expensive camera that takes a long time to set up and shoot. Because it's so costly, you can only afford to shoot one take per day.

In the world of AI video generation, this is exactly the problem. Generating a video from a text prompt (like "a cat playing piano") is computationally expensive. Usually, you get one video per prompt. But what if you could shoot four takes at once? You could pick the best one, or use all four for different purposes.

However, there's a catch:

  1. Diversity: You want the four takes to look different (maybe one is sunny, one is rainy, one is from a different angle).
  2. Consistency: Within each individual video, the cat shouldn't suddenly turn into a dog, or the piano shouldn't disappear and reappear. The video needs to flow smoothly.

Previous methods tried to make the videos diverse, but they were like a clumsy director who yelled "Action!" so hard that the actors forgot their lines. The videos became diverse, but they were glitchy and flickered (bad consistency). Also, checking if the video looked good required "decoding" the whole thing, which was like developing film in a darkroom—slow and expensive.

The Paper's Solution: The "Smart Editor"

This paper introduces a new way to generate multiple videos at once that are both different from each other and smooth within themselves. They call it a "Consistency-Preserving Joint Sampling Framework."

Here is how it works, using simple analogies:

1. The "Ghost" Camera (Latent Space Models)

Normally, to check if a video is good, the AI has to fully render it (turn the math into a real video file) and then check it. This is slow.

  • The Paper's Trick: Instead of developing the full film, they train a tiny, lightweight "Ghost Camera" that works on the blueprints (the latent space) of the video.
  • The Analogy: Imagine you are an architect. Instead of building a full house to check if the rooms are too far apart, you use a quick, rough sketch to measure the distances. The paper uses these "sketches" to check for diversity and consistency instantly, without the expensive "construction" (decoding).
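The "quick sketch" idea can be made concrete with a toy diversity score computed directly on latent codes, never on decoded pixels. This is a minimal sketch: the batch shape, the flattened latent representation, and the mean-pairwise-distance metric are illustrative assumptions, not the paper's actual learned models.

```python
import numpy as np

def latent_diversity_score(latents: np.ndarray) -> float:
    """Mean pairwise L2 distance between the latent videos in a batch.

    `latents` has shape (batch, dim): one flattened latent code per video.
    Scoring here, on the "blueprints," is what avoids the slow decoding
    step; the specific metric is an illustrative stand-in.
    """
    b = latents.shape[0]
    diffs = latents[:, None, :] - latents[None, :, :]   # (b, b, dim)
    dists = np.linalg.norm(diffs, axis=-1)              # (b, b)
    # Average over the b*(b-1) off-diagonal pairs (diagonal is zero).
    return float(dists.sum() / (b * (b - 1)))
```

A batch of identical latent codes scores 0; the further apart the codes drift, the higher the score, and no "construction" (decoding) ever happens.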

2. The "Push and Pull" Dance (Gradient Regulation)

The AI is trying to generate a batch of videos. It has two competing goals:

  • Goal A (Diversity): "Hey videos, spread out! Be different from each other!" (Pushing them apart).
  • Goal B (Consistency): "Hey video, stay smooth! Don't flicker or glitch!" (Keeping it steady).

Previous methods would push the videos apart so hard that they broke the smoothness.

  • The Paper's Trick: They use a "Smart Editor" (Gradient Regulation).
  • The Analogy: Imagine you are pushing a group of people (the videos) to spread out in a room.
    • If someone starts walking toward a wall (which would ruin the video's consistency), the editor gently blocks that specific direction.
    • But if someone is walking toward an open door (which is safe and makes them more diverse), the editor lets them go.
    • Result: The group spreads out nicely (diverse), but no one crashes into the wall (consistent).
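The "block the wall, allow the door" step above can be sketched as a gradient projection. Assume `g_div` is the update pushing videos apart and `g_con` is the gradient of a consistency loss (stepping along `g_con` makes a video less consistent). This PCGrad-style projection is an illustrative stand-in for the paper's gradient regulation, not its exact formulation.

```python
import numpy as np

def regulate(g_div: np.ndarray, g_con: np.ndarray) -> np.ndarray:
    """Remove from the diversity update the part that hurts consistency.

    If the diversity update has a positive component along g_con (it is
    "walking toward the wall"), subtract that component; otherwise the
    update is safe ("walking toward an open door") and passes through.
    """
    overlap = g_div @ g_con
    if overlap > 0:
        g_div = g_div - (overlap / (g_con @ g_con)) * g_con
    return g_div
```

For example, with `g_div = [1, 1]` and `g_con = [0, 1]`, the harmful vertical component is stripped and `[1, 0]` survives; a `g_div` already pointing away from `g_con` is left untouched.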

3. The "Time-Traveler" Check (Temporal Consistency)

To make sure the video doesn't glitch, the system checks if the frames flow logically.

  • The Analogy: Imagine a flipbook animation. If you flip through the pages, the character should move smoothly. If you skip a page, the character should still be in a logical position.
  • The paper's system uses a "Time-Traveler" model. It looks at a frame and asks, "If I skip this frame and look at the next one, does this frame fit in the middle?" If the answer is "No," the system adjusts the blueprint so the frame fits perfectly.
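The "does this frame fit in the middle?" question can be sketched as an interpolation residual on latent frames. Assuming linear interpolation as a toy stand-in for the paper's learned predictor, a frame that lies on the path between its neighbors scores 0 and a glitchy frame scores higher:

```python
import numpy as np

def interpolation_residual(prev: np.ndarray, cur: np.ndarray,
                           nxt: np.ndarray) -> float:
    """How badly latent frame `cur` fits between its two neighbors.

    The midpoint of `prev` and `nxt` is a toy prediction of `cur`;
    the real system would use a learned model in latent space.
    """
    predicted = 0.5 * (prev + nxt)
    return float(np.linalg.norm(cur - predicted))
```

A smooth flipbook (e.g. frames at positions 0, 1, 2) yields residual 0; a frame that jumps off the path yields a large residual, signaling where the blueprint needs adjusting.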

The Results: Why It Matters

The researchers tested this on a state-of-the-art video generator.

  • The Old Way: You get diverse videos, but they look like a broken VCR tape (flickering, weird colors).
  • The New Way: You get videos that are just as diverse as the old way, but they look natural, smooth, and high-quality.

In a nutshell:
This paper teaches AI how to shoot multiple movie takes at once without the director going crazy. It uses a "quick sketch" method to save time and a "smart editor" to ensure that while the movies are different from each other, each individual movie tells a smooth, coherent story.

Key Takeaway: You don't have to choose between variety and quality anymore. You can have both, and you can do it faster.
