Consistency-Preserving Diverse Video Generation

This paper proposes a joint-sampling framework for flow-matching video generators. It enhances batch diversity while preserving temporal consistency: diversity-driven updates push the videos in a batch apart, and the components of those updates that would harm consistency are selectively removed. Everything is computed in latent space, avoiding costly video decoding.

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

Published 2026-02-18

Imagine you are a director trying to film a scene. You have a very expensive camera that takes a long time to set up and shoot. Because it's so costly, you can only afford to shoot one take per day.

In the world of AI video generation, this is exactly the problem. Generating a video from a text prompt (like "a cat playing piano") is computationally expensive. Usually, you get one video per prompt. But what if you could shoot four takes at once? You could pick the best one, or use all four for different purposes.

However, there's a catch:

  1. Diversity: You want the four takes to look different (maybe one is sunny, one is rainy, one is from a different angle).
  2. Consistency: Within each individual video, the cat shouldn't suddenly turn into a dog, or the piano shouldn't disappear and reappear. The video needs to flow smoothly.

Previous methods tried to make the videos diverse, but they were like a clumsy director who yelled "Action!" so hard that the actors forgot their lines. The videos became diverse, but they were glitchy and flickered (bad consistency). Also, checking if the video looked good required "decoding" the whole thing, which was like developing film in a darkroom—slow and expensive.

The Paper's Solution: The "Smart Editor"

This paper introduces a new way to generate multiple videos at once that are both different from each other and smooth within themselves. They call it a "Consistency-Preserving Joint Sampling Framework."

Here is how it works, using simple analogies:

1. The "Ghost" Camera (Latent Space Models)

Normally, to check if a video is good, the AI has to fully render it (turn the math into a real video file) and then check it. This is slow.

  • The Paper's Trick: Instead of developing the full film, they train a tiny, lightweight "Ghost Camera" that works on the blueprints (the latent space) of the video.
  • The Analogy: Imagine you are an architect. Instead of building a full house to check if the rooms are too far apart, you use a quick, rough sketch to measure the distances. The paper uses these "sketches" to check for diversity and consistency instantly, without the expensive "construction" (decoding).
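The "quick sketch" idea can be made concrete with a toy diversity score computed directly on latent codes, never on decoded pixels. This is a minimal sketch: the batch shape, the flattened latent representation, and the mean-pairwise-distance metric are illustrative assumptions, not the paper's actual learned models.

```python
import numpy as np

def latent_diversity_score(latents: np.ndarray) -> float:
    """Mean pairwise L2 distance between the latent videos in a batch.

    `latents` has shape (batch, dim): one flattened latent code per video.
    Scoring here, on the "blueprints," is what avoids the slow decoding
    step; the specific metric is an illustrative stand-in.
    """
    b = latents.shape[0]
    diffs = latents[:, None, :] - latents[None, :, :]   # (b, b, dim)
    dists = np.linalg.norm(diffs, axis=-1)              # (b, b)
    # Average over the b*(b-1) off-diagonal pairs (diagonal is zero).
    return float(dists.sum() / (b * (b - 1)))
```

A batch of identical latent codes scores 0; the further apart the codes drift, the higher the score, and no "construction" (decoding) ever happens.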

2. The "Push and Pull" Dance (Gradient Regulation)

The AI is trying to generate a batch of videos. It has two competing goals:

  • Goal A (Diversity): "Hey videos, spread out! Be different from each other!" (Pushing them apart).
  • Goal B (Consistency): "Hey video, stay smooth! Don't flicker or glitch!" (Keeping it steady).

Previous methods would push the videos apart so hard that they broke the smoothness.

  • The Paper's Trick: They use a "Smart Editor" (Gradient Regulation).
  • The Analogy: Imagine you are pushing a group of people (the videos) to spread out in a room.
    • If someone starts walking toward a wall (which would ruin the video's consistency), the editor gently blocks that specific direction.
    • But if someone is walking toward an open door (which is safe and makes them more diverse), the editor lets them go.
    • Result: The group spreads out nicely (diverse), but no one crashes into the wall (consistent).
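The "block the wall, allow the door" step above can be sketched as a gradient projection. Assume `g_div` is the update pushing videos apart and `g_con` is the gradient of a consistency loss (stepping along `g_con` makes a video less consistent). This PCGrad-style projection is an illustrative stand-in for the paper's gradient regulation, not its exact formulation.

```python
import numpy as np

def regulate(g_div: np.ndarray, g_con: np.ndarray) -> np.ndarray:
    """Remove from the diversity update the part that hurts consistency.

    If the diversity update has a positive component along g_con (it is
    "walking toward the wall"), subtract that component; otherwise the
    update is safe ("walking toward an open door") and passes through.
    """
    overlap = g_div @ g_con
    if overlap > 0:
        g_div = g_div - (overlap / (g_con @ g_con)) * g_con
    return g_div
```

For example, with `g_div = [1, 1]` and `g_con = [0, 1]`, the harmful vertical component is stripped and `[1, 0]` survives; a `g_div` already pointing away from `g_con` is left untouched.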

3. The "Time-Traveler" Check (Temporal Consistency)

To make sure the video doesn't glitch, the system checks if the frames flow logically.

  • The Analogy: Imagine a flipbook animation. If you flip through the pages, the character should move smoothly. If you skip a page, the character should still be in a logical position.
  • The paper's system uses a "Time-Traveler" model. It looks at a frame and asks, "If I skip this frame and look at the next one, does this frame fit in the middle?" If the answer is "No," the system adjusts the blueprint so the frame fits perfectly.
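The "does this frame fit in the middle?" question can be sketched as an interpolation residual on latent frames. Assuming linear interpolation as a toy stand-in for the paper's learned predictor, a frame that lies on the path between its neighbors scores 0 and a glitchy frame scores higher:

```python
import numpy as np

def interpolation_residual(prev: np.ndarray, cur: np.ndarray,
                           nxt: np.ndarray) -> float:
    """How badly latent frame `cur` fits between its two neighbors.

    The midpoint of `prev` and `nxt` is a toy prediction of `cur`;
    the real system would use a learned model in latent space.
    """
    predicted = 0.5 * (prev + nxt)
    return float(np.linalg.norm(cur - predicted))
```

A smooth flipbook (e.g. frames at positions 0, 1, 2) yields residual 0; a frame that jumps off the path yields a large residual, signaling where the blueprint needs adjusting.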

The Results: Why It Matters

The researchers tested this on a state-of-the-art video generator.

  • The Old Way: You get diverse videos, but they look like a broken VCR tape (flickering, weird colors).
  • The New Way: You get videos that are just as diverse as the old way, but they look natural, smooth, and high-quality.

In a nutshell:
This paper teaches AI how to shoot multiple movie takes at once without the director going crazy. It uses a "quick sketch" method to save time and a "smart editor" to ensure that while the movies are different from each other, each individual movie tells a smooth, coherent story.

Key Takeaway: You don't have to choose between variety and quality anymore. You can have both, and you can do it faster.
