Mode Seeking meets Mean Seeking for Fast Long Video Generation

This paper proposes a Decoupled Diffusion Transformer that combines a global Flow Matching head for long-term narrative coherence with a local Distribution Matching head for short-video fidelity, enabling the fast generation of high-quality, minute-scale videos by effectively bridging the gap between limited long-form data and abundant short-form data.

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

Published 2026-03-02

Imagine you want to teach a robot to tell a story. You have two very different problems to solve:

  1. The "Local" Problem: Every single sentence the robot speaks needs to be clear, sharp, and grammatically perfect.
  2. The "Global" Problem: The entire story needs to make sense from start to finish. The character shouldn't forget who they are, and the plot shouldn't jump randomly from a beach to a spaceship without explanation.

For a long time, AI video generators were great at Problem 1 (making short, 5-second clips that look amazing) but terrible at Problem 2 (making a 1-minute video that stays coherent). If you forced them to make long videos, the output would get blurry, the characters would melt, and the story would fall apart.

This paper, "Mode Seeking meets Mean Seeking," introduces a clever new way to teach the AI to do both at once. Here is how it works, using some everyday analogies.

The Core Problem: The "Interpolation" Trap

The authors point out a common mistake in AI training. People thought that making a long video was just like making a high-resolution image.

  • Image Analogy: If you have a 256x256 pixel image, making it 1024x1024 is just "filling in the gaps" with more detail. It's the same picture, just sharper.
  • Video Reality: A 1-minute video is not just a longer version of a 5-second clip. It's a completely different beast. A 5-second clip is a snapshot; a 1-minute video is a movie with a plot, cause-and-effect, and new events happening.

When researchers tried to train a single model on a mix of short and long videos, it got confused: short clips and long videos are two very different distributions, and the model tried to "average them out." The result? A blurry, dream-like mess where nothing moved sharply and the story made no sense.

The Solution: The "Student" and the "Teacher"

The authors propose a training method that splits the job into two distinct roles, using a Decoupled Diffusion Transformer (DDT). Think of this as hiring a team with two specialized coaches.

1. The "Mean Seeking" Coach (The Storyteller)

  • The Job: This coach is in charge of the big picture.
  • The Analogy: Imagine a film director who has watched very few long movies (because long, high-quality movies are rare and expensive). This director is bad at lighting and camera angles, but they are great at understanding plot structure. They know that if a character picks up a gun in scene 1, they should probably use it in scene 3.
  • How it works: The AI uses a "Flow Matching" head to learn from these rare, long videos. It learns the "mean" (the average, logical flow) of how a story should unfold over time. It ensures the video doesn't drift off into nonsense.
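To make "mean seeking" concrete, here is a toy sketch of the Flow Matching training target. This is not the paper's actual implementation (which operates on video latents inside a transformer); it just shows why a regression loss like this learns the *average* flow: the head is trained with mean-squared error to predict the velocity that carries noise toward data.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear path between noise x0 and data x1 at time t in [0, 1].
    Returns the interpolated sample x_t and the velocity target v = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return x_t, v

def flow_matching_loss(pred_v, v):
    """Mean-squared regression loss. Because it minimizes squared error,
    the optimal prediction is the *mean* velocity over the data --
    exactly the 'mean seeking' behavior described above."""
    return float(np.mean((pred_v - v) ** 2))
```

A perfect prediction drives the loss to zero; any averaging over conflicting targets shows up as blur in the generated video, which is acceptable for global plot structure but not for local detail.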

2. The "Mode Seeking" Coach (The Art Critic)

  • The Job: This coach is in charge of local details.
  • The Analogy: Imagine a world-famous cinematographer who has made thousands of perfect 5-second commercials. They know exactly how light hits a face, how hair moves in the wind, and how to make things look "real." However, they have never made a long movie and don't care about the plot.
  • How it works: The AI uses a "Distribution Matching" head. It constantly checks every 5-second chunk of the long video it is generating and asks the cinematographer: "Does this specific moment look as sharp and real as your best commercials?"
  • The Magic: The AI forces the long video to "seek" the high-quality "modes" (the best, sharpest examples) of the short-video teacher. It doesn't average them out; it copies the sharpness.
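The "mode seeking" behavior comes from a reverse-KL-style distribution matching gradient: the difference between a teacher score (trained on real short clips) and a critic score (trained on the generator's own outputs). The one-dimensional Gaussian scores below are purely hypothetical stand-ins for illustration; the point is that the gradient pushes a generated chunk toward the teacher's high-density modes rather than averaging over them.

```python
import numpy as np

def gaussian_score(mu):
    """Toy 1-D score function: for N(mu, 1), the score is (mu - x)."""
    return lambda x: mu - x

def dmd_gradient(x, real_score, fake_score):
    """Distribution-matching (reverse-KL) direction on a generated chunk:
    teacher score on real short clips minus critic score on the
    generator's own samples. Following it moves x toward the teacher's
    modes instead of blurring toward its mean."""
    return real_score(x) - fake_score(x)

# Teacher mode at 2.0, critic currently centered on the generator at 0.0.
x = np.array([0.0])
g = dmd_gradient(x, gaussian_score(2.0), gaussian_score(0.0))
```

Here `g` is positive, pulling the sample toward the teacher's mode at 2.0; once the generator's distribution matches the teacher's, the two scores cancel and the gradient vanishes.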

How They Work Together: The "Sliding Window"

The genius of this paper is how these two coaches talk to each other without fighting.

Imagine the AI is generating a 1-minute video. It breaks the video into overlapping 5-second "windows" (like looking through a sliding window on a train).

  • The Director (Mean Seeking) looks at the whole train ride to make sure the route is logical.
  • The Cinematographer (Mode Seeking) looks through the window at the current 5-second view to make sure the scenery looks crystal clear.
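The windowing itself is simple: slice the long video into overlapping short chunks so every moment gets inspected by the short-video critic. The frame counts below are hypothetical, chosen only to make the pattern visible.

```python
def sliding_windows(num_frames, window, stride):
    """Overlapping (start, end) windows covering a long clip.
    Each window is a short chunk the mode-seeking head can judge
    against its short-video training distribution."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

# A 60-frame "minute" seen as 20-frame windows with 10-frame stride:
# sliding_windows(60, 20, 10) -> [(0, 20), (10, 30), (20, 40), (30, 50), (40, 60)]
```

The overlap matters: because consecutive windows share frames, the local critic's feedback on one chunk stays consistent with its neighbors instead of creating seams at chunk boundaries.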

The AI uses a shared brain (the encoder) to understand the context, but it has two separate hands (the heads) to execute the tasks. One hand writes the story; the other hand paints the details.
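The "shared brain, two hands" design can be sketched as one encoder feeding two decoupled heads. The tiny linear layers below are stand-ins for the paper's transformer blocks; only the wiring is the point: both heads read the same context, but each produces its own output and receives its own loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 8  # hypothetical feature and hidden sizes

W_enc = rng.standard_normal((D, H))  # shared encoder ("brain")
W_fm = rng.standard_normal((H, D))   # global Flow Matching head (story)
W_dm = rng.standard_normal((H, D))   # local Distribution Matching head (detail)

def forward(x):
    """One shared representation, two decoupled outputs.
    The flow-matching head gets the mean-seeking regression loss;
    the distribution-matching head gets the mode-seeking critic loss."""
    h = np.tanh(x @ W_enc)
    return h @ W_fm, h @ W_dm
```

Decoupling the heads is what stops the two objectives from fighting: each loss shapes its own output layer, while the shared encoder learns a representation useful to both.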

The Result: Fast and Sharp

Because the AI learns the "art" from the short-video teacher, it doesn't need to relearn how to make things look real from scratch. This lets it generate long videos in just a few denoising steps (very fast), rather than the dozens or hundreds of steps a standard diffusion sampler needs.
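Few-step generation can be pictured as integrating the learned velocity field with only a handful of Euler steps from noise (t=0) to video (t=1). The constant toy field below is illustrative, not the paper's learned model:

```python
import numpy as np

def few_step_sample(velocity, x, steps=4):
    """Integrate a velocity field from t=0 to t=1 with a few Euler steps.
    Distilled generators get away with very few steps, which is the
    source of the speedup over many-step diffusion sampling."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy field: constant unit velocity carries "noise" at 0 exactly to "data" at 1.
out = few_step_sample(lambda x, t: np.ones_like(x), np.zeros(3), steps=4)
```

With a well-trained (and mode-sharpened) velocity field, four steps can land close to where hundreds of small diffusion steps would, which is why inference drops from hours to minutes.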

In summary:

  • Old Way: Trying to teach one student to be both a master storyteller and a master painter using a messy mix of data. Result: A blurry, confusing mess.
  • New Way: Hire a Story Director (who knows the plot) and a Master Painter (who knows the details). Let the Director guide the flow of the movie, and let the Painter fix the details of every single frame.

The result is a video generator that can create minute-long, coherent stories that still look as sharp and realistic as a 5-second Hollywood commercial.
