Mode Seeking meets Mean Seeking for Fast Long Video Generation

This paper proposes a Decoupled Diffusion Transformer that combines a global Flow Matching head for long-term narrative coherence with a local Distribution Matching head for short-video fidelity, enabling the fast generation of high-quality, minute-scale videos by effectively bridging the gap between limited long-form data and abundant short-form data.

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

Published 2026-03-02

Imagine you want to teach a robot to tell a story. You have two very different problems to solve:

  1. The "Local" Problem: Every single sentence the robot speaks needs to be clear, sharp, and grammatically perfect.
  2. The "Global" Problem: The entire story needs to make sense from start to finish. The character shouldn't forget who they are, and the plot shouldn't jump randomly from a beach to a spaceship without explanation.

For a long time, AI video generators were great at Problem 1 (making short, 5-second clips that look amazing) but terrible at Problem 2 (making a 1-minute video that stays coherent). If you forced them to make long videos, the output would get blurry, the characters would melt, and the story would fall apart.

This paper, "Mode Seeking meets Mean Seeking," introduces a clever new way to teach the AI to do both at once. Here is how it works, using some everyday analogies.

The Core Problem: The "Interpolation" Trap

The authors point out a common mistake in AI training. People thought that making a long video was just like making a high-resolution image.

  • Image Analogy: If you have a 256x256 pixel image, making it 1024x1024 is just "filling in the gaps" with more detail. It's the same picture, just sharper.
  • Video Reality: A 1-minute video is not just a longer version of a 5-second clip. It's a completely different beast. A 5-second clip is a snapshot; a 1-minute video is a movie with a plot, cause-and-effect, and new events happening.

When researchers tried to train a single model on a mix of short and long videos, it got confused: short clips and long videos are two very different distributions, and the model tried to "average them out." The result? A blurry, dream-like mess where nothing moved sharply and the story made no sense.

The Solution: The "Student" and the "Teacher"

The authors propose a training method that splits the job into two distinct roles, using a Decoupled Diffusion Transformer (DDT). Think of this as hiring a team with two specialized coaches.

1. The "Mean Seeking" Coach (The Storyteller)

  • The Job: This coach is in charge of the big picture.
  • The Analogy: Imagine a film director who has watched very few long movies (because long, high-quality movies are rare and expensive). This director is bad at lighting and camera angles, but they are great at understanding plot structure. They know that if a character picks up a gun in scene 1, they should probably use it in scene 3.
  • How it works: The AI uses a "Flow Matching" head to learn from these rare, long videos. It learns the "mean" (the average, logical flow) of how a story should unfold over time. It ensures the video doesn't drift off into nonsense.
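To make "mean seeking" concrete, here is a toy sketch of the Flow Matching training target. This is not the paper's actual implementation (which operates on video latents inside a transformer); it just shows why a regression loss like this learns the *average* flow: the head is trained with mean-squared error to predict the velocity that carries noise toward data.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear path between noise x0 and data x1 at time t in [0, 1].
    Returns the interpolated sample x_t and the velocity target v = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return x_t, v

def flow_matching_loss(pred_v, v):
    """Mean-squared regression loss. Because it minimizes squared error,
    the optimal prediction is the *mean* velocity over the data --
    exactly the 'mean seeking' behavior described above."""
    return float(np.mean((pred_v - v) ** 2))
```

A perfect prediction drives the loss to zero; any averaging over conflicting targets shows up as blur in the generated video, which is acceptable for global plot structure but not for local detail.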

2. The "Mode Seeking" Coach (The Art Critic)

  • The Job: This coach is in charge of local details.
  • The Analogy: Imagine a world-famous cinematographer who has made thousands of perfect 5-second commercials. They know exactly how light hits a face, how hair moves in the wind, and how to make things look "real." However, they have never made a long movie and don't care about the plot.
  • How it works: The AI uses a "Distribution Matching" head. It constantly checks every 5-second chunk of the long video it is generating and asks the cinematographer: "Does this specific moment look as sharp and real as your best commercials?"
  • The Magic: The AI forces the long video to "seek" the high-quality "modes" (the best, sharpest examples) of the short-video teacher. It doesn't average them out; it copies the sharpness.
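The "mode seeking" behavior comes from a reverse-KL-style distribution matching gradient: the difference between a teacher score (trained on real short clips) and a critic score (trained on the generator's own outputs). The one-dimensional Gaussian scores below are purely hypothetical stand-ins for illustration; the point is that the gradient pushes a generated chunk toward the teacher's high-density modes rather than averaging over them.

```python
import numpy as np

def gaussian_score(mu):
    """Toy 1-D score function: for N(mu, 1), the score is (mu - x)."""
    return lambda x: mu - x

def dmd_gradient(x, real_score, fake_score):
    """Distribution-matching (reverse-KL) direction on a generated chunk:
    teacher score on real short clips minus critic score on the
    generator's own samples. Following it moves x toward the teacher's
    modes instead of blurring toward its mean."""
    return real_score(x) - fake_score(x)

# Teacher mode at 2.0, critic currently centered on the generator at 0.0.
x = np.array([0.0])
g = dmd_gradient(x, gaussian_score(2.0), gaussian_score(0.0))
```

Here `g` is positive, pulling the sample toward the teacher's mode at 2.0; once the generator's distribution matches the teacher's, the two scores cancel and the gradient vanishes.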

How They Work Together: The "Sliding Window"

The genius of this paper is how these two coaches talk to each other without fighting.

Imagine the AI is generating a 1-minute video. It breaks the video into overlapping 5-second "windows" (like looking through a sliding window on a train).

  • The Director (Mean Seeking) looks at the whole train ride to make sure the route is logical.
  • The Cinematographer (Mode Seeking) looks through the window at the current 5-second view to make sure the scenery looks crystal clear.
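The windowing itself is simple: slice the long video into overlapping short chunks so every moment gets inspected by the short-video critic. The frame counts below are hypothetical, chosen only to make the pattern visible.

```python
def sliding_windows(num_frames, window, stride):
    """Overlapping (start, end) windows covering a long clip.
    Each window is a short chunk the mode-seeking head can judge
    against its short-video training distribution."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

# A 60-frame "minute" seen as 20-frame windows with 10-frame stride:
# sliding_windows(60, 20, 10) -> [(0, 20), (10, 30), (20, 40), (30, 50), (40, 60)]
```

The overlap matters: because consecutive windows share frames, the local critic's feedback on one chunk stays consistent with its neighbors instead of creating seams at chunk boundaries.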

The AI uses a shared brain (the encoder) to understand the context, but it has two separate hands (the heads) to execute the tasks. One hand writes the story; the other hand paints the details.
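The "shared brain, two hands" design can be sketched as one encoder feeding two decoupled heads. The tiny linear layers below are stand-ins for the paper's transformer blocks; only the wiring is the point: both heads read the same context, but each produces its own output and receives its own loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 8  # hypothetical feature and hidden sizes

W_enc = rng.standard_normal((D, H))  # shared encoder ("brain")
W_fm = rng.standard_normal((H, D))   # global Flow Matching head (story)
W_dm = rng.standard_normal((H, D))   # local Distribution Matching head (detail)

def forward(x):
    """One shared representation, two decoupled outputs.
    The flow-matching head gets the mean-seeking regression loss;
    the distribution-matching head gets the mode-seeking critic loss."""
    h = np.tanh(x @ W_enc)
    return h @ W_fm, h @ W_dm
```

Decoupling the heads is what stops the two objectives from fighting: each loss shapes its own output layer, while the shared encoder learns a representation useful to both.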

The Result: Fast and Sharp

Because the AI learns the "art" from the short-video teacher, it doesn't need to relearn how to make things look real from scratch. This lets it generate long videos in just a few denoising steps (very fast), rather than the dozens or hundreds of steps a standard diffusion sampler needs.
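Few-step generation can be pictured as integrating the learned velocity field with only a handful of Euler steps from noise (t=0) to video (t=1). The constant toy field below is illustrative, not the paper's learned model:

```python
import numpy as np

def few_step_sample(velocity, x, steps=4):
    """Integrate a velocity field from t=0 to t=1 with a few Euler steps.
    Distilled generators get away with very few steps, which is the
    source of the speedup over many-step diffusion sampling."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy field: constant unit velocity carries "noise" at 0 exactly to "data" at 1.
out = few_step_sample(lambda x, t: np.ones_like(x), np.zeros(3), steps=4)
```

With a well-trained (and mode-sharpened) velocity field, four steps can land close to where hundreds of small diffusion steps would, which is why inference drops from hours to minutes.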

In summary:

  • Old Way: Trying to teach one student to be both a master storyteller and a master painter using a messy mix of data. Result: A blurry, confusing mess.
  • New Way: Hire a Story Director (who knows the plot) and a Master Painter (who knows the details). Let the Director guide the flow of the movie, and let the Painter fix the details of every single frame.

The result is a video generator that can create minute-long, coherent stories that still look as sharp and realistic as a 5-second Hollywood commercial.
