Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

This paper introduces Rolling Sink, a training-free method that bridges the train-test gap in autoregressive video diffusion models. Building on a systematic analysis of how the model's cache is maintained, it enables stable, high-fidelity, open-ended video generation far beyond the model's limited training horizon.

Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker

Published 2026-03-06

The Big Problem: The "Amnesia" of AI Video

Imagine you are teaching a robot to paint a movie scene. You show the robot a 5-second clip of a cat walking down a street. The robot learns perfectly how to paint that cat, the street, and the lighting for those 5 seconds.

Now, you ask the robot to keep painting the movie for 30 minutes.

What happens?
The robot starts to get confused.

  • The Cat: After 2 minutes, the cat might turn into a dog. After 5 minutes, the cat might have three heads.
  • The Colors: The sunny street suddenly turns neon pink or deep purple.
  • The Motion: The smooth walking turns into a jittery, vibrating mess.

This is called "AR drift" (autoregressive drift). It happens because the robot was only trained on short clips. When it has to guess what happens beyond the 5 seconds it knows, it makes small mistakes. Because every new frame is built on the frames it painted before, those mistakes compound like a snowball rolling downhill and eventually destroy the video.
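
The snowball effect can be illustrated with a toy calculation. This is only a sketch under made-up assumptions: the function name, the per-step error, and the gain factor are hypothetical and are not taken from the paper.

```python
def simulate_drift(per_step_error: float, gain: float, num_steps: int) -> list[float]:
    """Toy error model: each step inherits the previous error
    (scaled by `gain`) and then adds its own small error."""
    errors = []
    accumulated = 0.0
    for _ in range(num_steps):
        accumulated = gain * accumulated + per_step_error
        errors.append(accumulated)
    return errors


# Hypothetical numbers, chosen only to show the trend.
errors = simulate_drift(per_step_error=0.01, gain=1.05, num_steps=100)
print(f"error after 1 step:    {errors[0]:.3f}")
print(f"error after 100 steps: {errors[-1]:.3f}")
```

With a gain above 1, the error after many steps is far larger than the sum of the individual small errors, which is the intuition behind drift.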

The Old Solution: "The Static Anchor" (Self Forcing)

Previous researchers tried to fix this by telling the robot: "Hey, remember the very first frame you ever saw? Keep that image in your mind forever as a reference."

In the paper, this is called Self Forcing with an Attention Sink.

  • The Analogy: Imagine you are writing a long story. To keep the characters consistent, you keep the first page of the story pinned to your desk and read it before writing every new sentence.
  • The Result: This helps a little. The colors stay somewhat stable. But the robot still gets confused about time. It forgets that the story is moving forward, so the images flicker and the video can get stuck repeating itself.
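
The pinned-first-page idea can be sketched as a tiny cache. This is an illustration only: the class name, the single pinned frame, and the window of recent frames kept alongside it are assumptions borrowed from how attention sinks are commonly described, not details confirmed by this summary.

```python
from collections import deque


class SinkCache:
    """Toy cache: a few pinned 'sink' frames plus a window of recent frames."""

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sink = []                      # pinned once filled, never evicted
        self.recent = deque(maxlen=window)  # oldest entries drop automatically

    def append(self, frame):
        if len(self.sink) < self.num_sink:
            self.sink.append(frame)
        else:
            self.recent.append(frame)

    def context(self):
        """What the model would look at when painting the next frame."""
        return self.sink + list(self.recent)


cache = SinkCache(num_sink=1, window=3)
for t in range(10):          # frames are just integers here
    cache.append(t)
print(cache.context())       # [0, 7, 8, 9]
```

Frame 0 never leaves the context, no matter how long generation runs, which is exactly the "first page pinned to the desk" behavior.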

The New Solution: "Rolling Sink"

The authors of this paper realized that just pinning the first page isn't enough. You need to keep the context fresh and moving, just like a real movie. They invented Rolling Sink.

Here is how it works, broken down into three simple steps:

1. The "Memory Window" (The Cache)

The robot has a limited amount of memory (a "cache") to hold the recent past. It can only hold about 6 seconds of video at a time.

  • Old Way: It held the very first 6 seconds forever.
  • Rolling Sink Way: It holds the most recent 6 seconds. As new frames are made, the oldest ones are kicked out.
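
The two policies above can be contrasted in a few lines. This is a minimal sketch that follows the description in this article; the function names are hypothetical and frames are represented by plain integers.

```python
from collections import deque


def static_cache(frames, capacity):
    """Old way, as described above: keep only the first `capacity` frames."""
    return list(frames[:capacity])


def rolling_cache(frames, capacity):
    """Rolling way: keep only the most recent `capacity` frames."""
    window = deque(maxlen=capacity)  # appending past capacity evicts the oldest
    for frame in frames:
        window.append(frame)
    return list(window)


frames = list(range(10))
print(static_cache(frames, capacity=3))   # [0, 1, 2]
print(rolling_cache(frames, capacity=3))  # [7, 8, 9]
```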

2. The "Sliding Time" (Sliding Indices)

In the old way, the robot thought the frames in its memory always sat at time 0. But in a 30-minute movie, only the true first frame is at time 0; a frame painted 5 minutes later is at time 300 (in seconds).

  • The Fix: Rolling Sink tells the robot: "Don't treat the memory as a static photo. Treat it as a moving timeline." As the video grows, the "memory window" slides forward. The robot knows that the frames it is looking at are current, not ancient history.
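
The moving-timeline idea can be made concrete with a small sketch. The function names and the use of frame counts as time stamps are illustrative assumptions, not the paper's actual indexing scheme.

```python
def static_indices(capacity: int, current_frame: int) -> list[int]:
    """Cached frames are always labeled 0..capacity-1,
    no matter how far the video has advanced."""
    return list(range(capacity))


def sliding_indices(capacity: int, current_frame: int) -> list[int]:
    """Cached frames are labeled with the positions just before
    the frame currently being generated."""
    start = max(0, current_frame - capacity)
    return list(range(start, current_frame))


print(static_indices(capacity=4, current_frame=300))   # [0, 1, 2, 3]
print(sliding_indices(capacity=4, current_frame=300))  # [296, 297, 298, 299]
```

In the static case the labels say "ancient history" even when the content is recent; in the sliding case the labels move with the video.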

3. The "Rolling Content" (Sliding Semantics)

This is the magic trick.

  • The Problem: If the robot only remembers the first few seconds of the video, it gets stuck in a loop. It keeps trying to recreate the start of the movie over and over.
  • The Fix: Imagine you are watching a long movie. You don't need to remember the exact first frame to understand the plot; you need to remember the flow of the story.
    • Rolling Sink takes the "memory window" and rolls it.
    • It takes the content from the "middle" of the video and cycles it back into the memory.
    • Analogy: Imagine a conveyor belt in a factory.
      • Old Method: The belt stops, and you keep staring at the first box on the belt.
      • Rolling Sink: The belt keeps moving. When a box falls off the end, a new box (a slightly different version of an earlier box) slides in from the side. This keeps the "flavor" of the video consistent without getting stuck on the very first moment.
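
The conveyor belt can be sketched as a cyclic rotation over a bank of earlier frames. This is a hypothetical illustration: the summary does not say which frames are cycled or in what order, so the function name, the bank, and the wrap-around rule are all assumptions.

```python
def rolling_sink_content(bank, step, num_slots):
    """Fill `num_slots` memory slots from `bank`, starting at an offset
    that advances by one each step and wraps around at the end."""
    n = len(bank)
    return [bank[(step + i) % n] for i in range(num_slots)]


bank = ["f0", "f1", "f2", "f3", "f4"]  # stand-ins for earlier frames
print(rolling_sink_content(bank, step=0, num_slots=2))  # ['f0', 'f1']
print(rolling_sink_content(bank, step=4, num_slots=2))  # ['f4', 'f0']
```

The slots never freeze on one frame: each step shows a slightly different slice of earlier content.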

Why is this a Big Deal?

  1. No Extra Training: Usually, to make a robot good at long movies, you have to feed it thousands of hours of long movies to train it. That is incredibly expensive and slow.
    • Rolling Sink is training-free. It takes a robot trained on 5-second clips and lets it generate 30-minute movies just by changing how it remembers things.
  2. It Works: The paper shows that while other robots turn into a blurry, colorful mess after 1 minute, Rolling Sink can generate a 30-minute video where the character looks the same, the colors stay true, and the motion is smooth.

Summary Analogy

  • The Robot: A painter who only knows how to paint 5 seconds of a scene.
  • The Problem: When asked to paint for an hour, the painter forgets what the subject looked like and starts hallucinating.
  • The Old Fix: The painter keeps a photo of the subject's face on the wall. This helps a bit, but the painting still gets weird.
  • The Rolling Sink Fix: The painter puts the photo on a moving carousel. As the painting progresses, the photo rotates to show the subject in slightly different, relevant poses from the "recent past," keeping the painter focused on the flow of the story rather than a frozen moment.

The Result: You can now generate open-ended videos, tens of minutes long, with a model that was only trained on short clips, simply by changing how it "rolls" its memory. No retraining is needed.