SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

This paper introduces SWIFT, a training-free, few-shot video attribution method that uses a sliding-window reconstruction mechanism to trace a generated video back to its source model, achieving high accuracy across multiple state-of-the-art generators without degrading video quality.

Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen

Published 2026-03-10

Imagine you've just watched a stunning, hyper-realistic video of a dragon flying over a city. It looks so real that you can't tell if it was filmed with a camera or created by an AI. Now, imagine you need to prove which specific AI made it. Was it "Model A" or "Model B"?

This is the problem the paper SWIFT solves. It's a new tool designed to act like a digital detective for AI videos, but with a superpower: it doesn't need to be trained on thousands of examples, and it doesn't need to embed anything into the video when it's made.

Here is the breakdown of how SWIFT works, using simple analogies.

1. The Problem: The "Black Box" Mystery

Currently, if you want to trace a video back to its creator, you usually have to do one of two things:

  • The Watermark (Active): The AI company secretly stamps a hidden code into the video while making it. But this can ruin the video quality, and not all companies do it.
  • The Training School (Passive): You build a massive school where you show a detective AI thousands of videos from every possible generator so it learns to spot the differences. This takes forever, costs a fortune, and if a new AI model appears tomorrow, your detective is useless until you train it again.

SWIFT says: "No school, no watermarks, no waiting." It works with just a handful of examples (few-shot) and needs zero training.

2. The Secret Ingredient: The "Time-Compression" Magic

To understand SWIFT, you have to understand how modern AI video makers work. They use a special engine called a 3D VAE.

Think of a 3D VAE like a magic time-compressor.

  • When an AI generates a video, it doesn't look at every single frame individually.
  • Instead, it grabs a chunk of, say, 4 frames, squishes them together into one single "latent" frame (a compressed summary), and then expands them back out.
  • Because of this, those 4 frames have a very specific, tight relationship with each other. They are "best friends" in the AI's eyes. If you mess up the order, the AI gets confused.
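The chunking idea above can be sketched in a few lines. This is an illustration of the concept, not the paper's code; the chunk size of 4 and the frame indices are assumptions for the example:

```python
def temporal_chunks(num_frames, chunk=4, offset=0):
    """Group frame indices into the chunks a 3D VAE would compress together.

    offset simulates sliding the window: a nonzero offset makes each chunk
    mix frames that were never compressed together by the generator.
    """
    frames = list(range(offset, num_frames))
    return [frames[i:i + chunk] for i in range(0, len(frames), chunk)]

# Aligned view: frames the generator squished together stay together.
print(temporal_chunks(8))            # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Shifted view: each chunk now straddles two original groups.
print(temporal_chunks(8, offset=2))  # [[2, 3, 4, 5], [6, 7]]
```

The shifted grouping is exactly the "cat sat" misreading described in the next section: the "best friend" frames are split across chunk boundaries.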

3. The SWIFT Trick: The "Sliding Window" Test

SWIFT acts like a stress test for the video. It asks: "Does this video respect the specific time-rules of the AI that made it?"

Here is the step-by-step process:

Step A: The "Normal" Look

SWIFT takes the video and looks at it in chunks, exactly how the AI originally saw it. It tries to "reconstruct" (re-draw) the video using the AI's own tools.

  • Result: If the video was made by that AI, the reconstruction is perfect. The "Time-Compression" rules are followed, so the AI is happy. The error (loss) is low.

Step B: The "Corrupted" Look

Now, SWIFT shifts its view. It slides its window over by just a few frames.

  • Imagine you have a sentence: "The cat sat on the mat."
  • The AI expects to see "The cat" as one unit.
  • SWIFT shifts the window so it tries to read "cat sat" as one unit.
  • Result: This breaks the "Time-Compression" rules. The AI's tools are now trying to squish together frames that don't belong together in its compressed format.
  • The Reaction: The AI gets confused and makes a mess. The reconstruction is terrible. The error (loss) is huge.

Step C: The Verdict

SWIFT compares the two results:

  • If the video was made by the AI: The "Normal" look was easy (low error), but the "Corrupted" look was a disaster (high error). The difference is massive. SWIFT says: "This is a match!"
  • If the video was made by a different AI or a real camera: The video never had those specific "Time-Compression" rules to begin with. So, whether SWIFT looks at it normally or shifts the window, the AI's tools are equally confused in both cases. The error is high in both scenarios. The difference is tiny. SWIFT says: "Not a match."
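The verdict boils down to comparing the two reconstruction errors. A minimal sketch of that decision rule, assuming the losses have already been computed with the candidate model's 3D VAE (the loss values and the gap threshold below are invented for illustration):

```python
def is_match(aligned_loss, shifted_loss, gap_threshold=0.5):
    """Declare a match when shifting the window hurts reconstruction
    far more than the aligned pass does.

    aligned_loss:  reconstruction error with chunks as the AI saw them.
    shifted_loss:  reconstruction error after sliding the window.
    """
    return (shifted_loss - aligned_loss) > gap_threshold

# Video made by the candidate AI: aligned pass easy, shifted pass a mess.
print(is_match(aligned_loss=0.1, shifted_loss=2.0))  # True  -> "This is a match!"

# Video from a real camera or a different AI: equally confused both ways.
print(is_match(aligned_loss=1.8, shifted_loss=1.9))  # False -> "Not a match."
```

The key design point is that the test looks at the *difference* between the two errors, not their absolute size, so it does not need to know how hard the video is to reconstruct in general.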

4. Why is this a Big Deal?

  • It's Fast: Instead of studying a whole library of videos, it just does a quick "sliding window" test on the video in front of it.
  • It's Flexible: It works on almost any modern AI video generator (like Sora, Hunyuan, Wan) because they all use this "time-compression" trick.
  • It's "Few-Shot": You only need about 20 example videos to calibrate the decision rule. You don't need a million.
  • It's "Zero-Shot" for some: For some models, it works even with zero examples, just by knowing the math of how the AI compresses time.
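One way the handful of examples could be used, sketched under the assumption that calibration means picking a decision threshold from labeled loss gaps (the gap values below are toy numbers, not results from the paper):

```python
from statistics import mean

def calibrate_threshold(match_gaps, other_gaps):
    """Place the decision threshold midway between the average loss gap
    of known matching videos and known non-matching ones.

    match_gaps: (shifted - aligned) losses for videos made by this model.
    other_gaps: the same quantity for videos from other sources.
    """
    return (mean(match_gaps) + mean(other_gaps)) / 2

# In practice ~20 labeled examples per side would replace these toy values.
thresh = calibrate_threshold([1.8, 2.1, 1.9], [0.10, 0.20, 0.15])
print(round(thresh, 3))  # roughly midway between ~1.93 and ~0.15
```

A midpoint rule is just one simple choice; any few-shot calibration only needs enough examples to see that matching videos produce a clearly larger gap.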

The Analogy Summary

Imagine you are trying to identify a specific type of origami paper.

  • Old Method: You have to buy a huge library of every paper type and learn to recognize them by touch (Training).
  • SWIFT Method: You take a piece of paper and try to fold it using a specific, weird fold that only one type of paper can handle without tearing.
    • If the paper is the right type, it folds perfectly on the first try, but tears apart if you shift the fold slightly.
    • If the paper is the wrong type, it tears apart in both tries.

By seeing how much the paper "tears" when you shift the fold, you instantly know if it's the right paper, without ever having seen that paper before.

In short: SWIFT is a clever, lightweight detective that catches AI videos by tripping them up with a slightly shifted timeline, proving exactly which AI created them.