SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

This paper introduces SWIFT, a training-free, few-shot video attribution method that uses a sliding-window reconstruction mechanism to trace a generated video back to its source model, achieving high accuracy across multiple state-of-the-art generators without degrading video quality.

Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen

Published 2026-03-10

Imagine you've just watched a stunning, hyper-realistic video of a dragon flying over a city. It looks so real that you can't tell if it was filmed with a camera or created by an AI. Now, imagine you need to prove which specific AI made it. Was it "Model A" or "Model B"?

This is the problem the paper SWIFT solves. It's a new tool designed to act like a digital detective for AI videos, but with a superpower: it doesn't need to be trained on thousands of examples, and it doesn't need to embed anything into the video when it's made.

Here is the breakdown of how SWIFT works, using simple analogies.

1. The Problem: The "Black Box" Mystery

Currently, if you want to trace a video back to its creator, you usually have to do one of two things:

  • The Watermark (Active): The AI company secretly stamps a hidden code into the video while making it. But this can ruin the video quality, and not all companies do it.
  • The Training School (Passive): You build a massive school where you show a detective AI thousands of videos from every possible generator so it learns to spot the differences. This takes forever, costs a fortune, and if a new AI model appears tomorrow, your detective is useless until you train it again.

SWIFT says: "No school, no watermarks, no waiting." It works with just a handful of examples (few-shot) and needs zero training.

2. The Secret Ingredient: The "Time-Compression" Magic

To understand SWIFT, you have to understand how modern AI video makers work. They use a special engine called a 3D VAE.

Think of a 3D VAE like a magic time-compressor.

  • When an AI generates a video, it doesn't look at every single frame individually.
  • Instead, it grabs a chunk of, say, 4 frames, squishes them together into one single "latent" frame (a compressed summary), and then expands them back out.
  • Because of this, those 4 frames have a very specific, tight relationship with each other. They are "best friends" in the AI's eyes. If you mess up the order, the AI gets confused.
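The chunking idea above can be sketched in a few lines. This is an illustration of the concept, not the paper's code; the chunk size of 4 and the frame indices are assumptions for the example:

```python
def temporal_chunks(num_frames, chunk=4, offset=0):
    """Group frame indices into the chunks a 3D VAE would compress together.

    offset simulates sliding the window: a nonzero offset makes each chunk
    mix frames that were never compressed together by the generator.
    """
    frames = list(range(offset, num_frames))
    return [frames[i:i + chunk] for i in range(0, len(frames), chunk)]

# Aligned view: frames the generator squished together stay together.
print(temporal_chunks(8))            # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Shifted view: each chunk now straddles two original groups.
print(temporal_chunks(8, offset=2))  # [[2, 3, 4, 5], [6, 7]]
```

The shifted grouping is exactly the "cat sat" misreading described in the next section: the "best friend" frames are split across chunk boundaries.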

3. The SWIFT Trick: The "Sliding Window" Test

SWIFT acts like a stress test for the video. It asks: "Does this video respect the specific time-rules of the AI that made it?"

Here is the step-by-step process:

Step A: The "Normal" Look

SWIFT takes the video and looks at it in chunks, exactly how the AI originally saw it. It tries to "reconstruct" (re-draw) the video using the AI's own tools.

  • Result: If the video was made by that AI, the reconstruction is perfect. The "Time-Compression" rules are followed, so the AI is happy. The error (loss) is low.

Step B: The "Corrupted" Look

Now, SWIFT shifts its view. It slides its window over by just a few frames.

  • Imagine you have a sentence: "The cat sat on the mat."
  • The AI expects to see "The cat" as one unit.
  • SWIFT shifts the window so it tries to read "cat sat" as one unit.
  • Result: This breaks the "Time-Compression" rules. The AI's tools are now trying to squish together frames that don't belong together in its compressed format.
  • The Reaction: The AI gets confused and makes a mess. The reconstruction is terrible. The error (loss) is huge.

Step C: The Verdict

SWIFT compares the two results:

  • If the video was made by the AI: The "Normal" look was easy (low error), but the "Corrupted" look was a disaster (high error). The difference is massive. SWIFT says: "This is a match!"
  • If the video was made by a different AI or a real camera: The video never had those specific "Time-Compression" rules to begin with. So, whether SWIFT looks at it normally or shifts the window, the AI's tools are equally confused in both cases. The error is high in both scenarios. The difference is tiny. SWIFT says: "Not a match."
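The verdict boils down to comparing the two reconstruction errors. A minimal sketch of that decision rule, assuming the losses have already been computed with the candidate model's 3D VAE (the loss values and the gap threshold below are invented for illustration):

```python
def is_match(aligned_loss, shifted_loss, gap_threshold=0.5):
    """Declare a match when shifting the window hurts reconstruction
    far more than the aligned pass does.

    aligned_loss:  reconstruction error with chunks as the AI saw them.
    shifted_loss:  reconstruction error after sliding the window.
    """
    return (shifted_loss - aligned_loss) > gap_threshold

# Video made by the candidate AI: aligned pass easy, shifted pass a mess.
print(is_match(aligned_loss=0.1, shifted_loss=2.0))  # True  -> "This is a match!"

# Video from a real camera or a different AI: equally confused both ways.
print(is_match(aligned_loss=1.8, shifted_loss=1.9))  # False -> "Not a match."
```

The key design point is that the test looks at the *difference* between the two errors, not their absolute size, so it does not need to know how hard the video is to reconstruct in general.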

4. Why is this a Big Deal?

  • It's Fast: Instead of studying a whole library of videos, it just does a quick "sliding window" test on the video in front of it.
  • It's Flexible: It works on almost any modern AI video generator (like Sora, Hunyuan, Wan) because they all use this "time-compression" trick.
  • It's "Few-Shot": You only need about 20 example videos to calibrate the decision rule. You don't need a million.
  • It's "Zero-Shot" for some: For some models, it works even with zero examples, just by knowing the math of how the AI compresses time.
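One way the handful of examples could be used, sketched under the assumption that calibration means picking a decision threshold from labeled loss gaps (the gap values below are toy numbers, not results from the paper):

```python
from statistics import mean

def calibrate_threshold(match_gaps, other_gaps):
    """Place the decision threshold midway between the average loss gap
    of known matching videos and known non-matching ones.

    match_gaps: (shifted - aligned) losses for videos made by this model.
    other_gaps: the same quantity for videos from other sources.
    """
    return (mean(match_gaps) + mean(other_gaps)) / 2

# In practice ~20 labeled examples per side would replace these toy values.
thresh = calibrate_threshold([1.8, 2.1, 1.9], [0.10, 0.20, 0.15])
print(round(thresh, 3))  # roughly midway between ~1.93 and ~0.15
```

A midpoint rule is just one simple choice; any few-shot calibration only needs enough examples to see that matching videos produce a clearly larger gap.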

The Analogy Summary

Imagine you are trying to identify a specific type of origami paper.

  • Old Method: You have to buy a huge library of every paper type and learn to recognize them by touch (Training).
  • SWIFT Method: You take a piece of paper and try to fold it using a specific, weird fold that only one type of paper can handle without tearing.
    • If the paper is the right type, it folds perfectly on the first try, but tears apart if you shift the fold slightly.
    • If the paper is the wrong type, it tears apart in both tries.

By seeing how much the paper "tears" when you shift the fold, you instantly know if it's the right paper, without ever having seen that paper before.

In short: SWIFT is a clever, lightweight detective that catches AI videos by tripping them up with a slightly shifted timeline, proving exactly which AI created them.