BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

This paper introduces BusterX, an MLLM-powered framework for detecting AI-generated video forgeries. Paired with the comprehensive GenBuster-200K dataset and the multi-track GenBuster-Bench benchmark, it shifts detection from black-box classification to interpretable visual reasoning, achieving higher accuracy and better explanation quality than leading models.

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

Published 2026-03-09

Imagine the internet is a massive, bustling marketplace. For a long time, it was easy to tell the difference between a real photo taken by a human and a fake one made by a computer. But recently, "digital magicians" (AI video generators) have become so good at their tricks that they can create videos so realistic, even experts can't tell them apart from reality.

This paper introduces a new team of "Digital Detectives" called BusterX, along with a new training ground and a rulebook to help them catch these fakes.

Here is the breakdown of their mission, explained simply:

1. The Problem: The "Magic Trick" is Getting Better

Think of old AI videos like a child's drawing of a cat. You could easily spot the ears were wrong or the tail was missing. But today's AI is like a master illusionist. It can make a video of a person talking, walking, or dancing that looks 100% real.

The old "detectives" (previous AI tools) were like security guards who only knew how to spot the child's drawings. When the master illusionists showed up, the guards got confused and let the fakes through. Also, the old tools were "black boxes"—they would just say "Fake!" without explaining why, which made it hard for humans to trust them.

2. The New Training Ground: GenBuster-200K

To train better detectives, you need better practice materials. The authors realized the old practice videos were too easy (like training a police dog on a squeaky toy).

They built GenBuster-200K, a massive library of over 200,000 videos.

  • The Mix: It has real videos and super-realistic fake videos.
  • The Fairness Rule: They made sure the library wasn't biased. Just like a real city has people of all ages, genders, and backgrounds, this dataset includes everyone. They didn't just train the AI on "young men in suits"; they trained it on everyone, everywhere.
  • The "Wild" Zone: They even included videos that had been compressed by social media (like TikTok or YouTube), because real fakes usually get squished and pixelated when people share them.
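The three design rules above can be sketched in code. This is a toy illustration, not the paper's actual pipeline: the group names, field names, and the `simulate_sharing` helper are assumptions made up for this example.

```python
import random
from collections import Counter

# Toy pool of video clips tagged with a demographic group and a real/fake label.
# Three groups x two labels x 50 clips each = 300 entries.
POOL = [
    {"id": i, "group": g, "label": lbl}
    for i, (g, lbl) in enumerate(
        [(g, lbl) for g in ("group_a", "group_b", "group_c")
                  for lbl in ("real", "fake")] * 50
    )
]

def balanced_sample(pool, per_bucket):
    """Draw the same number of clips from every (group, label) bucket,
    so no demographic or label dominates the training set."""
    rng = random.Random(0)
    buckets = {}
    for clip in pool:
        buckets.setdefault((clip["group"], clip["label"]), []).append(clip)
    sample = []
    for _, clips in sorted(buckets.items()):
        sample.extend(rng.sample(clips, per_bucket))
    return sample

def simulate_sharing(clip, quality=0.4):
    """Mimic the 'Wild Zone': tag a clip as re-encoded at social-media quality."""
    return {**clip, "compressed": True, "quality": quality}

train = [simulate_sharing(c) for c in balanced_sample(POOL, 10)]
counts = Counter((c["group"], c["label"]) for c in train)
print(sorted(counts.values()))  # every bucket contributes equally: [10, 10, 10, 10, 10, 10]
```

The point is the shape of the recipe: stratify sampling across groups and labels, then apply a compression step so the detector never trains only on pristine footage.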

3. The New Rulebook: GenBuster-Bench

Instead of just giving the detectives a final exam with one big test, the authors created a three-level challenge course:

  • Level 1 (The Classroom): Can the detective spot fakes made by the tools they've seen before? (Easy mode).
  • Level 2 (The New Villain): Can the detective spot fakes made by brand new tools they've never seen? This tests if they learned the principles of forgery or just memorized the old tricks.
  • Level 3 (The Real World): Can the detective spot a fake that has been posted on social media, compressed, and shared a hundred times? This is the hardest test.
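The three-level course above is, structurally, just an evaluation loop over three test tracks. Here is a minimal sketch under that assumption; the track names, the toy detector, and the artifact scores are invented for illustration and do not come from the paper.

```python
def toy_detector(clip):
    """Stand-in detector: flags a clip as fake when its artifact score
    clears a fixed threshold."""
    return "fake" if clip["artifact_score"] > 0.5 else "real"

# Three tracks of increasing difficulty; compression in the "wild" track
# washes out the artifact cue (0.4 on a fake clip), fooling the toy detector.
TRACKS = {
    "in_domain":    [{"artifact_score": 0.9, "label": "fake"},
                     {"artifact_score": 0.1, "label": "real"}],
    "unseen_tools": [{"artifact_score": 0.6, "label": "fake"},
                     {"artifact_score": 0.2, "label": "real"}],
    "in_the_wild":  [{"artifact_score": 0.4, "label": "fake"},
                     {"artifact_score": 0.3, "label": "real"}],
}

def evaluate(detector, tracks):
    """Report accuracy separately per track, so a detector that memorized
    in-domain tricks can't hide behind one aggregate number."""
    report = {}
    for name, clips in tracks.items():
        correct = sum(detector(c) == c["label"] for c in clips)
        report[name] = correct / len(clips)
    return report

report = evaluate(toy_detector, TRACKS)
print(report)  # perfect on easy tracks, 0.5 (coin-flip) in the wild
```

Reporting per-track rather than overall accuracy is what makes the benchmark diagnostic: the gap between Level 1 and Level 3 measures generalization, not memorization.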

They also added a "Judge" (another AI) that doesn't just check if the answer is right, but grades the explanation. If the detective says "It's fake because the eyes are weird," the Judge checks: "Did you actually look at the eyes, or did you just guess?"
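A judge like this can be sketched as a two-part check: is the verdict right, and does the cited evidence overlap with artifacts actually present in the video? The scoring scheme below is an assumption for illustration, not the paper's exact metric (which uses another AI model as the judge).

```python
def judge(prediction, ground_truth):
    """Grade a detector's answer on two axes: correctness of the verdict,
    and how much of its cited evidence is grounded in real artifacts."""
    verdict_ok = prediction["verdict"] == ground_truth["label"]
    cited = set(prediction["evidence"])
    real_cues = set(ground_truth["artifacts"])
    grounding = len(cited & real_cues) / len(cited) if cited else 0.0
    return {"correct": verdict_ok, "grounding": round(grounding, 2)}

gt = {"label": "fake", "artifacts": ["flickering hands", "plastic skin"]}
good = {"verdict": "fake", "evidence": ["flickering hands"]}
lazy = {"verdict": "fake", "evidence": ["weird eyes"]}  # right answer, ungrounded

print(judge(good, gt))  # {'correct': True, 'grounding': 1.0}
print(judge(lazy, gt))  # {'correct': True, 'grounding': 0.0}
```

The `lazy` case is exactly the failure mode the benchmark targets: a correct verdict reached by guessing scores zero on grounding.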

4. The Star Detective: BusterX

Meet BusterX. Unlike the old guards who just shouted "Fake!" or "Real!", BusterX is a Reasoning Detective.

  • How it works: Instead of guessing, BusterX puts on its thinking cap and writes a step-by-step report. It looks at the video frame by frame and asks questions like:
    • "Does the shadow move with the sun?"
    • "Do the person's clothes ripple naturally when they walk?"
    • "Is the skin texture too smooth, like plastic?"
  • The Secret Weapon (Reinforcement Learning): The authors didn't just teach BusterX the answers. They used a technique called Reinforcement Learning. Imagine a dog trainer: every time BusterX finds a real clue and explains it well, it gets a "treat" (a reward). If it guesses wrong or writes a lazy explanation, it gets a gentle "no." Over time, BusterX learns to think like a human forensic expert.
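The "treat" analogy maps onto a reward function that pays for a correct verdict and for a substantive explanation. This is a hedged sketch of the idea only: the weights and the word-count heuristic for explanation quality are illustrative assumptions, not BusterX's actual reward design.

```python
def reward(verdict, label, explanation, w_correct=1.0, w_explain=0.5):
    """RL-style reward: positive for a right verdict, negative for a wrong
    one, plus a bonus that grows with the amount of written reasoning."""
    correct = 1.0 if verdict == label else -1.0
    # Crude proxy for explanation quality: more words of reasoning, capped at 1.0.
    effort = min(len(explanation.split()) / 20.0, 1.0)
    return w_correct * correct + w_explain * effort

lazy = reward("fake", "fake", "Fake.")
careful = reward("fake", "fake",
                 "The hand geometry flickers between frames and the skin "
                 "texture is unnaturally smooth, both typical generator artifacts.")
wrong = reward("real", "fake", "Looks normal to me, lighting is consistent.")

print(wrong < lazy < careful)  # thoughtful, correct answers score highest → True
```

In practice a learned judge, not a word count, would score the explanation, but the ordering is the point: correct-and-explained beats correct-but-lazy, which beats wrong.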

5. The Results: Why This Matters

When they put BusterX to the test:

  • Old Detectors: They failed miserably on the "Real World" level, getting confused by new AI tools.
  • Big AI Models: Some giant general-purpose AI models could sometimes guess right, but they were systematically biased, leaning toward calling everything fake (or everything real).
  • BusterX: It won. It didn't just guess; it provided reasons. It could tell you exactly why a video was fake (e.g., "The person's hand flickers between frames"), and it stayed calm and accurate even when the video was messy or from a new AI generator.

The Big Picture

This paper is like upgrading from a metal detector that beeps at everything to a forensic scientist with a magnifying glass.

In a world where AI can create deepfakes that could ruin reputations or spread lies, we need tools that don't just say "That's a lie," but can prove it with clear, logical evidence. BusterX is that tool, ready to help us keep the truth safe in a world of digital magic.