Leveraging large multimodal models for audio-video deepfake detection: a pilot study

This paper introduces AV-LMMDetect, a supervised-fine-tuned large multimodal model built on Qwen 2.5 Omni. By casting audio-video deepfake detection as a prompted classification task, it addresses the generalization limitations of existing detectors and achieves state-of-the-art performance on key datasets.

Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

Published 2026-03-02

Imagine you are a detective trying to spot a forgery. In the past, you might have looked at a painting and checked the brushstrokes, or listened to a voice recording and checked for static. But today, scammers are using super-smart AI to create "deepfakes"—videos where a person's face and voice are swapped or faked so perfectly that they look and sound real.

This paper introduces a new detective: AV-LMMDetect. Here is how it works, explained simply:

1. The Problem: The "Specialist" vs. The "Generalist"

  • Old Detectives (Small Models): Imagine hiring a specialist who only knows how to spot fake eyes or fake lips. They are great at their specific job, but if the scammer changes the trick (like faking the voice instead of the face), the specialist gets confused. They are like a key that opens only one specific door.
  • The New Detective (AV-LMMDetect): This is a "Generalist" detective. It's built on a massive, super-smart AI brain (called Qwen 2.5 Omni) that has read almost everything on the internet and watched countless videos. It doesn't just look at the face or listen to the voice separately; it understands how they should work together.

2. The Big Idea: Asking a Simple Question

Instead of running complex, separate tests, the researchers taught this AI a simple game. They showed it a video and asked one question:

"Is this video Real or Fake?"

The AI is trained to answer with just one word: "Real" or "Fake."
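The prompted-classification framing above can be sketched in a few lines. This is an illustrative mock-up, not the paper's actual code: the prompt wording and the `classify` helper are assumptions, and the real system feeds the video and audio into Qwen 2.5 Omni rather than a placeholder string.

```python
# Hedged sketch of casting deepfake detection as a prompted
# classification task. The real model is Qwen 2.5 Omni; here we only
# show how a free-text reply is mapped onto the two allowed labels.

PROMPT = "Is this video Real or Fake? Answer with exactly one word."

def classify(model_answer: str) -> str:
    """Map the model's reply onto 'Real', 'Fake', or 'Unknown'."""
    words = model_answer.strip().split()
    if not words:
        return "Unknown"
    first = words[0].strip(".,!?").lower()
    if first == "real":
        return "Real"
    if first == "fake":
        return "Fake"
    # An untrained base model often hedges instead of committing.
    return "Unknown"

print(classify("Fake."))          # a fine-tuned model commits to a label
print(classify("I'm not sure"))   # a base model's hedge maps to Unknown
```

Constraining the output space to two tokens is what makes the supervised fine-tuning signal so clean: the loss only has to push the model toward one of two answers.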

3. How They Trained the Detective (The Two-Stage Workout)

You can't just hand a super-smart AI a deepfake and expect it to know the answer immediately. It needs training. The researchers used a two-step workout routine:

  • Stage 1: The "Alignment" Stretch (LoRA)
    Imagine the AI is a brilliant student who knows everything but doesn't know how to take a specific test. In this stage, they "stretch" the AI's brain to understand the rules of the game without changing its core knowledge. They teach it: "When I ask 'Real or Fake?', you must pick one of those two words." This is quick and efficient.
  • Stage 2: The "Full Muscle" Build (Full Fine-Tuning)
    Now, they unlock the AI's eyes and ears (the visual and audio encoders). They let the AI study thousands of deepfakes and real videos, learning to spot the tiny, invisible glitches where the lips don't quite match the voice, or the background noise doesn't match the scene. This is where it learns the subtle "fingerprints" of a forgery.
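The Stage 1 "stretch" (LoRA) can be made concrete with a tiny numeric sketch. This is only the core LoRA arithmetic on toy 2x2 matrices; the actual ranks, shapes, and which layers get adapters are not given in this summary and are assumptions here.

```python
# Minimal sketch of the LoRA update: the frozen base weight W is left
# untouched, and only two small low-rank matrices B (d x r) and A (r x k)
# are trained. The effective weight is W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain-Python matrix multiply for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, alpha, r):
    """Effective weight with a rank-r LoRA adapter applied to frozen W."""
    delta = matmul(B, A)          # low-rank update, d x k
    scale = alpha / r             # standard LoRA scaling factor
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x2 base weight; a rank-1 adapter (2x1 B times 1x2 A) is the
# only thing Stage 1 would update -- far fewer parameters than W itself.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.0]]
A = [[0.0, 1.0]]
W_eff = lora_forward(W, A, B, alpha=2, r=1)
print(W_eff)  # -> [[1.0, 1.0], [0.0, 1.0]]
```

The point of the two stages is visible in the parameter counts: Stage 1 trains only the small `A` and `B` matrices (cheap, preserves core knowledge), while Stage 2 unfreezes everything, including the audio and visual encoders.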

4. The Results: Beating the Scammers

The researchers tested their new detective against the old specialists and the "base" AI (the one before training).

  • The Base AI: If you asked the untrained AI, it would say, "I'm not sure, maybe real, maybe fake?" It was too polite and unsure.
  • The Old Specialists: They did okay on videos they had seen before but failed miserably when the scammers tried new tricks or different languages.
  • AV-LMMDetect (The Winner):
    • On standard tests, it matched the best experts.
    • On the hardest tests (where the scammers used new languages and new AI tools the detective had never seen), it crushed the competition.
    • The Analogy: If the old detectors are like a security guard who only checks IDs from one country, AV-LMMDetect is like a polyglot security guard who can spot a fake passport from any country, even if they've never seen that specific passport before.

5. Why This Matters

Deepfakes are becoming a huge threat to trust in news, politics, and security. This paper proves that instead of building tiny, fragile tools for every new type of fake, we can use one giant, smart AI and teach it to be a master detective. It's more flexible, more accurate, and much harder for scammers to fool.

In short: They took a super-smart AI, gave it a two-step training program to become a deepfake detective, and found that it can spot fakes better than any previous method, especially when the fakes are trying to be tricky and new.
