Leveraging large multimodal models for audio-video deepfake detection: a pilot study

This paper introduces AV-LMMDetect, a supervised-fine-tuned large multimodal model built on Qwen 2.5 Omni. By casting audio-video deepfake detection as a prompted classification task, it addresses the generalization limitations of existing detectors and achieves state-of-the-art performance on key datasets.

Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

Published 2026-03-02

Imagine you are a detective trying to spot a forgery. In the past, you might have looked at a painting and checked the brushstrokes, or listened to a voice recording and checked for static. But today, scammers are using super-smart AI to create "deepfakes"—videos where a person's face and voice are swapped or faked so perfectly that they look and sound real.

This paper introduces a new detective: AV-LMMDetect. Here is how it works, explained simply:

1. The Problem: The "Specialist" vs. The "Generalist"

  • Old Detectives (Small Models): Imagine hiring a specialist who only knows how to spot fake eyes or fake lips. They are great at their specific job, but if the scammer changes the trick (like faking the voice instead of the face), the specialist gets confused. They are like a key that opens only one specific door.
  • The New Detective (AV-LMMDetect): This is a "Generalist" detective. It's built on a massive, super-smart AI brain (called Qwen 2.5 Omni) that has read almost everything on the internet and watched countless videos. It doesn't just look at the face or listen to the voice separately; it understands how they should work together.

2. The Big Idea: Asking a Simple Question

Instead of running complex, separate tests, the researchers taught this AI a simple game. They showed it a video and asked one question:

"Is this video Real or Fake?"

The AI is trained to answer with just one word: "Real" or "Fake."
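The prompted-classification framing above can be sketched in a few lines. This is an illustrative mock-up, not the paper's actual code: the prompt wording and the `classify` helper are assumptions, and the real system feeds the video and audio into Qwen 2.5 Omni rather than a placeholder string.

```python
# Hedged sketch of casting deepfake detection as a prompted
# classification task. The real model is Qwen 2.5 Omni; here we only
# show how a free-text reply is mapped onto the two allowed labels.

PROMPT = "Is this video Real or Fake? Answer with exactly one word."

def classify(model_answer: str) -> str:
    """Map the model's reply onto 'Real', 'Fake', or 'Unknown'."""
    words = model_answer.strip().split()
    if not words:
        return "Unknown"
    first = words[0].strip(".,!?").lower()
    if first == "real":
        return "Real"
    if first == "fake":
        return "Fake"
    # An untrained base model often hedges instead of committing.
    return "Unknown"

print(classify("Fake."))          # a fine-tuned model commits to a label
print(classify("I'm not sure"))   # a base model's hedge maps to Unknown
```

Constraining the output space to two tokens is what makes the supervised fine-tuning signal so clean: the loss only has to push the model toward one of two answers.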

3. How They Trained the Detective (The Two-Stage Workout)

You can't just hand a super-smart AI a deepfake and expect it to know the answer immediately. It needs training. The researchers used a two-step workout routine:

  • Stage 1: The "Alignment" Stretch (LoRA)
    Imagine the AI is a brilliant student who knows everything but doesn't know how to take a specific test. In this stage, they "stretch" the AI's brain to understand the rules of the game without changing its core knowledge. They teach it: "When I ask 'Real or Fake?', you must pick one of those two words." This is quick and efficient.
  • Stage 2: The "Full Muscle" Build (Full Fine-Tuning)
    Now, they unlock the AI's eyes and ears (the visual and audio encoders). They let the AI study thousands of deepfakes and real videos, learning to spot the tiny, invisible glitches where the lips don't quite match the voice, or the background noise doesn't match the scene. This is where it learns the subtle "fingerprints" of a forgery.
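The Stage 1 "stretch" (LoRA) can be made concrete with a tiny numeric sketch. This is only the core LoRA arithmetic on toy 2x2 matrices; the actual ranks, shapes, and which layers get adapters are not given in this summary and are assumptions here.

```python
# Minimal sketch of the LoRA update: the frozen base weight W is left
# untouched, and only two small low-rank matrices B (d x r) and A (r x k)
# are trained. The effective weight is W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain-Python matrix multiply for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, alpha, r):
    """Effective weight with a rank-r LoRA adapter applied to frozen W."""
    delta = matmul(B, A)          # low-rank update, d x k
    scale = alpha / r             # standard LoRA scaling factor
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x2 base weight; a rank-1 adapter (2x1 B times 1x2 A) is the
# only thing Stage 1 would update -- far fewer parameters than W itself.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.0]]
A = [[0.0, 1.0]]
W_eff = lora_forward(W, A, B, alpha=2, r=1)
print(W_eff)  # -> [[1.0, 1.0], [0.0, 1.0]]
```

The point of the two stages is visible in the parameter counts: Stage 1 trains only the small `A` and `B` matrices (cheap, preserves core knowledge), while Stage 2 unfreezes everything, including the audio and visual encoders.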

4. The Results: Beating the Scammers

The researchers tested their new detective against the old specialists and the "base" AI (the one before training).

  • The Base AI: If you asked the untrained AI, it would say, "I'm not sure, maybe real, maybe fake?" It was too polite and unsure.
  • The Old Specialists: They did okay on videos they had seen before but failed miserably when the scammers tried new tricks or different languages.
  • AV-LMMDetect (The Winner):
    • On standard tests, it matched the best experts.
    • On the hardest tests (where the scammers used new languages and new AI tools the detective had never seen), it crushed the competition.
    • The Analogy: If the old detectors are like a security guard who only checks IDs from one country, AV-LMMDetect is like a polyglot security guard who can spot a fake passport from any country, even if they've never seen that specific passport before.

5. Why This Matters

Deepfakes are becoming a huge threat to trust in news, politics, and security. This paper proves that instead of building tiny, fragile tools for every new type of fake, we can use one giant, smart AI and teach it to be a master detective. It's more flexible, more accurate, and much harder for scammers to fool.

In short: They took a super-smart AI, gave it a two-step training program to become a deepfake detective, and found that it can spot fakes better than any previous method, especially when the fakes are trying to be tricky and new.
