No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

This paper introduces LAVIDA, a zero-shot video anomaly detection framework that leverages a Multimodal Large Language Model and an Anomaly Exposure Sampler to train exclusively on pseudo-anomalies, achieving state-of-the-art performance across multiple benchmarks without requiring real anomaly data.

Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao

Published 2026-03-12

Imagine you are hiring a security guard to watch over a thousand different places: a busy park, a quiet library, a chaotic construction site, and a high-speed racetrack.

The Old Problem:
Traditionally, to train this guard, you'd have to show them thousands of videos of specific bad things happening in specific places.

  • "Here is a fight in a park."
  • "Here is a car crash on a racetrack."
  • "Here is a robbery in a bank."

If a new type of trouble happens—say, someone starts juggling chainsaws in the library—the guard has no idea what to do because they've never seen "chainsaw juggling" before. They are stuck in a "closed world" where they only know what they were explicitly taught.

The New Solution (LAVIDA):
The paper introduces LAVIDA, a new kind of AI security guard. Instead of memorizing specific bad events, LAVIDA learns the concept of "badness" and uses a super-smart brain (a Multimodal Large Language Model, or MLLM) to understand context.

Here is how it works, broken down with simple analogies:

1. The "Fake Crime" Training Camp (Anomaly Exposure Sampler)

You can't train a guard on real crimes because real crimes are rare and dangerous to film. So, LAVIDA uses a clever trick.

  • The Analogy: Imagine you have a photo album of normal animals (parrots, dogs, elephants). To teach the guard what "weird" looks like, you take a picture of a parrot and tell the guard, "In this scene, the parrot is the intruder." Then you take a picture of a car and say, "Here, the car is the intruder."
  • How it works: The AI takes ordinary images from existing segmentation datasets (which already outline every object in the picture) and randomly labels different objects as "suspicious." It creates thousands of "fake anomalies." This teaches the AI to look for anything that doesn't fit the current scene, without ever needing a single video of a real crime.
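To make the trick concrete, here is a toy sketch of the pseudo-anomaly idea (not the paper's actual code; the function name and toy segmentation map are made up): given an image whose objects are already segmented, pick one object at random and declare its pixels the "anomaly" label.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pseudo_anomaly(seg_map: np.ndarray) -> tuple[int, np.ndarray]:
    """Pick one segmented object at random and declare it the 'intruder'.

    seg_map: 2D integer array, each value an object id (0 = background).
    Returns the chosen id and a binary mask marking that object as anomalous.
    """
    object_ids = np.unique(seg_map)
    object_ids = object_ids[object_ids != 0]           # ignore background
    chosen = int(rng.choice(object_ids))               # random "suspicious" object
    anomaly_mask = (seg_map == chosen).astype(np.uint8)
    return chosen, anomaly_mask

# Toy segmentation map: 0 = background, 1 = parrot, 2 = car
seg = np.array([[0, 1, 1],
                [0, 2, 2],
                [0, 0, 2]])
obj_id, mask = make_pseudo_anomaly(seg)
```

Run this in a loop over a big segmentation dataset and you get an endless stream of free "crimes" to train on, one reason no real anomaly footage is needed.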

2. The "Super-Brain" Translator (MLLM Integration)

LAVIDA connects a video camera to a super-smart language brain (like a very advanced version of ChatGPT that can see).

  • The Analogy: A normal security camera just sees pixels moving. LAVIDA's brain can read a prompt like, "Find the thing that is hurting someone." It understands that "hurting" means different things in different places. In a kitchen, it might be a knife fight; in a street, it might be a car hitting a pedestrian.
  • The Magic: Because this brain understands language and deep meaning, it can generalize. If it sees a "riot" in a video, it understands the concept of chaos, even if it's never seen a riot before. It doesn't just match patterns; it understands the story.
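The paper's actual architecture is far richer than this, but the open-vocabulary intuition can be sketched with a toy CLIP-style similarity check (all vectors here are invented 4-dimensional stand-ins for a real vision-language embedding space): a frame whose embedding lines up with the text prompt scores as anomalous, even if that exact event was never in training.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d embeddings standing in for a shared vision-language space.
text_prompt   = np.array([1.0, 0.0, 0.5, 0.0])   # "find the thing hurting someone"
normal_frame  = np.array([0.0, 1.0, 0.0, 0.3])   # pedestrians walking
anomaly_frame = np.array([0.9, 0.1, 0.6, 0.0])   # a collision

score_normal  = cosine(text_prompt, normal_frame)
score_anomaly = cosine(text_prompt, anomaly_frame)
```

Because matching happens in a shared meaning space rather than against a fixed list of labels, swapping the prompt text is all it takes to hunt for a different kind of trouble.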

3. The "Noise-Canceling" Filter (Token Compression)

Videos are huge. A 10-second clip contains millions of pixels, which a vision model chops into thousands of visual tokens, most of them just boring background (sky, walls, grass). Processing all of that is slow and expensive.

  • The Analogy: Imagine trying to find a needle in a haystack. The old way was to look at every single piece of hay. LAVIDA uses a "Reverse Attention" filter. It quickly identifies the "hay" (the boring background) and ignores it. It focuses only on the "needle" (the weird, moving, suspicious stuff).
  • The Result: It throws away 80% of the visual data that doesn't matter, making the AI faster and cheaper to run, while keeping its eyes glued to the interesting parts.
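A minimal sketch of this kind of token pruning (the scoring rule and 20% keep ratio here are illustrative assumptions, not the paper's exact Reverse Attention mechanism): rank the visual tokens by a saliency score and keep only the top fraction.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, scores: np.ndarray,
                    keep_ratio: float = 0.2) -> np.ndarray:
    """Drop the lowest-scoring (most 'background-like') tokens.

    tokens: (N, D) visual token embeddings; scores: (N,) saliency per token.
    Keeping only the top keep_ratio fraction mimics the ~80% reduction.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]        # indices of the salient tokens
    return tokens[np.sort(keep_idx)]               # preserve original order

tokens = np.random.default_rng(1).normal(size=(100, 8))   # 100 patch tokens
scores = np.linspace(0.0, 1.0, 100)                       # toy saliency ramp
kept = compress_tokens(tokens, scores)                    # 20 tokens survive
```

The language model downstream now reads 20 tokens instead of 100, which is where the speed and cost savings come from.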

4. The "Zoom-In" Lens (Multi-Level Mask Decoder)

Finally, LAVIDA doesn't just say, "Something is wrong." It tells you exactly where.

  • The Analogy: It's like a detective who can point to a specific person in a crowd and say, "That guy in the blue shirt is the problem," rather than just pointing at the whole street. It can highlight the exact pixels of the anomaly on the screen.
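Pixel-level localization can be pictured with a tiny sketch (a hypothetical helper, not the paper's multi-level decoder): threshold a per-pixel anomaly heatmap into a binary mask, then read off a bounding box around the flagged region.

```python
import numpy as np

def localize(heatmap: np.ndarray, thresh: float = 0.5):
    """Turn a per-pixel anomaly heatmap into a mask plus a bounding box."""
    mask = heatmap > thresh
    ys, xs = np.where(mask)
    if len(ys) == 0:
        return mask, None                          # nothing anomalous found
    box = (xs.min(), ys.min(), xs.max(), ys.max()) # (x0, y0, x1, y1)
    return mask, box

heat = np.zeros((6, 6))
heat[2:4, 3:5] = 0.9          # the "guy in the blue shirt"
mask, box = localize(heat)
```

The real decoder produces these masks at multiple resolutions, but the payoff is the same: the system points at the exact pixels, not just the whole frame.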

Why is this a Big Deal?

  • Zero-Shot: It works on scenarios it has never seen before. You don't need to retrain it for every new city or new type of crime.
  • Open-World: It can detect any anomaly, from a falling person to an explosion to a robbery, as long as you can describe it in words.
  • No Real Data Needed: It was trained entirely on "fake" anomalies made from normal pictures. This solves the huge problem of not having enough real crime data to train AI.

In Summary:
LAVIDA is like a security guard who has read every book on human behavior and watched every movie ever made. Instead of memorizing a list of "banned actions," it understands the meaning of a scene. If something looks weird, feels wrong, or breaks the story of the video, it spots it immediately, even if it's a situation the guard has never encountered in real life.