Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

This paper proposes a novel semi-supervised video anomaly detection framework that leverages Multimodal Large Language Models to generate and compare high-level textual descriptions of object interactions, thereby achieving state-of-the-art performance on complex anomalies while providing inherent explainability.

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Published 2026-03-02

Imagine you are a security guard watching a busy city street on a bank of monitors. Your job is to spot anything weird: a person running the wrong way, a car driving on the sidewalk, or two people fighting.

The problem is, there are thousands of hours of video. You can't watch every second. So, you hire a team of AI robots to do the watching for you.

The Old Way: The "Pixel Peepers"

For a long time, AI tried to solve this by looking at the video like a super-strict math teacher. It would look at every single pixel (the tiny dots that make up the image) and try to predict what the next frame should look like.

  • The Flaw: If a person walks normally, the AI expects the pixels to move in a specific pattern. If they don't, it screams "ALARM!" But this is like a teacher who fails a student just because they wrote their name in blue ink instead of black. The AI gets confused by complex situations, like a dog walking on a leash (normal) vs. a dog dragging a person (weird). It struggles to understand why something is wrong; it just knows the pixels look "off."

The New Way: The "Storyteller" (MLLM-EVAD)

This paper introduces a new method called MLLM-EVAD. Instead of looking at pixels, this system acts like a super-smart storyteller who watches the video and writes a diary entry about what is happening.

Here is how it works, step-by-step:

1. The Detective's Magnifying Glass

First, the system uses a standard "eye" (an object detector) to find people, cars, and dogs in the video. It doesn't just see a blur of color; it knows, "That's a person," and "That's a car."

2. The Time-Traveling Interview

The system doesn't just look at one frozen moment. It picks two moments in time (say, one second apart) and zooms in on pairs of objects that are close to each other.

  • Analogy: Imagine the AI is a reporter interviewing two people standing next to each other. It asks, "What are you two doing?"
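The pairing step above can be sketched in a few lines. This is a minimal illustration of the idea (nearby objects across two moments get grouped), not the paper's actual code; the box format, distance threshold, and function names are assumptions for the example.

```python
# Sketch of the object-pairing step: given detected bounding boxes
# (x1, y1, x2, y2) from a frame, keep only pairs whose centers are close,
# so the "interview" focuses on objects that might be interacting.

def center(box):
    """Center point (x, y) of a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def nearby_pairs(boxes, max_dist=100.0):
    """Return index pairs of boxes whose centers are within max_dist pixels."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xa, ya), (xb, yb) = center(boxes[i]), center(boxes[j])
            if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= max_dist:
                pairs.append((i, j))
    return pairs

# A person and a dog standing close together get paired;
# a car far across the scene does not.
boxes = [(0, 0, 50, 100), (60, 40, 90, 80), (400, 400, 450, 450)]
print(nearby_pairs(boxes))  # → [(0, 1)]
```

Only these close pairs get cropped and sent on to the MLLM, which keeps the number of "interviews" manageable.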

3. The Magic Translator (The MLLM)

This is the secret sauce. The AI sends these zoomed-in pictures to a Multimodal Large Language Model (MLLM). Think of the MLLM as a genius writer who can look at a picture and instantly write a perfect sentence describing it.

  • Normal Video: The MLLM might write: "A person is walking a dog on a leash along the sidewalk."
  • Weird Video: The MLLM might write: "A person is pushing a large box containing another person down the street."

4. The "Normal" Library

During the training phase, the system watches hours of normal video. It collects all the sentences the MLLM writes about normal things and builds a Library of Normal Stories.

  • It saves sentences like: "Two people walking side-by-side," "A car driving down the lane," "A dog running on a leash."
  • It throws away the duplicates so the library is small and tidy.
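Building that tidy library can be sketched as a near-duplicate filter over the MLLM's sentences. The paper presumably compares text embeddings; here Python's standard-library `difflib` string similarity stands in so the example is self-contained, and the threshold value is an assumption.

```python
# Sketch of building the "Library of Normal Stories": keep a description
# only if no already-kept entry is nearly identical to it.
from difflib import SequenceMatcher

def similar(a, b):
    """Crude text similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_library(descriptions, threshold=0.9):
    """Deduplicate descriptions, keeping one representative per 'story'."""
    library = []
    for desc in descriptions:
        if all(similar(desc, kept) < threshold for kept in library):
            library.append(desc)
    return library

normal_descriptions = [
    "A person is walking a dog on a leash.",
    "A person is walking a dog on a leash.",  # duplicate, thrown away
    "A car is driving down the lane.",
]
print(build_library(normal_descriptions))  # two unique sentences remain
```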

5. The "Odd One Out" Test

When the system watches a new video (the test), it asks the MLLM to write a story about what it sees. Then, it compares that new story to the Library of Normal Stories.

  • If the new story is very similar to the library (e.g., "A person walking"), the system says, "All good."
  • If the new story is totally different (e.g., "A person pushing a box with a human inside"), the system says, "ALARM! This doesn't match our library!"
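The "odd one out" test then reduces to a nearest-neighbor lookup: score the new story by how poorly it matches its best match in the library. Again, `difflib` similarity is a self-contained stand-in for whatever text-matching the paper actually uses, and the sentences are illustrative.

```python
# Sketch of the anomaly test: 1 minus the best similarity to any
# normal description. High score = the story doesn't match the library.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def anomaly_score(new_description, library):
    """Higher when the new description matches nothing in the library."""
    best_match = max(similar(new_description, s) for s in library)
    return 1.0 - best_match

library = [
    "A person is walking a dog on a leash along the sidewalk.",
    "Two people are walking side-by-side.",
]
ok = anomaly_score("A person is walking a dog on a leash.", library)
odd = anomaly_score("A person is pushing a large box containing another person.", library)
print(ok < odd)  # the odd story scores higher, triggering the alarm
```

A threshold on this score (tuned on the normal training video) decides when to say "ALARM!", and the best-matching library sentence is what makes the alarm explainable.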

Why This is a Game-Changer

1. It Explains Itself (The "Why" Factor)
Old AI systems are like a smoke alarm that screams "Fire!" without telling you where or why.
This new system is like a detective who points at the screen and says, "I'm raising an alarm because the story says 'a person is being pushed in a box,' but in our library of normal events, people only walk on sidewalks."
This makes it Explainable. You know exactly why the computer is worried.

2. It Understands Relationships
Old AI struggles with interactions. It sees a person and a car, but doesn't know if they are friends or enemies.
Because this system writes sentences, it understands relationships. It knows the difference between "A person walking next to a car" (normal) and "A person hitting a car" (abnormal).

3. It Works on New Scenes Without Re-Training
Most AI needs to be re-taught every time you move the camera to a new street. This system is smarter. It just needs to watch a few hours of "normal" video at the new location to build its new "Library of Normal Stories." It doesn't need to re-learn how to see; it just needs to learn what "normal" looks like in that specific neighborhood.

The Catch

The only downside is that the "Genius Writer" (the MLLM) is very smart but also very slow and hungry for electricity. It's like hiring a Nobel Prize-winning author to write a grocery list; it's overkill and takes a long time. So, this system is currently better for analyzing recorded footage later, rather than stopping a crime in real-time.

The Bottom Line

This paper proposes a shift from "looking at pixels" to "understanding stories." By turning video into language, the AI can finally understand complex human interactions and explain its decisions in plain English, making it a much more trustworthy tool for security and safety.
