Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

This paper proposes a novel semi-supervised video anomaly detection framework that leverages Multimodal Large Language Models to generate and compare high-level textual descriptions of object interactions, thereby achieving state-of-the-art performance on complex anomalies while providing inherent explainability.

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Published 2026-03-02

Imagine you are a security guard watching a busy city street on a bank of monitors. Your job is to spot anything weird: a person running the wrong way, a car driving on the sidewalk, or two people fighting.

The problem is, there are thousands of hours of video. You can't watch every second. So, you hire a team of AI robots to do the watching for you.

The Old Way: The "Pixel Peepers"

For a long time, AI tried to solve this by looking at the video like a super-strict math teacher. It would look at every single pixel (the tiny dots that make up the image) and try to predict what the next frame should look like.

  • The Flaw: If a person walks normally, the AI expects the pixels to move in a specific pattern. If they don't, it screams "ALARM!" But this is like a teacher who fails a student just because they wrote their name in blue ink instead of black. The AI gets confused by complex situations, like a dog walking on a leash (normal) vs. a dog dragging a person (weird). It struggles to understand why something is wrong; it just knows the pixels look "off."

The New Way: The "Storyteller" (MLLM-EVAD)

This paper introduces a new method called MLLM-EVAD. Instead of looking at pixels, this system acts like a super-smart storyteller who watches the video and writes a diary entry about what is happening.

Here is how it works, step-by-step:

1. The Detective's Magnifying Glass

First, the system uses a standard "eye" (an object detector) to find people, cars, and dogs in the video. It doesn't just see a blur of color; it knows, "That's a person," and "That's a car."

2. The Time-Traveling Interview

The system doesn't just look at one frozen moment. It picks two moments in time (say, one second apart) and zooms in on pairs of objects that are close to each other.

  • Analogy: Imagine the AI is a reporter interviewing two people standing next to each other. It asks, "What are you two doing?"
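The pairing step above can be sketched in a few lines. This is a minimal illustration of the idea (nearby objects across two moments get grouped), not the paper's actual code; the box format, distance threshold, and function names are assumptions for the example.

```python
# Sketch of the object-pairing step: given detected bounding boxes
# (x1, y1, x2, y2) from a frame, keep only pairs whose centers are close,
# so the "interview" focuses on objects that might be interacting.

def center(box):
    """Center point (x, y) of a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def nearby_pairs(boxes, max_dist=100.0):
    """Return index pairs of boxes whose centers are within max_dist pixels."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xa, ya), (xb, yb) = center(boxes[i]), center(boxes[j])
            if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= max_dist:
                pairs.append((i, j))
    return pairs

# A person and a dog standing close together get paired;
# a car far across the scene does not.
boxes = [(0, 0, 50, 100), (60, 40, 90, 80), (400, 400, 450, 450)]
print(nearby_pairs(boxes))  # → [(0, 1)]
```

Only these close pairs get cropped and sent on to the MLLM, which keeps the number of "interviews" manageable.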

3. The Magic Translator (The MLLM)

This is the secret sauce. The AI sends these zoomed-in pictures to a Multimodal Large Language Model (MLLM). Think of the MLLM as a genius writer who can look at a picture and instantly write a perfect sentence describing it.

  • Normal Video: The MLLM might write: "A person is walking a dog on a leash along the sidewalk."
  • Weird Video: The MLLM might write: "A person is pushing a large box containing another person down the street."

4. The "Normal" Library

During the training phase, the system watches hours of normal video. It collects all the sentences the MLLM writes about normal things and builds a Library of Normal Stories.

  • It saves sentences like: "Two people walking side-by-side," "A car driving down the lane," "A dog running on a leash."
  • It throws away the duplicates so the library is small and tidy.
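Building that tidy library can be sketched as a near-duplicate filter over the MLLM's sentences. The paper presumably compares text embeddings; here Python's standard-library `difflib` string similarity stands in so the example is self-contained, and the threshold value is an assumption.

```python
# Sketch of building the "Library of Normal Stories": keep a description
# only if no already-kept entry is nearly identical to it.
from difflib import SequenceMatcher

def similar(a, b):
    """Crude text similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_library(descriptions, threshold=0.9):
    """Deduplicate descriptions, keeping one representative per 'story'."""
    library = []
    for desc in descriptions:
        if all(similar(desc, kept) < threshold for kept in library):
            library.append(desc)
    return library

normal_descriptions = [
    "A person is walking a dog on a leash.",
    "A person is walking a dog on a leash.",  # duplicate, thrown away
    "A car is driving down the lane.",
]
print(build_library(normal_descriptions))  # two unique sentences remain
```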

5. The "Odd One Out" Test

When the system watches a new video (the test), it asks the MLLM to write a story about what it sees. Then, it compares that new story to the Library of Normal Stories.

  • If the new story is very similar to the library (e.g., "A person walking"), the system says, "All good."
  • If the new story is totally different (e.g., "A person pushing a box with a human inside"), the system says, "ALARM! This doesn't match our library!"
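The "odd one out" test then reduces to a nearest-neighbor lookup: score the new story by how poorly it matches its best match in the library. Again, `difflib` similarity is a self-contained stand-in for whatever text-matching the paper actually uses, and the sentences are illustrative.

```python
# Sketch of the anomaly test: 1 minus the best similarity to any
# normal description. High score = the story doesn't match the library.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def anomaly_score(new_description, library):
    """Higher when the new description matches nothing in the library."""
    best_match = max(similar(new_description, s) for s in library)
    return 1.0 - best_match

library = [
    "A person is walking a dog on a leash along the sidewalk.",
    "Two people are walking side-by-side.",
]
ok = anomaly_score("A person is walking a dog on a leash.", library)
odd = anomaly_score("A person is pushing a large box containing another person.", library)
print(ok < odd)  # the odd story scores higher, triggering the alarm
```

A threshold on this score (tuned on the normal training video) decides when to say "ALARM!", and the best-matching library sentence is what makes the alarm explainable.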

Why This is a Game-Changer

1. It Explains Itself (The "Why" Factor)
Old AI systems are like a smoke alarm that screams "Fire!" without telling you where or why.
This new system is like a detective who points at the screen and says, "I'm raising an alarm because the story says 'a person is being pushed in a box,' but in our library of normal events, people only walk on sidewalks."
This makes it Explainable. You know exactly why the computer is worried.

2. It Understands Relationships
Old AI struggles with interactions. It sees a person and a car, but doesn't know if they are friends or enemies.
Because this system writes sentences, it understands relationships. It knows the difference between "A person walking next to a car" (normal) and "A person hitting a car" (abnormal).

3. It Works on New Scenes Without Re-Training
Most AI needs to be re-taught every time you move the camera to a new street. This system is smarter. It just needs to watch a few hours of "normal" video at the new location to build its new "Library of Normal Stories." It doesn't need to re-learn how to see; it just needs to learn what "normal" looks like in that specific neighborhood.

The Catch

The only downside is that the "Genius Writer" (the MLLM) is very smart but also very slow and hungry for electricity. It's like hiring a Nobel Prize-winning author to write a grocery list; it's overkill and takes a long time. So, this system is currently better for analyzing recorded footage later, rather than stopping a crime in real-time.

The Bottom Line

This paper proposes a shift from "looking at pixels" to "understanding stories." By turning video into language, the AI can finally understand complex human interactions and explain its decisions in plain English, making it a much more trustworthy tool for security and safety.
