Language-guided Open-world Video Anomaly Detection under Weak Supervision

This paper introduces LaGoVAD, a language-guided open-world video anomaly detection framework that dynamically adapts to variable anomaly definitions via natural language prompts under weak supervision. The framework is supported by the newly proposed PreVAD dataset and validated by state-of-the-art zero-shot performance across seven benchmarks.

Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang

Published 2026-03-04

Imagine you are a security guard watching a live feed of a busy city street. Your job is to spot anything "wrong."

In the old days, security systems were like rigid robots. You programmed them with a fixed list of rules: "If you see a fire, scream. If you see a fight, scream." But what happens if the rules change?

  • Scenario A: It's a flu outbreak. The robot sees someone without a mask. It stays silent because "no mask" isn't on its "bad list."
  • Scenario B: It's a normal day. The robot sees a person running. It screams "ALARM!" because running looks like a chase.

The problem is that what counts as "abnormal" changes depending on the situation. A robot with a fixed list of rules can't handle this. This is what the paper calls "Concept Drift."

The New Solution: The "Smart Assistant" Guard

The authors propose a new system called LaGoVAD (Language-guided Open-world Video Anomaly Detector). Instead of a rigid robot, imagine a highly intelligent security guard who can talk to you.

Here is how it works, using simple analogies:

1. The "Magic Prompt" (Language Guidance)

Instead of hard-coding rules, you can simply talk to the system.

  • You say: "Today, I'm worried about people running in the library."
  • The System: "Got it. I will now flag anyone running in the library as an anomaly."
  • Later, you say: "Actually, today I only care about people stealing."
  • The System: "Understood. I will ignore running and focus only on theft."

This allows the system to adapt instantly to new rules without needing to be retrained or reprogrammed. It treats the definition of "bad" as a variable that you can change on the fly.
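Under the hood, this kind of language guidance usually boils down to comparing embeddings: the prompt and each video frame are encoded into the same vector space, and frames that are similar to the prompt get high anomaly scores. Here is a minimal sketch of that idea, not the paper's actual implementation; the random vectors stand in for a real text/video encoder, and all names are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def score_frames(frame_embs, prompt_emb):
    """Anomaly score per frame = similarity to the prompt text."""
    return np.array([cosine(f, prompt_emb) for f in frame_embs])

# Toy stand-ins for a real text/video encoder (CLIP-style).
rng = np.random.default_rng(0)
dim = 16
prompt_emb = rng.normal(size=dim)      # embeds "people running in the library"
frames = rng.normal(size=(30, dim))    # 30 ordinary frames
frames[12] = prompt_emb + 0.1 * rng.normal(size=dim)  # one frame matches the prompt

scores = score_frames(frames, prompt_emb)
print(scores.argmax())  # frame 12 scores highest
```

Changing the rule is then just a matter of encoding a different sentence: no retraining, only a new `prompt_emb`.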

2. The "Giant Library" (The PreVAD Dataset)

To teach this guard to understand your changing rules, you need to show it a massive amount of examples. Existing datasets were like small, dusty libraries with only a few books on "crime" or "traffic."

The authors built PreVAD, which is like a massive, modern digital library containing over 35,000 videos.

  • Diversity: It has videos of car crashes, animal attacks, factory accidents, and daily mishaps.
  • Descriptions: Unlike old datasets that just said "Bad Video," this one has detailed stories for every video (e.g., "A forklift fell into a hole in the warehouse").
  • Why it matters: Because the guard has read so many different stories, it can understand the concept of an accident, not just memorize specific pictures. This helps it recognize new types of problems it has never seen before.

3. The "Training Drills" (Regularization Strategies)

Teaching a computer to understand both video and language is hard. It's like trying to teach a dog to understand both a hand signal and a spoken command at the same time. The dog might get confused and just guess.

To prevent this, the authors used two special training drills:

  • Drill A: The "Time-Travel" Simulator (Dynamic Video Synthesis)
    In real life, bad things usually happen for just a few seconds in a long video. But old training data often had videos where the "bad part" was the whole video.

    • The Fix: The system artificially stitches together video clips to create fake scenarios. It might take a 10-second clip of a crash and insert it into a 5-minute video of a calm street. This teaches the system to spot the "needle in the haystack" and understand that bad things can be short or long.
  • Drill B: The "Spot the Difference" Game (Contrastive Learning)
    Sometimes, a video looks almost normal but has a tiny flaw. The system needs to learn the difference between "almost good" and "actually bad."

    • The Fix: The system is shown pairs of videos and forced to compare them. It learns to say, "This video looks like a robbery, but this one is just a movie scene." It learns to ignore the "fake" bad things and focus on the real ones.
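Drill A can be sketched in a few lines: take a short anomaly clip, splice it into a long normal video at a random position, and record frame-level labels so the model learns that anomalies can be brief. This is a toy illustration of the idea, not the paper's synthesis pipeline; the frame lists and names are hypothetical.

```python
import random

def splice(normal_frames, anomaly_frames, seed=None):
    """Insert a short anomaly clip into a long normal video at a random
    position, returning the spliced frames and frame-level 0/1 labels."""
    rng = random.Random(seed)
    cut = rng.randrange(len(normal_frames) + 1)
    frames = normal_frames[:cut] + anomaly_frames + normal_frames[cut:]
    labels = [0] * cut + [1] * len(anomaly_frames) + [0] * (len(normal_frames) - cut)
    return frames, labels

normal = [f"street_{i}" for i in range(300)]  # calm street (toy frames)
crash = [f"crash_{i}" for i in range(10)]     # 10-frame crash clip (toy frames)
frames, labels = splice(normal, crash, seed=42)
print(sum(labels), len(frames))  # 10 anomalous frames inside a 310-frame video
```

Training on many such spliced videos forces the detector to localize the needle instead of assuming the whole haystack is bad.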
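Drill B is a standard contrastive objective: each video should match its own description more strongly than any other description in the batch. Below is a minimal InfoNCE-style sketch of that idea (not the paper's exact loss); the toy embeddings and the `info_nce` name are illustrative assumptions.

```python
import numpy as np

def info_nce(video_embs, text_embs, temperature=0.1):
    """Contrastive loss: low when each video is most similar to its own
    caption (the diagonal), high when pairs are mismatched."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # pairwise video-text similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # diagonal = matching pairs

rng = np.random.default_rng(1)
texts = rng.normal(size=(4, 8))                     # 4 toy caption embeddings
aligned = texts + 0.05 * rng.normal(size=(4, 8))    # videos close to own captions
shuffled = texts[[1, 2, 3, 0]]                      # videos paired with wrong captions
print(info_nce(aligned, texts) < info_nce(shuffled, texts))  # True
```

Minimizing this loss is the "spot the difference" game: the model is rewarded for telling a real robbery apart from a movie scene that merely looks like one.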

The Result: A Super-Adaptable Guard

When the authors tested this new guard on seven different real-world benchmarks (from crime scenes to traffic jams), it didn't just perform well; it set new state-of-the-art zero-shot results.

  • Old Systems: "I only know how to detect explosions. If you show me a fire, I'm confused."
  • LaGoVAD: "You told me to look for fire? I see it right there. You want me to look for running instead? Done."

Summary

This paper introduces a new way to watch videos where you (the human) get to decide what is "weird" by simply typing a sentence. By building a giant library of examples and training the AI with smart drills, they created a system that doesn't just memorize rules—it understands the concept of an anomaly and adapts to your needs instantly.

It's the difference between a stuck record that plays the same song forever and a Spotify DJ that can instantly switch genres based on what you ask for.