Language-guided Open-world Video Anomaly Detection under Weak Supervision

This paper introduces LaGoVAD, a language-guided open-world video anomaly detection framework that dynamically adapts to variable anomaly definitions via natural language prompts under weak supervision. The framework is supported by the newly proposed PreVAD dataset and validated by state-of-the-art zero-shot performance across seven benchmarks.

Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang

Published 2026-03-04

Imagine you are a security guard watching a live feed of a busy city street. Your job is to spot anything "wrong."

In the old days, security systems were like rigid robots. You programmed them with a fixed list of rules: "If you see a fire, scream. If you see a fight, scream." But what happens if the rules change?

  • Scenario A: It's a flu outbreak. The robot sees someone without a mask. It stays silent because "no mask" isn't on its "bad list."
  • Scenario B: It's a normal day. The robot sees a person running. It screams "ALARM!" because running looks like a chase.

The problem is that what counts as "abnormal" changes depending on the situation. A robot with a fixed list of rules can't handle this. This is what the paper calls "Concept Drift."

The New Solution: The "Smart Assistant" Guard

The authors propose a new system called LaGoVAD (Language-guided Open-world Video Anomaly Detector). Instead of a rigid robot, imagine a highly intelligent security guard who can talk to you.

Here is how it works, using simple analogies:

1. The "Magic Prompt" (Language Guidance)

Instead of hard-coding rules, you can simply talk to the system.

  • You say: "Today, I'm worried about people running in the library."
  • The System: "Got it. I will now flag anyone running in the library as an anomaly."
  • Later, you say: "Actually, today I only care about people stealing."
  • The System: "Understood. I will ignore running and focus only on theft."

This allows the system to adapt instantly to new rules without needing to be retrained or reprogrammed. It treats the definition of "bad" as a variable that you can change on the fly.
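Under the hood, this kind of language guidance usually boils down to comparing embeddings: the prompt and each video frame are encoded into the same vector space, and frames that are similar to the prompt get high anomaly scores. Here is a minimal sketch of that idea, not the paper's actual implementation; the random vectors stand in for a real text/video encoder, and all names are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def score_frames(frame_embs, prompt_emb):
    """Anomaly score per frame = similarity to the prompt text."""
    return np.array([cosine(f, prompt_emb) for f in frame_embs])

# Toy stand-ins for a real text/video encoder (CLIP-style).
rng = np.random.default_rng(0)
dim = 16
prompt_emb = rng.normal(size=dim)      # embeds "people running in the library"
frames = rng.normal(size=(30, dim))    # 30 ordinary frames
frames[12] = prompt_emb + 0.1 * rng.normal(size=dim)  # one frame matches the prompt

scores = score_frames(frames, prompt_emb)
print(scores.argmax())  # frame 12 scores highest
```

Changing the rule is then just a matter of encoding a different sentence: no retraining, only a new `prompt_emb`.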

2. The "Giant Library" (The PreVAD Dataset)

To teach this guard to understand your changing rules, you need to show it a massive amount of examples. Existing datasets were like small, dusty libraries with only a few books on "crime" or "traffic."

The authors built PreVAD, which is like a massive, modern digital library containing over 35,000 videos.

  • Diversity: It has videos of car crashes, animal attacks, factory accidents, and daily mishaps.
  • Descriptions: Unlike old datasets that just said "Bad Video," this one has detailed stories for every video (e.g., "A forklift fell into a hole in the warehouse").
  • Why it matters: Because the guard has read so many different stories, it can understand the concept of an accident, not just memorize specific pictures. This helps it recognize new types of problems it has never seen before.

3. The "Training Drills" (Regularization Strategies)

Teaching a computer to understand both video and language is hard. It's like trying to teach a dog to understand both a hand signal and a spoken command at the same time. The dog might get confused and just guess.

To prevent this, the authors used two special training drills:

  • Drill A: The "Time-Travel" Simulator (Dynamic Video Synthesis)
    In real life, bad things usually happen for just a few seconds in a long video. But old training data often had videos where the "bad part" was the whole video.

    • The Fix: The system artificially stitches together video clips to create fake scenarios. It might take a 10-second clip of a crash and insert it into a 5-minute video of a calm street. This teaches the system to spot the "needle in the haystack" and understand that bad things can be short or long.
  • Drill B: The "Spot the Difference" Game (Contrastive Learning)
    Sometimes, a video looks almost normal but has a tiny flaw. The system needs to learn the difference between "almost good" and "actually bad."

    • The Fix: The system is shown pairs of videos and forced to compare them. It learns to say, "This video looks like a robbery, but this one is just a movie scene." It learns to ignore the "fake" bad things and focus on the real ones.
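Drill A can be sketched in a few lines: take a short anomaly clip, splice it into a long normal video at a random position, and record frame-level labels so the model learns that anomalies can be brief. This is a toy illustration of the idea, not the paper's synthesis pipeline; the frame lists and names are hypothetical.

```python
import random

def splice(normal_frames, anomaly_frames, seed=None):
    """Insert a short anomaly clip into a long normal video at a random
    position, returning the spliced frames and frame-level 0/1 labels."""
    rng = random.Random(seed)
    cut = rng.randrange(len(normal_frames) + 1)
    frames = normal_frames[:cut] + anomaly_frames + normal_frames[cut:]
    labels = [0] * cut + [1] * len(anomaly_frames) + [0] * (len(normal_frames) - cut)
    return frames, labels

normal = [f"street_{i}" for i in range(300)]  # calm street (toy frames)
crash = [f"crash_{i}" for i in range(10)]     # 10-frame crash clip (toy frames)
frames, labels = splice(normal, crash, seed=42)
print(sum(labels), len(frames))  # 10 anomalous frames inside a 310-frame video
```

Training on many such spliced videos forces the detector to localize the needle instead of assuming the whole haystack is bad.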
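Drill B is a standard contrastive objective: each video should match its own description more strongly than any other description in the batch. Below is a minimal InfoNCE-style sketch of that idea (not the paper's exact loss); the toy embeddings and the `info_nce` name are illustrative assumptions.

```python
import numpy as np

def info_nce(video_embs, text_embs, temperature=0.1):
    """Contrastive loss: low when each video is most similar to its own
    caption (the diagonal), high when pairs are mismatched."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # pairwise video-text similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # diagonal = matching pairs

rng = np.random.default_rng(1)
texts = rng.normal(size=(4, 8))                     # 4 toy caption embeddings
aligned = texts + 0.05 * rng.normal(size=(4, 8))    # videos close to own captions
shuffled = texts[[1, 2, 3, 0]]                      # videos paired with wrong captions
print(info_nce(aligned, texts) < info_nce(shuffled, texts))  # True
```

Minimizing this loss is the "spot the difference" game: the model is rewarded for telling a real robbery apart from a movie scene that merely looks like one.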

The Result: A Super-Adaptable Guard

When the authors tested this new guard on seven different real-world benchmarks (from crime scenes to traffic jams), it didn't just perform well; it set new state-of-the-art zero-shot results.

  • Old Systems: "I only know how to detect explosions. If you show me a fire, I'm confused."
  • LaGoVAD: "You told me to look for fire? I see it right there. You want me to look for running instead? Done."

Summary

This paper introduces a new way to watch videos where you (the human) get to decide what is "weird" by simply typing a sentence. By building a giant library of examples and training the AI with smart drills, they created a system that doesn't just memorize rules—it understands the concept of an anomaly and adapts to your needs instantly.

It's the difference between a stuck record that plays the same song forever and a Spotify DJ that can instantly switch genres based on what you ask for.