FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

This paper introduces FineVAU, a novel benchmark comprising the FineW³ dataset and the human-aligned FV-Score metric. It addresses the limitations of existing evaluation methods by enabling fine-grained, domain-specific assessment of how well Large Vision-Language Models describe anomalous video events, entities, and locations.

João Pereira, Vasco Lopes, João Neves, David Semedo

Published 2026-02-24

Imagine you are the head of security for a busy city. You have thousands of hours of CCTV footage, and your job is to spot anything weird happening—a fight, a theft, a fire.

For a long time, computers were good at just shouting, "Something is wrong!" but they couldn't tell you what was wrong, who was involved, or where it happened. They were like a guard who sees a shadow and screams "Intruder!" without looking closer.

Recently, we got super-smart AI cameras, powered by Large Vision-Language Models (LVLMs), that can actually describe what they see in sentences. But here's the problem: how do we grade them?

The Problem: The "Spelling Bee" vs. The "Detective"

The old way of testing these AI cameras was like a Spelling Bee.

  • If a human wrote, "A man stole a watch," and the AI wrote, "A guy took a timepiece," the old computer grading system would give the AI a bad grade because the words didn't match exactly.
  • If the AI wrote a beautiful, poetic sentence that was completely factually wrong (e.g., "A dragon flew over the bank"), the old system might give it a high score just because the grammar was perfect.

This is like hiring a detective who writes a beautiful report saying, "The butler did it with a candlestick," when the butler was actually asleep in another room. The report sounds great, but it's useless for catching the criminal.

The Solution: FineVAU

The authors of this paper built a new testing ground called FineVAU. Think of it as a Detective's Checklist instead of a spelling test.

They realized that to truly understand an anomaly (a weird event), you need to answer three specific questions, just like a human detective would:

  1. What happened? (The Event)
  2. Who was involved? (The People/Objects)
  3. Where did it happen? (The Location)

They call this the "What, Who, Where" framework.

The New Tool: FV-Score

To grade the AI, they created a new metric called FV-Score.

  • Old Way: "Did you use the word 'fight'?"
  • New Way (FV-Score): "Did you notice the two men fighting? Did you mention they were wearing red and blue shirts? Did you say it happened in the parking lot?"

If the AI misses the "Who" (the shirts) or the "Where" (the parking lot), it gets a lower score, even if the sentence is grammatically perfect. It forces the AI to be a fact-checker, not just a poet.
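The paper's actual FV-Score is more sophisticated than this (it is designed to align with human judgments), but the core idea of grading the "What," "Who," and "Where" separately can be sketched in a few lines of Python. The checklist facts and the simple keyword-matching check below are illustrative assumptions, not the authors' implementation:

```python
def toy_fv_score(description: str, checklist: dict) -> float:
    """Toy What/Who/Where checklist score.

    `checklist` maps each dimension ("what", "who", "where") to a list
    of key facts. A fact counts as covered if any of its keyword
    variants appears in the description.
    """
    text = description.lower()
    dimension_scores = []
    for dimension, facts in checklist.items():
        covered = sum(
            1 for variants in facts
            if any(word in text for word in variants)
        )
        dimension_scores.append(covered / len(facts))
    # Average the three dimensions, so missing the "where"
    # hurts the grade as much as missing the "what".
    return sum(dimension_scores) / len(dimension_scores)


checklist = {
    "what":  [("fight", "fighting", "brawl")],
    "who":   [("two men", "two guys"), ("red", "blue")],
    "where": [("parking lot", "car park")],
}

good = "Two men in red and blue shirts are fighting in the parking lot."
bad = "Two guys are having a friendly chat on the street."

print(toy_fv_score(good, checklist))  # covers every fact -> 1.0
print(toy_fv_score(bad, checklist))   # fluent but misses the anomaly -> low
```

Notice that the `bad` sentence is perfectly grammatical, yet it scores poorly because it fails the checklist, which is exactly the behavior the old word-matching metrics could not deliver.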

The New Dataset: FineW³

To test this, they needed a massive library of videos with perfect "Detective Checklists" attached to them. They built a dataset called FineW³.

  • They took existing videos and used a super-smart AI to break them down into tiny details.
  • Instead of just "A fight," the data now says: "Two men (one with a beard, one in a suit) are fighting near a fountain at night."
  • This is like turning a blurry photo into a high-definition 3D map where every detail is labeled.
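The post does not reproduce the exact annotation schema of FineW³, but the kind of "detective checklist" record attached to each clip can be imagined as a small structured object. The class and field names below are illustrative assumptions, not the dataset's real format:

```python
from dataclasses import dataclass


@dataclass
class AnomalyAnnotation:
    """Hypothetical sketch of a per-clip What/Who/Where record."""
    what: str        # the anomalous event
    who: list[str]   # entities involved, with visual details
    where: str       # location and setting of the event


clip = AnomalyAnnotation(
    what="two men fighting",
    who=["man with a beard", "man in a suit"],
    where="near a fountain at night",
)
print(clip.what, "|", ", ".join(clip.who), "|", clip.where)
```

Breaking one caption into separate fields like this is what lets a metric grade each dimension on its own instead of comparing whole sentences.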

What They Discovered (The Plot Twist)

When they tested the world's best AI cameras on this new, strict test, they found some surprising weaknesses:

  1. Good at Static Stuff: The AI is great at saying, "This is a street," or "There is a car." It's like a tourist who can describe the scenery perfectly.
  2. Bad at the Action: The AI struggles to describe the action. It often misses small, fast things. If someone steals a wallet in 2 seconds, the AI might miss it entirely.
  3. The "Normalcy Bias": This is the funniest and scariest part. The AI is so used to seeing normal things that it often hallucinates.
    • Real Life: Two guys are fighting.
    • AI Report: "Two guys are having a friendly chat."
    • The AI is so polite and biased toward "normal" that it ignores the violence! It's like a security guard who assumes everyone is just saying hello, even when someone is being punched.

The Bottom Line

This paper is a wake-up call. It says: "Stop grading AI on how well it writes; start grading it on how well it sees."

They built a new, stricter test (FineVAU) and a new dataset (FineW³) to prove that while our AI cameras are getting smarter at describing the background, they are still terrible at spotting the actual crime. To build a truly safe future, we need AI that doesn't just write pretty sentences, but actually notices the details that matter.
