FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

This paper introduces FineVAU, a novel benchmark comprising the FineW³ dataset and the human-aligned FV-Score metric. It addresses the limitations of existing evaluation methods by enabling fine-grained, domain-specific assessment of how well Large Vision-Language Models describe anomalous video events, entities, and locations.

João Pereira, Vasco Lopes, João Neves, David Semedo

Published 2026-02-24

Imagine you are the head of security for a busy city. You have thousands of hours of CCTV footage, and your job is to spot anything weird happening—a fight, a theft, a fire.

For a long time, computers were good at just shouting, "Something is wrong!" but they couldn't tell you what was wrong, who was involved, or where it happened. They were like a guard who sees a shadow and screams "Intruder!" without looking closer.

Recently, we got super-smart AI cameras, powered by Large Vision-Language Models (LVLMs), that can actually describe what they see in sentences. But here's the problem: how do we grade them?

The Problem: The "Spelling Bee" vs. The "Detective"

The old way of testing these AI cameras was like a Spelling Bee.

  • If a human wrote, "A man stole a watch," and the AI wrote, "A guy took a timepiece," the old computer grading system would give the AI a bad grade because the words didn't match exactly.
  • If the AI wrote a beautiful, poetic sentence that was completely factually wrong (e.g., "A dragon flew over the bank"), the old system might give it a high score just because the grammar was perfect.

This is like hiring a detective who writes a beautiful report saying, "The butler did it with a candlestick," when the butler was actually asleep in another room. The report sounds great, but it's useless for catching the criminal.

The Solution: FineVAU

The authors of this paper built a new testing ground called FineVAU. Think of it as a Detective's Checklist instead of a spelling test.

They realized that to truly understand an anomaly (a weird event), you need to answer three specific questions, just like a human detective would:

  1. What happened? (The Event)
  2. Who was involved? (The People/Objects)
  3. Where did it happen? (The Location)

They call this the "What, Who, Where" framework.

The New Tool: FV-Score

To grade the AI, they created a new metric called FV-Score.

  • Old Way: "Did you use the word 'fight'?"
  • New Way (FV-Score): "Did you notice the two men fighting? Did you mention they were wearing red and blue shirts? Did you say it happened in the parking lot?"

If the AI misses the "Who" (the shirts) or the "Where" (the parking lot), it gets a lower score, even if the sentence is grammatically perfect. It forces the AI to be a fact-checker, not just a poet.
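The paper's actual FV-Score is more sophisticated than this (it is designed to align with human judgments), but the core idea of grading the "What," "Who," and "Where" separately can be sketched in a few lines of Python. The checklist facts and the simple keyword-matching check below are illustrative assumptions, not the authors' implementation:

```python
def toy_fv_score(description: str, checklist: dict) -> float:
    """Toy What/Who/Where checklist score.

    `checklist` maps each dimension ("what", "who", "where") to a list
    of key facts. A fact counts as covered if any of its keyword
    variants appears in the description.
    """
    text = description.lower()
    dimension_scores = []
    for dimension, facts in checklist.items():
        covered = sum(
            1 for variants in facts
            if any(word in text for word in variants)
        )
        dimension_scores.append(covered / len(facts))
    # Average the three dimensions, so missing the "where"
    # hurts the grade as much as missing the "what".
    return sum(dimension_scores) / len(dimension_scores)


checklist = {
    "what":  [("fight", "fighting", "brawl")],
    "who":   [("two men", "two guys"), ("red", "blue")],
    "where": [("parking lot", "car park")],
}

good = "Two men in red and blue shirts are fighting in the parking lot."
bad = "Two guys are having a friendly chat on the street."

print(toy_fv_score(good, checklist))  # covers every fact -> 1.0
print(toy_fv_score(bad, checklist))   # fluent but misses the anomaly -> low
```

Notice that the `bad` sentence is perfectly grammatical, yet it scores poorly because it fails the checklist, which is exactly the behavior the old word-matching metrics could not deliver.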

The New Dataset: FineW³

To test this, they needed a massive library of videos with perfect "Detective Checklists" attached to them. They built a dataset called FineW³.

  • They took existing videos and used a super-smart AI to break them down into tiny details.
  • Instead of just "A fight," the data now says: "Two men (one with a beard, one in a suit) are fighting near a fountain at night."
  • This is like turning a blurry photo into a high-definition 3D map where every detail is labeled.
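The post does not reproduce the exact annotation schema of FineW³, but the kind of "detective checklist" record attached to each clip can be imagined as a small structured object. The class and field names below are illustrative assumptions, not the dataset's real format:

```python
from dataclasses import dataclass


@dataclass
class AnomalyAnnotation:
    """Hypothetical sketch of a per-clip What/Who/Where record."""
    what: str        # the anomalous event
    who: list[str]   # entities involved, with visual details
    where: str       # location and setting of the event


clip = AnomalyAnnotation(
    what="two men fighting",
    who=["man with a beard", "man in a suit"],
    where="near a fountain at night",
)
print(clip.what, "|", ", ".join(clip.who), "|", clip.where)
```

Breaking one caption into separate fields like this is what lets a metric grade each dimension on its own instead of comparing whole sentences.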

What They Discovered (The Plot Twist)

When they tested the world's best AI cameras on this new, strict test, they found some surprising weaknesses:

  1. Good at Static Stuff: The AI is great at saying, "This is a street," or "There is a car." It's like a tourist who can describe the scenery perfectly.
  2. Bad at the Action: The AI struggles to describe the action. It often misses small, fast things. If someone steals a wallet in 2 seconds, the AI might miss it entirely.
  3. The "Normalcy Bias": This is the funniest and scariest part. The AI is so used to seeing normal things that it often hallucinates.
    • Real Life: Two guys are fighting.
    • AI Report: "Two guys are having a friendly chat."
    • The AI is so polite and biased toward "normal" that it ignores the violence! It's like a security guard who assumes everyone is just saying hello, even when someone is being punched.

The Bottom Line

This paper is a wake-up call. It says: "Stop grading AI on how well it writes; start grading it on how well it sees."

They built a new, stricter test (FineVAU) and a new dataset (FineW³) to prove that while our AI cameras are getting smarter at describing the background, they are still terrible at spotting the actual crime. To build a truly safe future, we need AI that doesn't just write pretty sentences, but actually notices the details that matter.
