Imagine you are a film director. In the past, video AI was like a toddler with a camera: it could capture a single, cute moment (like a cat jumping), but if you asked it to tell a story with a beginning, middle, and end, it would get confused, forget the plot, or just show you the same scene over and over.
Recently, these AIs have grown up enough to make "long videos." But here's the problem: how do we know whether they are actually telling a good story or just producing a long, boring loop?
The paper "NarrLV" introduces a new way to grade these AI directors. Instead of just checking if the video looks pretty, it checks if the AI can actually narrate a story.
Here is the breakdown of their new system, explained with some everyday analogies:
1. The Problem: The "One-Note" Test
Currently, most tests for video AI are like asking a musician to play a single note.
- Old Benchmarks: They ask the AI, "Show me a person riding a bike." The AI does it, and the test says, "Good job!"
- The Issue: This is too easy. A real story needs more. It needs the person to ride the bike, then fall off, then get up, then call a mechanic. Old tests can't measure if the AI can handle that chain of events. They are like judging a novel by only reading the first sentence.
2. The Solution: The "Story Atom" (TNA)
The authors invented a new unit of measurement called a Temporal Narrative Atom (TNA).
- The Metaphor: Think of a TNA as a single "beat" in a song or a single "brick" in a wall.
- Beat 1: The sun is shining.
- Beat 2: The sun sets.
- Beat 3: The moon rises.
- If a video has 3 beats, it has 3 TNAs. The more TNAs a video has, the richer and more complex the story is.
- The Innovation: NarrLV is the first test that can grade stories with many beats (six or more), whereas older benchmarks were stuck at just one or two.
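The beat-counting idea can be sketched as a tiny data structure. This is purely illustrative (the class and field names are made up; the paper derives TNAs with language models, not by hand):

```python
from dataclasses import dataclass


@dataclass
class StoryPrompt:
    """A prompt decomposed into Temporal Narrative Atoms (TNAs)."""
    beats: list[str]  # each beat is one atomic narrative state

    @property
    def tna_count(self) -> int:
        # More beats means more temporal change the video must show.
        return len(self.beats)


day_to_night = StoryPrompt(beats=[
    "The sun is shining.",
    "The sun sets.",
    "The moon rises.",
])
print(day_to_night.tna_count)  # prints 3
```

The point of the unit is comparability: a prompt with six beats demands strictly more narrative work from the model than one with two, so benchmarks can dial difficulty up or down by the TNA count alone.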
3. How They Build the Test: The "Recipe Generator"
To test the AI, they needed thousands of different story prompts. Writing them by hand would take forever.
- The Analogy: Imagine a master chef (an AI) who has a giant pantry of ingredients (scenes, objects, actions).
- The Process: The researchers built a "Recipe Generator." They tell the generator: "Make me a story about a cat (object) in a kitchen (scene) that involves 3 changes (TNAs)."
- The generator automatically creates a prompt like: "A cat sits on a counter. Then, it knocks over a cup. Finally, it runs away."
- They can easily ask for stories with 1 change, 5 changes, or even 10 changes, creating a massive, flexible test suite.
4. How They Grade the AI: The "Three-Step Detective"
Once the AI generates a video based on the prompt, how do they grade it? They don't just look at the picture; they use a "Detective AI" (a Multimodal Large Language Model) to ask three specific questions, moving from simple to complex:
- Step 1: The Inventory Check (Fidelity)
- Question: "Did the video actually show the cat, the cup, and the kitchen?"
- Analogy: Did the chef use the ingredients you asked for? If you asked for a burger and got a salad, you fail.
- Step 2: The Plot Check (Coverage)
- Question: "Did the video show the cat knocking the cup and running away?"
- Analogy: Did the chef cook the whole meal, or did they stop halfway? If the prompt had 3 steps but the video only showed 1, the story is incomplete.
- Step 3: The Flow Check (Coherence)
- Question: "Did the cat knock the cup before running away, or did it run away first?"
- Analogy: Is the story logical? If the video shows the cat running away before it knocks the cup, the timeline is broken. The story makes no sense.
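Assuming the detective MLLM's answers have already been distilled into sets of detected elements and an observed event order, the three checks could be scored roughly like this (an illustrative sketch, not NarrLV's actual metric code):

```python
import bisect


def fidelity(required: set[str], detected: set[str]) -> float:
    """Inventory check: fraction of required elements that appear."""
    return len(required & detected) / len(required)


def coverage(events: list[str], shown: set[str]) -> float:
    """Plot check: fraction of the prompt's events the video depicts."""
    return sum(e in shown for e in events) / len(events)


def coherence(events: list[str], observed: list[str]) -> float:
    """Flow check: reward events shown in the scripted order, via the
    longest prompt-ordered subsequence of the observed events."""
    idx = {e: i for i, e in enumerate(events)}
    seq = [idx[e] for e in observed if e in idx]
    tails = []  # patience-sorting tails for longest increasing subsequence
    for x in seq:
        pos = bisect.bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails) / len(events)


events = ["cat sits on counter", "cat knocks over cup", "cat runs away"]
# The video showed every beat, but ran the last two out of order:
observed = ["cat sits on counter", "cat runs away", "cat knocks over cup"]
print(fidelity({"cat", "cup", "kitchen"}, {"cat", "cup", "kitchen"}))  # prints 1.0
print(coverage(events, set(observed)))                                 # prints 1.0
print(round(coherence(events, observed), 2))                           # prints 0.67
```

The example shows why all three questions are needed: a video can pass the inventory and plot checks perfectly yet still fail the flow check, because showing every beat is not the same as showing them in a sensible order.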
5. What They Found: The "Storytelling Ceiling"
They tested many popular video AIs (like Wan, Hunyuan, and others) using this new system. Here is what they discovered:
- The "Short-Story" Expert: Most AIs are great at the "Inventory Check." They can easily generate a picture of a cat in a kitchen.
- The "Long-Story" Struggle: As the stories got longer (more TNAs), the AIs started to fail the "Plot" and "Flow" checks. They would forget the middle of the story or mix up the order of events.
- The "Foundation" Limit: The authors found that a long-video AI is only as good as the "base" AI it was built on. If the base AI can't tell a 3-step story, adding "long video" features won't magically fix it. It's like trying to build a skyscraper on a shaky foundation; no matter how tall you build it, it will wobble.
The Big Takeaway
NarrLV is like a new, stricter film critic. It stops giving passing grades just because the video looks nice. Instead, it asks: "Did you tell the whole story? Did the events happen in the right order? Did you remember the ending?"
This paper tells us that while AI video generation is getting better at making long videos, it still struggles to be a true storyteller. It can paint a picture, but it's still learning how to write a novel.