Imagine you are taking a very difficult history test, but there's a catch: you are only allowed to look at 5 random pages of the textbook instead of reading the whole chapter.
If the answer to the question is on page 100, but you only saw pages 1, 5, 10, 15, and 20, you have no idea what the answer is.
- The Honest Student: Says, "I didn't read those pages, so I don't know the answer."
- The Guesser: Takes a wild guess. Maybe they guess right by luck, maybe they guess wrong.
The Problem:
Current AI tests (benchmarks) are like a strict teacher who punishes the Honest Student for saying "I don't know" and rewards the Guesser whenever a lucky guess happens to land. This tricks us into thinking the AI is smart when it's actually just gambling.
The Solution: VirtueBench
The authors of this paper created a new test called VirtueBench. Think of it as a "Character Test" for AI. Instead of just asking "Did you get the right answer?", it asks: "Did you know when you shouldn't have answered?"
Here is how they built it, using some fun analogies:
1. The "Zoom Lens" Experiment
Imagine you have a long movie.
- Old Way: You show the AI a few blurry snapshots (frames) from the movie and ask a question. If the AI guesses correctly, it gets points.
- VirtueBench Way: They take the same movie and show the AI different "zoom levels."
- Level 1: 64 snapshots (very blurry, missing the key scene).
- Level 2: 128 snapshots.
- Level 3: 1,024 snapshots (almost the whole movie).
They then create an "Answer Key" for every single level.
- If the key scene is missing in the 64-snapshot version, the "correct" answer is "I don't know."
- If the key scene is there in the 1,024-snapshot version, the "correct" answer is the actual fact.
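The per-level "Answer Key" idea can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the function name, uniform frame sampling, and the example values are all made up for clarity.

```python
def answer_key(total_frames: int, key_frame: int, level: int, true_answer: str) -> str:
    """Ground truth for one zoom level (illustrative only).

    Uniformly sample `level` snapshots from the video. If the frame
    containing the evidence is among them, the correct response is the
    fact itself; otherwise the correct response is a refusal.
    """
    step = total_frames / level
    sampled = {int(i * step) for i in range(level)}
    return true_answer if key_frame in sampled else "I don't know"

# A 1,024-frame video whose key scene sits at frame 504: the coarse
# 64-snapshot level misses it, the finer levels catch it.
for level in (64, 128, 1024):
    print(level, answer_key(1024, key_frame=504, level=level, true_answer="a red car"))
```

Note how the same question gets a different "correct" answer at each zoom level: the refusal itself becomes part of the answer key.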
2. The "Honesty vs. Bravery" Score
In this new test, the AI gets points for two things:
- Accuracy: Getting the answer right when it can see the evidence.
- Virtue (Refusal): Saying "I don't know" when the evidence is missing.
If an AI guesses the answer when it can't see the evidence, it gets a zero, even if it guesses the right word by accident. This stops the AI from "gambling" its way to a high score.
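The scoring rule described above can be sketched as a tiny function. This is a hedged sketch with hypothetical names; the paper's exact metric may be weighted or averaged differently.

```python
REFUSAL = "I don't know"

def score(prediction: str, truth: str, evidence_visible: bool) -> float:
    """Illustrative per-question score: credit accuracy only when the
    evidence is visible, and credit refusal only when it is not."""
    if evidence_visible:
        # Accuracy: right answer on an answerable question earns credit.
        return 1.0 if prediction == truth else 0.0
    # Virtue: only an explicit refusal earns credit here.
    # A lucky guess that happens to match the truth still scores zero.
    return 1.0 if prediction == REFUSAL else 0.0
```

The key design choice is the last line: when the evidence is missing, matching the ground-truth fact by accident earns nothing, which is exactly what removes the incentive to gamble.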
3. What They Found (The Plot Twist)
The researchers tested 25 different AI models (like Qwen, LLaVA, GPT-4o, etc.) and found some surprising things:
- The "Overconfident" AI: Many popular models are terrible at admitting they don't know. They are like students who will write something on the test even if they haven't studied, hoping to get lucky. Some of these models had a "refusal rate" of nearly 0% (they never said "I don't know").
- The "Honest" AI: A few models (like the newer Qwen and Gemini versions) were much better. They would look at the blurry snapshots, realize the info was missing, and politely say, "I can't answer this."
- The "Prompt" Trap: When the researchers removed the instruction telling the AI to "be honest," the good models suddenly got worse. It turns out, these AIs are naturally wired to be "people-pleasers" who want to give an answer, even a bad one, unless you explicitly tell them, "It's okay to say you don't know."
Why This Matters
Imagine you are a doctor using an AI to diagnose a patient.
- Old Benchmark: The AI guesses a disease. If it's right, you think the AI is a genius; if it's wrong, you shrug it off as a fluke. Either way, you never find out it was just guessing.
- VirtueBench: The AI sees the symptoms are unclear. Instead of guessing a disease and risking a wrong treatment, it says, "I need more tests."
The Bottom Line:
This paper argues that we need to stop praising AI for guessing and start praising it for knowing its limits. VirtueBench is the tool we need to build AI that is not just "smart," but also trustworthy. It's about moving from AI that "acts like it knows everything" to AI that "knows when it doesn't know."