INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

The paper introduces INFACT, a comprehensive diagnostic benchmark with 9,800 QA instances and fine-grained taxonomies that evaluates Video-LLMs on faithfulness and factuality under various induced degradation modes, revealing that high base accuracy does not guarantee robustness against hallucinations and that many models struggle significantly with temporal sensitivity.

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

Published 2026-03-13

Imagine you have a very smart, very confident robot friend who loves watching videos. You ask it, "What happened in this video?" and it answers with a story. Sometimes, the story is perfect. But often, the robot makes things up. It might say a character was wearing a red hat when they were actually wearing a blue one (it's lying about what it saw), or it might claim that a car can fly because it "knows" cars are cool, even though the video clearly shows the car crashing into a wall (it's lying about how the world works).

This paper introduces a new tool called INFACT to test exactly how good (or bad) these video-watching robots are at telling the truth.

Here is the breakdown of the paper using some everyday analogies:

1. The Two Types of "Lying"

The researchers realized robots make two different kinds of mistakes, and they need to be tested differently:

  • Faithfulness (The "Did you watch the video?" Test):

    • The Analogy: Imagine you show a robot a video of a cat chasing a mouse. If the robot says, "The cat was sleeping," it failed the Faithfulness test. It didn't pay attention to the actual evidence in front of its eyes.
    • The Goal: Does the robot stick to the facts shown in the video?
  • Factuality (The "Do you know how the world works?" Test):

    • The Analogy: Imagine the video shows a person trying to walk on water. If the robot says, "Yes, that's a normal way to travel," it failed the Factuality test. Even if the video looks like someone walking on water (maybe it's a special effect), the robot should know that humans can't walk on water based on real-world physics.
    • The Goal: Does the robot know the rules of reality, even when the video tries to trick it?
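To make the distinction concrete, here is a minimal sketch of how a QA instance could carry both kinds of ground truth so a wrong answer can be labeled by which truth it violates. The schema and function names (`QAInstance`, `classify_error`) are hypothetical illustrations, not the paper's actual data format:

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    FAITHFULNESS = "faithfulness"  # answer contradicts what the video shows
    FACTUALITY = "factuality"      # answer contradicts real-world knowledge

@dataclass
class QAInstance:
    question: str
    video_answer: str  # ground truth from the video's content
    world_answer: str  # ground truth from real-world knowledge

def classify_error(model_answer: str, item: QAInstance) -> list[ErrorType]:
    """Label an answer by which kind(s) of truth it violates (empty = no error)."""
    errors = []
    if model_answer != item.video_answer:
        errors.append(ErrorType.FAITHFULNESS)
    if model_answer != item.world_answer:
        errors.append(ErrorType.FACTUALITY)
    return errors

# A "walking on water" trick shot: faithful to the video, but not to physics.
trick = QAInstance("How does the person cross the lake?",
                   video_answer="walks on the water",    # special effect
                   world_answer="cannot walk on water")
print([e.name for e in classify_error("walks on the water", trick)])  # ['FACTUALITY']
```

The point of the two axes is that they are independent: an answer can be perfectly faithful to a doctored video yet factually impossible, or factually sensible yet unfaithful to what was actually shown.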

2. The "Stress Test" Gym

Most previous tests were like a calm walk in the park. The robots watched a clear, high-quality video and answered questions. But in the real world, videos are messy! They get blurry, have bad subtitles, or get edited.

INFACT is like a stress-test gym for these robots. It puts them through four specific drills:

  • Drill 1: The Clean Room (Base Mode): The robot watches a perfect video. This is just to see how smart it is normally.
  • Drill 2: The Foggy Window (Visual Degradation): The researchers add static, blur, or make the video look like it was recorded on a cheap phone.
    • The Test: Can the robot still see the cat chasing the mouse, or does it get confused by the "fog" and start guessing?
  • Drill 3: The Gaslighting (Evidence Corruption): This is the tricky one. The researchers add fake subtitles or misleading text on top of the video.
    • The Test: The video shows a door opening, but the fake subtitle says "The door is closing." Does the robot trust its eyes, or does it get tricked by the text? (Spoiler: Many robots get tricked easily).
  • Drill 4: The Time-Traveler (Temporal Intervention): The researchers take a video of someone making a sandwich (bread, then peanut butter, then jelly) and shuffle the frames so it looks like they put the jelly on first, then the peanut butter, then the bread.
    • The Test: Does the robot realize the order is wrong? Or does it just say, "Yep, that looks like a sandwich," ignoring the chaos?
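Two of these drills are easy to picture in code. The sketch below shows a frame shuffle (Drill 4) and additive noise as one simple form of visual degradation (Drill 2), treating a video as a NumPy array of frames. This is an illustrative approximation, assuming a `(T, H, W)` frame layout; the paper's actual degradation pipeline may differ:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def shuffle_frames(frames: np.ndarray) -> np.ndarray:
    """Drill 4 sketch: randomly permute the temporal order of the frames."""
    perm = rng.permutation(len(frames))
    return frames[perm]

def add_noise(frames: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Drill 2 sketch: Gaussian 'static' as a simple visual degradation."""
    noisy = frames.astype(np.float32) + rng.normal(0.0, sigma, frames.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Toy video: 8 frames of 4x4 grayscale.
video = np.arange(8 * 4 * 4, dtype=np.uint8).reshape(8, 4, 4)
print(shuffle_frames(video).shape)  # (8, 4, 4)
```

Note that shuffling keeps every individual frame intact, it only scrambles their order. That is exactly why it isolates temporal understanding: a model that scores the same on shuffled video was never using the order of events at all.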

3. The Results: "Smart" Doesn't Mean "Reliable"

The researchers tested 14 different video robots (including big names like GPT-5 and Gemini). Here is what they found:

  • The "Overconfident" Problem: Many robots are great at the "Clean Room" test. They get high scores. But as soon as you add fog or fake text, their performance crashes. It's like a student who memorizes the answers to a practice test but fails the real exam when the questions are slightly different.
  • The "Time-Blind" Problem: When the researchers shuffled the video frames, many robots didn't even notice! They kept giving the same answer as if nothing changed. This is called Temporal Inertia. It's like watching a movie played backward and saying, "That's a perfectly normal story," because the robot isn't actually tracking the sequence of events, just the general "vibe."
  • The "Fact" Gap: The robots were much worse at the "Factuality" test (knowing real-world rules) than the "Faithfulness" test (watching the video). They often hallucinate wild physics or wrong historical facts with 100% confidence.
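The Temporal Inertia finding suggests a simple diagnostic: how often does a model give the same answer before and after the frames are shuffled? Below is an illustrative metric for that idea; the name `temporal_inertia` and the exact formula are assumptions for explanation, not the paper's definition:

```python
def temporal_inertia(base_answers: list[str], shuffled_answers: list[str]) -> float:
    """Fraction of answers left unchanged after frame shuffling.

    A high value suggests the model is ignoring temporal order
    (answering from the general "vibe" rather than the event sequence).
    """
    assert len(base_answers) == len(shuffled_answers)
    same = sum(a == b for a, b in zip(base_answers, shuffled_answers))
    return same / len(base_answers)

# The sandwich example: after shuffling, the jelly appears to go on first.
base = ["bread, then peanut butter, then jelly", "the door opens"]
shuffled = ["bread, then peanut butter, then jelly", "the door opens"]
print(temporal_inertia(base, shuffled))  # 1.0 -- identical answers: time-blind
```

A temporally sensitive model should change its answers on shuffled input (low inertia on order-dependent questions), so a score near 1.0 on questions where the shuffle changed the correct answer is a red flag.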

4. Why This Matters

Think of these Video-LLMs as the future of medical diagnosis, security monitoring, or educational tools.

  • If a medical robot watches a surgery video and hallucinates that a tool was used when it wasn't, that's dangerous.
  • If a security robot watches a video of a person walking on water (a special effect) and reports it as a real event, that's a false alarm.

INFACT is the new "lie detector" that doesn't just ask, "Are you smart?" but rather, "Are you reliable when things get messy?"

The Bottom Line

The paper concludes that just because a robot gets an "A" on a clean test doesn't mean it can be trusted in the real world. We need to build robots that don't just memorize patterns but actually understand what they are seeing and know how the world works, even when the video is blurry, confusing, or playing tricks on them.