INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

The paper introduces INFACT, a comprehensive diagnostic benchmark with 9,800 QA instances and fine-grained taxonomies that evaluates Video-LLMs on faithfulness and factuality under various induced degradation modes, revealing that high base accuracy does not guarantee robustness against hallucinations and that many models struggle significantly with temporal sensitivity.

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

Published 2026-03-13

Imagine you have a very smart, very confident robot friend who loves watching videos. You ask it, "What happened in this video?" and it answers with a story. Sometimes, the story is perfect. But often, the robot makes things up. It might say a character was wearing a red hat when they were actually wearing a blue one (it's lying about what it saw), or it might claim that a car can fly because it "knows" cars are cool, even though the video clearly shows the car crashing into a wall (it's lying about how the world works).

This paper introduces a new tool called INFACT to test exactly how good (or bad) these video-watching robots are at telling the truth.

Here is the breakdown of the paper using some everyday analogies:

1. The Two Types of "Lying"

The researchers realized robots make two different kinds of mistakes, and they need to be tested differently:

  • Faithfulness (The "Did you watch the video?" Test):

    • The Analogy: Imagine you show a robot a video of a cat chasing a mouse. If the robot says, "The cat was sleeping," it failed the Faithfulness test. It didn't pay attention to the actual evidence in front of its eyes.
    • The Goal: Does the robot stick to the facts shown in the video?
  • Factuality (The "Do you know how the world works?" Test):

    • The Analogy: Imagine the video shows a person trying to walk on water. If the robot says, "Yes, that's a normal way to travel," it failed the Factuality test. Even if the video looks like someone walking on water (maybe it's a special effect), the robot should know that humans can't walk on water based on real-world physics.
    • The Goal: Does the robot know the rules of reality, even when the video tries to trick it?
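To make the distinction concrete, here is a minimal sketch of how a QA instance could carry both kinds of ground truth so a wrong answer can be labeled by which truth it violates. The schema and function names (`QAInstance`, `classify_error`) are hypothetical illustrations, not the paper's actual data format:

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    FAITHFULNESS = "faithfulness"  # answer contradicts what the video shows
    FACTUALITY = "factuality"      # answer contradicts real-world knowledge

@dataclass
class QAInstance:
    question: str
    video_answer: str  # ground truth from the video's content
    world_answer: str  # ground truth from real-world knowledge

def classify_error(model_answer: str, item: QAInstance) -> list[ErrorType]:
    """Label an answer by which kind(s) of truth it violates (empty = no error)."""
    errors = []
    if model_answer != item.video_answer:
        errors.append(ErrorType.FAITHFULNESS)
    if model_answer != item.world_answer:
        errors.append(ErrorType.FACTUALITY)
    return errors

# A "walking on water" trick shot: faithful to the video, but not to physics.
trick = QAInstance("How does the person cross the lake?",
                   video_answer="walks on the water",    # special effect
                   world_answer="cannot walk on water")
print([e.name for e in classify_error("walks on the water", trick)])  # ['FACTUALITY']
```

The point of the two axes is that they are independent: an answer can be perfectly faithful to a doctored video yet factually impossible, or factually sensible yet unfaithful to what was actually shown.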

2. The "Stress Test" Gym

Most previous tests were like a calm walk in the park. The robots watched a clear, high-quality video and answered questions. But in the real world, videos are messy! They get blurry, have bad subtitles, or get edited.

INFACT is like a stress-test gym for these robots. It puts them through four specific drills:

  • Drill 1: The Clean Room (Base Mode): The robot watches a perfect video. This is just to see how smart it is normally.
  • Drill 2: The Foggy Window (Visual Degradation): The researchers add static, blur, or make the video look like it was recorded on a cheap phone.
    • The Test: Can the robot still see the cat chasing the mouse, or does it get confused by the "fog" and start guessing?
  • Drill 3: The Gaslighting (Evidence Corruption): This is the tricky one. The researchers add fake subtitles or misleading text on top of the video.
    • The Test: The video shows a door opening, but the fake subtitle says "The door is closing." Does the robot trust its eyes, or does it get tricked by the text? (Spoiler: Many robots get tricked easily).
  • Drill 4: The Time-Traveler (Temporal Intervention): The researchers take a video of someone making a sandwich (bread, then peanut butter, then jelly) and shuffle the frames so it looks like they put the jelly on first, then the peanut butter, then the bread.
    • The Test: Does the robot realize the order is wrong? Or does it just say, "Yep, that looks like a sandwich," ignoring the chaos?
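Two of these drills are easy to picture in code. The sketch below shows a frame shuffle (Drill 4) and additive noise as one simple form of visual degradation (Drill 2), treating a video as a NumPy array of frames. This is an illustrative approximation, assuming a `(T, H, W)` frame layout; the paper's actual degradation pipeline may differ:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def shuffle_frames(frames: np.ndarray) -> np.ndarray:
    """Drill 4 sketch: randomly permute the temporal order of the frames."""
    perm = rng.permutation(len(frames))
    return frames[perm]

def add_noise(frames: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Drill 2 sketch: Gaussian 'static' as a simple visual degradation."""
    noisy = frames.astype(np.float32) + rng.normal(0.0, sigma, frames.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Toy video: 8 frames of 4x4 grayscale.
video = np.arange(8 * 4 * 4, dtype=np.uint8).reshape(8, 4, 4)
print(shuffle_frames(video).shape)  # (8, 4, 4)
```

Note that shuffling keeps every individual frame intact, it only scrambles their order. That is exactly why it isolates temporal understanding: a model that scores the same on shuffled video was never using the order of events at all.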

3. The Results: "Smart" Doesn't Mean "Reliable"

The researchers tested 14 different video robots (including big names like GPT-5 and Gemini). Here is what they found:

  • The "Overconfident" Problem: Many robots are great at the "Clean Room" test. They get high scores. But as soon as you add fog or fake text, their performance crashes. It's like a student who memorizes the answers to a practice test but fails the real exam when the questions are slightly different.
  • The "Time-Blind" Problem: When the researchers shuffled the video frames, many robots didn't even notice! They kept giving the same answer as if nothing changed. This is called Temporal Inertia. It's like watching a movie played backward and saying, "That's a perfectly normal story," because the robot isn't actually tracking the sequence of events, just the general "vibe."
  • The "Fact" Gap: The robots were much worse at the "Factuality" test (knowing real-world rules) than the "Faithfulness" test (watching the video). They often hallucinate wild physics or wrong historical facts with 100% confidence.
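The Temporal Inertia finding suggests a simple diagnostic: how often does a model give the same answer before and after the frames are shuffled? Below is an illustrative metric for that idea; the name `temporal_inertia` and the exact formula are assumptions for explanation, not the paper's definition:

```python
def temporal_inertia(base_answers: list[str], shuffled_answers: list[str]) -> float:
    """Fraction of answers left unchanged after frame shuffling.

    A high value suggests the model is ignoring temporal order
    (answering from the general "vibe" rather than the event sequence).
    """
    assert len(base_answers) == len(shuffled_answers)
    same = sum(a == b for a, b in zip(base_answers, shuffled_answers))
    return same / len(base_answers)

# The sandwich example: after shuffling, the jelly appears to go on first.
base = ["bread, then peanut butter, then jelly", "the door opens"]
shuffled = ["bread, then peanut butter, then jelly", "the door opens"]
print(temporal_inertia(base, shuffled))  # 1.0 -- identical answers: time-blind
```

A temporally sensitive model should change its answers on shuffled input (low inertia on order-dependent questions), so a score near 1.0 on questions where the shuffle changed the correct answer is a red flag.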

4. Why This Matters

Think of these Video-LLMs as the future of medical diagnosis, security monitoring, or educational tools.

  • If a medical robot watches a surgery video and hallucinates that a tool was used when it wasn't, that's dangerous.
  • If a security robot watches a video of a person walking on water (a special effect) and reports it as a real event, that's a false alarm.

INFACT is the new "lie detector" that doesn't just ask, "Are you smart?" but rather, "Are you reliable when things get messy?"

The Bottom Line

The paper concludes that just because a robot gets an "A" on a clean test doesn't mean it can be trusted in the real world. We need to build robots that don't just memorize patterns but actually understand what they are seeing and know how the world works, even when the video is blurry, confusing, or playing tricks on them.