Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

The paper introduces VideoHV-Agent, a multi-agent framework for long video understanding that replaces reactive retrieval with a structured "think-then-verify" process: hypotheses are formulated, clues are derived, and evidence is grounded before a final answer is generated. The authors report state-of-the-art accuracy with better interpretability and lower computational cost.

Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai

Published 2026-03-06

Imagine you are trying to solve a mystery in a 3-hour movie, but you only have 5 minutes to figure out the answer.

Most current AI "detectives" try to solve this by frantically scanning the entire movie, looking for anything that might be related to the question. They shout, "I saw a red car! Maybe that's the clue!" or "I heard a dog bark! That must be it!" This approach is like searching a haystack for a needle by grabbing random handfuls of hay. It's slow, confusing, and often leads to wrong answers because the AI gets distracted by irrelevant details.

VideoHV-Agent is a new, smarter detective. Instead of frantically searching, it follows a simple rule: "Think, then verify."

Here is how it works, broken down into a simple story:

1. The Setup: The "Clue Board"

Imagine the AI has a whiteboard. First, it quickly skims the movie and writes a short summary on the board. It doesn't read every single word of the script; it just gets the "gist" of the story.

2. The Four Specialists (The Agents)

Instead of one confused detective, this system uses a team of four specialists who work together:

  • The Thinker (The Strategist):
    The Thinker looks at the question and the possible answers. Instead of guessing, it asks: "If Answer A is true, what must happen in the movie? If Answer B is true, what must happen?"

    • Analogy: It's like a lawyer building a case. "If the suspect is innocent, he must have been at the park. If he is guilty, he must have been at the bank." The Thinker turns vague answers into specific, testable predictions (Hypotheses).
  • The Judge (The Filter):
    The Judge looks at all the Thinker's predictions and says, "Okay, we have too many ideas. What is the one single thing we need to look for to tell them apart?"

    • Analogy: If the Thinker says "Look for a red car" and "Look for a blue car," the Judge says, "No, just look for which car is moving." It creates a sharp, focused "Clue" so the team doesn't waste time looking at the wrong things.
  • The Verifier (The Investigator):
    This is the only one who actually goes back to the movie. But it doesn't watch the whole thing! It watches only the short clip where the "Clue" is likely to appear, and checks that footage closely.

    • Analogy: Instead of searching the whole city, the Verifier goes straight to the specific street corner the Judge pointed to. It checks: "Is the car moving? Yes. Is it red? No." It gathers hard evidence.
  • The Answer Agent (The Verdict):
    Once the Verifier brings back the evidence, the Answer Agent combines it with the summary and the original question to declare the final winner. It doesn't guess; it decides based on proof.
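
The hand-off between the four specialists can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the real agents would be language and vision models, while here every function name (`thinker`, `judge`, `verifier`, `answer_agent`), the `Segment` type, and the word-matching logic are assumptions made purely to show the order of operations.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """A hypothetical captioned slice of the video."""
    start_s: int
    end_s: int
    caption: str


def thinker(options):
    """Turn each answer option into a testable prediction (here: a set of words)."""
    return {label: set(text.lower().split()) for label, text in options.items()}


def judge(hypotheses):
    """Keep only words unique to one hypothesis; shared words cannot tell options apart."""
    counts = Counter(w for words in hypotheses.values() for w in words)
    return {label: {w for w in words if counts[w] == 1}
            for label, words in hypotheses.items()}


def verifier(clues, segments):
    """Collect the segments whose caption mentions a discriminating word."""
    evidence = {label: [] for label in clues}
    for seg in segments:
        words = set(seg.caption.lower().split())
        for label, clue in clues.items():
            if clue & words:
                evidence[label].append(seg)
    return evidence


def answer_agent(evidence):
    """Pick the option with the most supporting segments; return None if undecided."""
    ranked = sorted(evidence, key=lambda label: len(evidence[label]), reverse=True)
    best = ranked[0]
    if not evidence[best]:
        return None  # no evidence at all
    if len(ranked) > 1 and len(evidence[ranked[1]]) == len(evidence[best]):
        return None  # tie between the top two options
    return best


options = {"A": "a red car drives away", "B": "a blue car drives away"}
segments = [
    Segment(0, 10, "people talk in a kitchen"),
    Segment(10, 20, "a blue car drives away"),
]
clues = judge(thinker(options))       # only "red" and "blue" survive as clues
verdict = answer_agent(verifier(clues, segments))
```

In this toy run the Judge discards every word the two options share, so the Verifier only has to look for "red" or "blue", and the Answer Agent decides from what was actually found.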

3. The "Self-Correction" Loop

What if the Verifier looks at the clip and says, "I can't see enough to be sure"?
Old systems would just guess and hope for the best. VideoHV-Agent says, "Okay, my hypothesis was too vague. Let's try again." It refines the hypothesis, looks at a different part of the movie, and repeats the check until the evidence is strong enough to decide.
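
A minimal sketch of such a retry loop, under stated assumptions: `self_correcting_verify`, `toy_verify`, and `toy_refine` are hypothetical names, a verdict of `None` stands for "I can't see enough to be sure", and the `max_rounds` cap is an assumption added here so the loop always terminates.

```python
def self_correcting_verify(clue, verify, refine, max_rounds=3):
    """Check a clue; if the result is inconclusive, refine the clue and retry.

    Returns (verdict, clues_tried). The verdict stays None if no round was conclusive.
    """
    tried = []
    for _ in range(max_rounds):
        tried.append(clue)
        verdict = verify(clue)
        if verdict is not None:   # conclusive evidence: stop early
            return verdict, tried
        clue = refine(clue)       # too vague: sharpen the clue and look again
    return None, tried


def toy_verify(clue):
    """Stand-in for the Verifier: only one very specific clue can be confirmed."""
    known = {"is the car moving at night": True}
    return known.get(clue)        # None means "cannot tell"


def toy_refine(clue):
    """Stand-in for the refinement step: make the clue more specific."""
    return clue + " at night"


verdict, tried = self_correcting_verify("is the car moving", toy_verify, toy_refine)
```

Here the first clue is inconclusive, the refined clue is confirmed on the second round, and `tried` records both attempts so the reasoning trail stays inspectable.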

Why is this better?

  • Fewer Hallucinations: Because it demands proof before deciding, it is much less likely to make up facts.
  • Speed: It doesn't waste time watching the whole movie. It watches only the short clips that matter.
  • Logic: It follows a clear chain of reasoning (If X, then Y. Let's check Y.) rather than just guessing based on what words sound similar.

In short: VideoHV-Agent stops the AI from being a frantic scavenger and turns it into a methodical detective who plans the investigation, identifies the critical clue, checks the evidence, and only then delivers the verdict.