DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

This paper introduces DeepFact, a framework that addresses the brittleness of static factuality benchmarks for deep research reports. Its Evolving Benchmarking via Audit-then-Score (AtS) methodology significantly improves expert verification accuracy and enables a high-performing document-level verification agent.

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

Published 2026-03-09

🕵️‍♂️ The Problem: The "Super-Researcher" Who Might Be Lying

Imagine you have a brilliant, tireless research assistant (an AI agent) who can read thousands of scientific papers in seconds and write a 50-page report on a complex topic like "Climate Change Solutions" or "New Cancer Treatments." This is what Deep Research Agents do today. They are amazing, but they have a fatal flaw: they sometimes make things up.

They might mix up two different studies, invent a fake statistic, or cite a paper that doesn't exist.

The big question is: How do we catch these lies?

Usually, we hire human experts (PhD scientists) to read the AI's report and check if the facts are true. But the authors of this paper discovered a shocking truth: Even the experts are bad at this job.

🧠 The "Expert Fatigue" Experiment

The researchers hired PhD students to act as fact-checkers. They gave them a hidden "test" of claims they knew were true or false.

  • The Result: The experts only got 60% of the test questions right.
  • Why? Reading a 50-page report is like trying to find a needle in a haystack while running a marathon. The experts got tired, missed details, or got confused by the sheer volume of information.

If the "Gold Standard" (the experts) is only 60% accurate, how can we trust the AI? We can't. We need a better way to grade both the AI and the humans.


🔄 The Solution: The "Living" Report Card

Instead of treating the "correct answer" as a static, unchangeable fact (like a math problem), the authors propose Evolving Benchmarking.

Think of the "Truth" not as a stone statue, but as a living, breathing Wikipedia page that gets updated every time someone finds a better piece of evidence.

They call this system Audit-then-Score (AtS). Here is how it works, using a courtroom analogy:

⚖️ The Courtroom of Truth

  1. The Judge (The Benchmark): The current "Truth" is the Judge's current ruling.
  2. The Prosecutor (The AI Agent): The AI tries to prove a claim is true or false. It presents its evidence.
  3. The Conflict: If the AI disagrees with the Judge's current ruling, it doesn't just get marked wrong. It gets to appeal.
  4. The Appeal (The Audit): The AI says, "Wait! I found a new paper that proves the Judge is wrong!"
  5. The Jury (The Human Expert): A human expert steps in. They don't just guess; they look at the AI's new evidence.
    • If the AI's evidence is weak, the Judge's ruling stands.
    • If the AI's evidence is stronger than the old one, the Judge changes their mind. The "Truth" is updated.
  6. The Score: Now that the Truth has been updated, the AI is graded against the new Truth.

The Magic: By doing this over and over, the "Truth" gets better and better. The experts stop being tired labelers and become smart auditors who only check the hard parts. The AI gets smarter because it learns from the corrections.
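The courtroom steps above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the names (`Claim`, `expert_review`, `audit_then_score`) and the length-based "stronger evidence" heuristic are all hypothetical stand-ins; in practice the jury step is a human auditor comparing both sides.

```python
# Hypothetical sketch of the Audit-then-Score (AtS) loop.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    label: bool      # the Judge's current ruling (the benchmark label)
    evidence: str    # evidence backing the current ruling

def expert_review(old_evidence: str, new_evidence: str) -> bool:
    """The Jury: decide whether the appeal's evidence is stronger.
    Stubbed with a placeholder heuristic; really a human judgment."""
    return len(new_evidence) > len(old_evidence)

def audit_then_score(claim: Claim, agent_label: bool, agent_evidence: str) -> bool:
    """Score the agent's verdict, letting disagreements trigger an appeal."""
    if agent_label == claim.label:
        return True  # agreement: no audit needed, the agent scores
    # Disagreement: the agent appeals with its own evidence.
    if expert_review(claim.evidence, agent_evidence):
        # The appeal succeeds: the benchmark itself is updated.
        claim.label = agent_label
        claim.evidence = agent_evidence
        return True
    return False  # the old ruling stands; the agent is marked wrong
```

Note the key design choice: a "wrong" answer from the agent is never final on its own. It either loses an audit (and the old truth is confirmed) or wins one (and the truth improves), which is exactly why the benchmark gets better over time.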


🛠️ The Tools: DeepFact-Bench and DeepFact-Eval

The paper introduces two main tools built on this idea:

1. DeepFact-Bench (The Evolving Playground)

This is the dataset where the testing happens.

  • Old Way: A static test where the answers are set in stone. If the test is flawed, everyone fails.
  • DeepFact Way: A dynamic test where the answers are revisable. If an AI finds a mistake in the test itself, the test gets fixed! It's like a video game that updates its own rules to be fairer as players get better.

2. DeepFact-Eval (The Super-Inspector)

This is a new AI agent designed specifically to fact-check.

  • How it works: Instead of just skimming a sentence (like a human might when tired), DeepFact-Eval acts like a detective.
    • It breaks a claim into tiny pieces.
    • It searches the entire internet for evidence.
    • It reads full documents, not just snippets.
    • It checks if the evidence actually supports the claim or if it's just "vaguely related."
  • The Result: It is much better at finding the truth than previous tools, and it works fast enough to be practical.
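The detective routine above can be sketched as a verification pipeline. This is only an illustrative outline under assumed interfaces: `decompose`, and the injected `search`, `fetch`, and `entails` callables, are hypothetical stand-ins for the agent's actual tools (claim decomposition, web search, full-document retrieval, and entailment checking).

```python
# Hypothetical sketch of a decompose-search-read-verify pipeline.

def decompose(claim: str) -> list[str]:
    """Break a compound claim into atomic sub-claims (naive split here)."""
    return [part.strip() for part in claim.split(" and ") if part.strip()]

def verify_claim(claim: str, search, fetch, entails) -> bool:
    """A claim holds only if EVERY sub-claim is supported by some document."""
    for sub in decompose(claim):
        supported = False
        for url in search(sub):          # search for candidate sources
            document = fetch(url)        # read the full document, not a snippet
            if entails(document, sub):   # strict support, not "vaguely related"
                supported = True
                break
        if not supported:
            return False                 # one unsupported piece sinks the claim
    return True
```

The strictness lives in `entails`: requiring that a full document actually supports each atomic piece is what separates this from a tired skim that accepts "vaguely related" evidence.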

📈 The Results: A Win-Win-Win

When they ran this system:

  1. The Humans got smarter: When experts acted as "Auditors" (checking the AI's work) instead of "Labelers" (guessing the answer from scratch), their accuracy jumped from 60% to 90%. They were no longer tired; they were focused on the hard disputes.
  2. The AI got smarter: The new AI agent (DeepFact-Eval) beat all other fact-checkers.
  3. The Truth got clearer: The "Benchmark" (the test) became more accurate over time because the "mistakes" in the test were fixed by the AI and the human auditors working together.

🌟 The Big Picture Takeaway

This paper teaches us that we shouldn't expect humans or AI to be perfect on the first try.

  • Old Mindset: "Here is the test. Take it. If you fail, you failed."
  • New Mindset (DeepFact): "Let's work together. You try to solve it, I'll check your work, we'll argue about the hard parts, and together we will figure out the real answer."

It turns fact-checking from a one-time exam into a continuous conversation where both the teacher (the benchmark) and the student (the AI) learn and improve together.
