DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

This paper introduces DeepFact, a framework that addresses the brittleness of static factuality benchmarks for deep research reports. Its Evolving Benchmarking via Audit-then-Score (AtS) methodology significantly improves expert verification accuracy and enables a high-performing document-level verification agent.

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

Published 2026-03-09

🕵️‍♂️ The Problem: The "Super-Researcher" Who Might Be Lying

Imagine you have a brilliant, tireless research assistant (an AI agent) who can read thousands of scientific papers in seconds and write a 50-page report on a complex topic like "Climate Change Solutions" or "New Cancer Treatments." This is what Deep Research Agents do today. They are amazing, but they have a fatal flaw: they sometimes make things up.

They might mix up two different studies, invent a fake statistic, or cite a paper that doesn't exist.

The big question is: How do we catch these lies?

Usually, we hire human experts (PhD scientists) to read the AI's report and check if the facts are true. But the authors of this paper discovered a shocking truth: Even the experts are bad at this job.

🧠 The "Expert Fatigue" Experiment

The researchers hired PhD students to act as fact-checkers. They gave them a hidden "test" of claims they knew were true or false.

  • The Result: The experts only got 60% of the test questions right.
  • Why? Reading a 50-page report is like trying to find a needle in a haystack while running a marathon. The experts got tired, missed details, or got confused by the sheer volume of information.

If the "Gold Standard" (the experts) is only 60% accurate, how can we trust the AI? We can't. We need a better way to grade both the AI and the humans.


🔄 The Solution: The "Living" Report Card

Instead of treating the "correct answer" as a static, unchangeable fact (like a math problem), the authors propose Evolving Benchmarking.

Think of the "Truth" not as a stone statue, but as a living, breathing Wikipedia page that gets updated every time someone finds a better piece of evidence.

They call this system Audit-then-Score (AtS). Here is how it works, using a courtroom analogy:

⚖️ The Courtroom of Truth

  1. The Judge (The Benchmark): The current "Truth" is the Judge's current ruling.
  2. The Prosecutor (The AI Agent): The AI tries to prove a claim is true or false. It presents its evidence.
  3. The Conflict: If the AI disagrees with the Judge's current ruling, it doesn't just get marked wrong. It gets to appeal.
  4. The Appeal (The Audit): The AI says, "Wait! I found a new paper that proves the Judge is wrong!"
  5. The Jury (The Human Expert): A human expert steps in. They don't just guess; they look at the AI's new evidence.
    • If the AI's evidence is weak, the Judge's ruling stands.
    • If the AI's evidence is stronger than the old one, the Judge changes their mind. The "Truth" is updated.
  6. The Score: Now that the Truth has been updated, the AI is graded against the new Truth.

The Magic: By doing this over and over, the "Truth" gets better and better. The experts stop being tired labelers and become smart auditors who only check the hard parts. The AI gets smarter because it learns from the corrections.
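The courtroom steps above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the names (`Claim`, `expert_review`, `audit_then_score`) and the length-based "stronger evidence" heuristic are all hypothetical stand-ins; in practice the jury step is a human auditor comparing both sides.

```python
# Hypothetical sketch of the Audit-then-Score (AtS) loop.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    label: bool      # the Judge's current ruling (the benchmark label)
    evidence: str    # evidence backing the current ruling

def expert_review(old_evidence: str, new_evidence: str) -> bool:
    """The Jury: decide whether the appeal's evidence is stronger.
    Stubbed with a placeholder heuristic; really a human judgment."""
    return len(new_evidence) > len(old_evidence)

def audit_then_score(claim: Claim, agent_label: bool, agent_evidence: str) -> bool:
    """Score the agent's verdict, letting disagreements trigger an appeal."""
    if agent_label == claim.label:
        return True  # agreement: no audit needed, the agent scores
    # Disagreement: the agent appeals with its own evidence.
    if expert_review(claim.evidence, agent_evidence):
        # The appeal succeeds: the benchmark itself is updated.
        claim.label = agent_label
        claim.evidence = agent_evidence
        return True
    return False  # the old ruling stands; the agent is marked wrong
```

Note the key design choice: a "wrong" answer from the agent is never final on its own. It either loses an audit (and the old truth is confirmed) or wins one (and the truth improves), which is exactly why the benchmark gets better over time.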


🛠️ The Tools: DeepFact-Bench and DeepFact-Eval

The paper introduces two main tools built on this idea:

1. DeepFact-Bench (The Evolving Playground)

This is the dataset where the testing happens.

  • Old Way: A static test where the answers are set in stone. If the test is flawed, everyone fails.
  • DeepFact Way: A dynamic test where the answers are revisable. If an AI finds a mistake in the test itself, the test gets fixed! It's like a video game that updates its own rules to be fairer as players get better.

2. DeepFact-Eval (The Super-Inspector)

This is a new AI agent designed specifically to fact-check.

  • How it works: Instead of just skimming a sentence (like a human might when tired), DeepFact-Eval acts like a detective.
    • It breaks a claim into tiny pieces.
    • It searches the entire internet for evidence.
    • It reads full documents, not just snippets.
    • It checks if the evidence actually supports the claim or if it's just "vaguely related."
  • The Result: It is much better at finding the truth than previous tools, and it works fast enough to be practical.
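The detective routine above can be sketched as a verification pipeline. This is only an illustrative outline under assumed interfaces: `decompose`, and the injected `search`, `fetch`, and `entails` callables, are hypothetical stand-ins for the agent's actual tools (claim decomposition, web search, full-document retrieval, and entailment checking).

```python
# Hypothetical sketch of a decompose-search-read-verify pipeline.

def decompose(claim: str) -> list[str]:
    """Break a compound claim into atomic sub-claims (naive split here)."""
    return [part.strip() for part in claim.split(" and ") if part.strip()]

def verify_claim(claim: str, search, fetch, entails) -> bool:
    """A claim holds only if EVERY sub-claim is supported by some document."""
    for sub in decompose(claim):
        supported = False
        for url in search(sub):          # search for candidate sources
            document = fetch(url)        # read the full document, not a snippet
            if entails(document, sub):   # strict support, not "vaguely related"
                supported = True
                break
        if not supported:
            return False                 # one unsupported piece sinks the claim
    return True
```

The strictness lives in `entails`: requiring that a full document actually supports each atomic piece is what separates this from a tired skim that accepts "vaguely related" evidence.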

📈 The Results: A Win-Win-Win

When they ran this system:

  1. The Humans got smarter: When experts acted as "Auditors" (checking the AI's work) instead of "Labelers" (guessing the answer from scratch), their accuracy jumped from 60% to 90%. They were no longer tired; they were focused on the hard disputes.
  2. The AI got smarter: The new AI agent (DeepFact-Eval) beat all other fact-checkers.
  3. The Truth got clearer: The "Benchmark" (the test) became more accurate over time because the "mistakes" in the test were fixed by the AI and the human auditors working together.

🌟 The Big Picture Takeaway

This paper teaches us that we shouldn't expect humans or AI to be perfect on the first try.

  • Old Mindset: "Here is the test. Take it. If you fail, you failed."
  • New Mindset (DeepFact): "Let's work together. You try to solve it, I'll check your work, we'll argue about the hard parts, and together we will figure out the real answer."

It turns fact-checking from a one-time exam into a continuous conversation where both the teacher (the benchmark) and the student (the AI) learn and improve together.
