Imagine you've hired a super-smart, tireless research assistant (an AI) to write a 20-page expert report on a complex topic, like "The future of fusion energy" or "The psychological effects of social media." You ask it to be thorough, accurate, and well-sourced.
The problem? How do you know if the report is actually good?
If you just read it, you might miss subtle errors. If you ask another AI to grade it, it might be too polite or miss the deep technical details. And if you ask a human expert, it takes them days to read and grade every single report.
This paper introduces DEER (Deep research Expert Report benchmark), which is essentially a super-charged, expert-designed grading system to test how well these AI research assistants are actually doing their jobs.
Here is a breakdown of how DEER works, using some everyday analogies:
1. The Problem: The "Vague Teacher"
Previously, grading these AI reports was like having a teacher who says, "Write a good essay," but doesn't tell you what "good" means.
- The Issue: One AI might write a report that looks beautiful but is full of lies. Another might write a boring report that is 100% true. Old grading systems couldn't tell the difference well.
- The Fix: DEER is like a strict, detailed rubric created by a panel of real-world experts (scientists, historians, engineers). Instead of saying "be good," it says: "You must have 5 sources, you must explain the history, you must not make up numbers, and your conclusion must match your evidence."
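To make the "detailed rubric" idea concrete, here is a minimal sketch of how such a checklist might be represented as data and pre-screened mechanically. All field names and rules below are illustrative inventions, not DEER's actual schema; judgment-heavy criteria would still go to a grader.

```python
# Illustrative sketch: an expert rubric as data instead of "be good".
# Criterion names, the report dict shape, and the checks are all
# hypothetical examples, not DEER's real format.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str                       # e.g. "minimum sourcing"
    description: str                # what an expert would look for
    check: Callable[[dict], bool]   # a mechanical pre-check where one exists

rubric = [
    Criterion("minimum sourcing", "cites at least 5 distinct sources",
              lambda r: len(set(r["sources"])) >= 5),
    Criterion("historical context", "explains the history of the topic",
              lambda r: "history" in r["sections"]),
    Criterion("no invented numbers", "every figure traces to a source",
              lambda r: all(n in r["sourced_numbers"] for n in r["numbers"])),
]

def pre_screen(report: dict) -> dict:
    """Run the mechanical checks; subtler criteria still need a grader."""
    return {c.name: c.check(report) for c in rubric}

report = {
    "sources": ["a", "b", "c", "d", "e"],
    "sections": ["history", "outlook"],
    "numbers": [0.42],
    "sourced_numbers": [0.42],
}
print(pre_screen(report))
```

The point of the data-first design is that "good" stops being a vibe and becomes a list of named, checkable expectations.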
2. The Two-Part Test: The "Chef's Kitchen"
DEER judges the AI reports using two distinct methods, like a food critic tasting a dish and a health inspector checking the kitchen.
Part A: The "Flavor & Presentation" (Report Quality)
This checks if the report is well-written and actually answers the question.
- The Analogy: Imagine a chef is asked to make a "Spicy Thai Curry."
- Did they use the right ingredients? (Completeness)
- Is the story of the dish logical? (Reasoning)
- Is the plating neat? (Structure & Style)
- The Innovation: DEER doesn't just ask a generic AI to taste it. It gives the AI a special cheat sheet (Expert Evaluation Guidance) that says, "If the curry isn't spicy, give it a low score," or "If they didn't mention the fish sauce, it's a fail." This ensures the AI grader knows exactly what to look for, just like a human expert would.
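The "cheat sheet" idea amounts to embedding the expert guidance directly into the grader's prompt rather than asking a generic "is this good?" question. A rough sketch of that prompt assembly, with invented guidance strings and no real model call:

```python
# Sketch of rubric-guided judging: the expert guidance is baked into the
# grader prompt. The guidance strings and prompt wording are invented
# examples for illustration, not DEER's actual evaluation guidance.

def build_judge_prompt(question: str, report: str, guidance: list[str]) -> str:
    rules = "\n".join(f"- {g}" for g in guidance)
    return (
        "You are grading a research report.\n"
        f"Question: {question}\n\n"
        "Apply these expert rules, in order, and justify each score:\n"
        f"{rules}\n\n"
        f"Report:\n{report}\n\n"
        "Return one score per rule on a 1-5 scale."
    )

guidance = [
    "If the report never compares tokamak and stellarator designs, cap Completeness at 2.",
    "If a causal claim has no supporting evidence, deduct from Reasoning.",
]

prompt = build_judge_prompt(
    "The future of fusion energy",
    "Fusion power will arrive by 2040 because funding is rising...",
    guidance,
)
print(prompt)
```

With the rules inlined, the AI grader is forced to look for the same specifics a human expert would, instead of rewarding whatever merely sounds polished.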
Part B: The "Fact-Check Kitchen" (Information Verification)
This is the most unique part. It checks if the facts are actually true.
- The Analogy: Imagine the chef claims, "I used fresh, organic tomatoes from the local farm."
- Old systems would only check the tomatoes the chef labeled as "organic."
- DEER is like a detective who checks every single sentence. If the chef says, "The tomatoes were red," without attaching a label to that claim, DEER looks at the previous sentence where the chef mentioned the farm and connects the dots.
- The "Back-Tracking" Trick: Sometimes an AI writes a fact without a citation, but the evidence was mentioned three sentences ago. DEER uses a "Back-Tracking" mechanism to find that hidden evidence. It's like a detective saying, "You didn't show me the receipt for the tomatoes, but you mentioned buying them in paragraph 2, so I'm going to check that receipt."
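The back-tracking idea can be sketched in a few lines: when a sentence makes a claim with no citation of its own, scan backwards through the nearest preceding sentences for one that does carry a citation, and verify the claim against that source instead of skipping it. The window size and the numeric `[n]` citation format below are assumptions for illustration, not DEER's actual mechanism.

```python
# Minimal sketch of back-tracking for uncited claims.
# Assumes numeric citations like "[2]" and a small look-back window;
# both are illustrative choices, not taken from the paper.

import re

CITE = re.compile(r"\[(\d+)\]")

def backtrack_evidence(sentences: list[str], idx: int, window: int = 3):
    """Return the citation id supporting sentence `idx`, searching backwards."""
    for j in range(idx, max(idx - window, 0) - 1, -1):
        m = CITE.search(sentences[j])
        if m:
            return int(m.group(1))   # nearest preceding (or same-sentence) citation
    return None                      # genuinely unsupported claim

sentences = [
    "The tomatoes came from Rosa's farm [2].",
    "They were picked the same morning.",
    "The tomatoes were red.",        # no citation of its own
]
print(backtrack_evidence(sentences, 2))
```

Here the third sentence inherits citation 2 from the first, so the fact-checker verifies it against that source rather than counting it as unverifiable.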
3. The Results: What Did We Learn?
The researchers tested top AI models (like those from OpenAI, Google, and Anthropic) using DEER. Here is what they found:
- The Good News: The AIs are getting very good at formatting. They know how to make a report look professional, use nice headings, and write in a polite tone. They are like excellent typists.
- The Bad News: The AIs are still struggling with deep thinking.
- They often miss the specific details the user asked for (like a chef forgetting the "spicy" part).
- They sometimes make logical jumps (saying "A causes B" without proving it).
- They tend to rely on too few sources, like a student who only reads one Wikipedia page and calls it a "research paper."
4. Why This Matters
Before DEER, we were just guessing which AI was the best researcher. Now, we have a diagnostic tool.
Think of DEER as an X-ray machine for AI reports. It doesn't just give you a grade of "A" or "F." It tells you exactly where the AI failed:
- "This AI is great at writing, but it hallucinates facts."
- "This AI is great at finding facts, but it can't organize them logically."
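The X-ray metaphor boils down to keeping per-dimension scores and flagging the weak ones, instead of collapsing everything into a single grade. A toy sketch, with dimension names, scores, and the pass threshold all invented for illustration:

```python
# Sketch of diagnostic (per-dimension) reporting. The dimensions,
# scores, and threshold below are made-up examples, not DEER's output.

def diagnose(scores: dict[str, float], threshold: float = 3.0) -> list[str]:
    """Flag every dimension scoring below the pass threshold (1-5 scale)."""
    return [dim for dim, s in sorted(scores.items()) if s < threshold]

report_a = {"structure": 4.6, "style": 4.8, "factuality": 1.9, "reasoning": 3.4}
report_b = {"structure": 2.1, "style": 2.4, "factuality": 4.2, "reasoning": 2.8}

print(diagnose(report_a))  # the "writes well, hallucinates" profile
print(diagnose(report_b))  # the "accurate but disorganized" profile
```

The same overall average can hide two very different failure modes; the per-dimension view is what makes the benchmark a diagnostic tool rather than a leaderboard.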
Summary
DEER is a new standard for testing AI researchers. It combines a detailed expert checklist with a detective-style fact-checker to ensure that when an AI writes a report, it isn't just sounding smart; it is actually accurate, thorough, and helpful. It moves us from asking "Does this look good?" to "Is this actually true and useful?"