The Big Idea: The "Blind Chef" Test
Imagine you have a world-famous chef (the Original Paper) who has created a masterpiece dish. Now, imagine you want to test a new AI robot chef (the Coding Agent) to see if it can recreate that dish perfectly.
But here's the catch: You don't give the robot the full recipe or the ingredients. Instead, you give it a very short summary note (the Overview) and a few photos of the final plate (the Figures/Tables).
The robot has to write the full recipe and cook the dish from scratch based only on that tiny note.
This paper introduces a new way to grade these robot chefs. Instead of just asking, "Does the dish look good?" or "Does it taste good?", the researchers created a two-part grading system to catch a specific problem: Hallucinations (making things up).
The Two Grading Axes
The researchers realized that a robot chef could be great at one of these things but terrible at the other. They split the grading into two separate categories:
1. Presentation (The "Plating" Score)
- What it is: How beautiful and well-organized the dish looks. Is the sauce drizzled nicely? Is the text written in a professional font? Does it sound like a real scientific paper?
- The Analogy: Imagine a robot that writes a recipe that sounds incredibly fancy, uses big words, and follows the perfect structure. It gets a high score here because it looks like a real paper.
- The Problem: Just because it looks good doesn't mean the ingredients are real.
2. Hallucination (The "Fake Ingredient" Score)
- What it is: Did the robot invent facts that aren't true? Did it claim the dish uses "dragon scales" when the original paper said "chicken"?
- The Analogy: This is where the robot gets caught. It might say, "I used 500 grams of gold dust," when the original paper said "5 grams of salt."
- The Catch: The researchers found that the robots that write the best-looking papers are often the ones inventing the most fake facts.
The Big Discovery: The "Style vs. Truth" Trade-off
The researchers tested two famous AI "chefs": Claude Code and Codex.
Claude Code (The Fancy Writer):
- Presentation: ⭐⭐⭐⭐⭐ (Excellent! It writes beautifully and captures the tone perfectly.)
- Hallucinations: ⚠️🚨 (Dangerous! It invents about 10 fake facts per paper on average.)
- Analogy: It's like a novelist who writes a gripping story about a historical event, but they accidentally invent a whole new war that never happened. It reads great, but it's historically wrong.
Codex (The Boring Truth-Teller):
- Presentation: ⭐⭐⭐ (Okay. It's a bit dry and misses some details.)
- Hallucinations: ✅ (Safe! It only invents about 3 fake facts per paper.)
- Analogy: It's like a strict accountant who writes a report that is boring and lacks flair, but the numbers are mostly correct.
The Conclusion: As AI gets smarter, it gets better at sounding smart, but it doesn't necessarily get better at being accurate. In fact, the smarter it gets at writing, the more confidently it lies.
How They Tested It (The "PaperWrite-Bench" Method)
To prove this, they built a benchmark called PaperWrite-Bench.
- The Setup: They took 51 real, high-quality scientific papers (from top conferences like NeurIPS and CVPR).
- The Compression: They stripped these papers down to a tiny "Overview" file (like a cheat sheet).
- The Challenge: They asked the AI agents to rebuild the entire paper from just that cheat sheet.
- The Comparison: They compared the AI's rebuilt paper against the original.
- Did it include the right graphs? (Presentation)
- Did it invent fake numbers or fake citations? (Hallucination)
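The two-axis grading above can be sketched as a simple scoring loop. This is only an illustrative sketch, not the paper's actual implementation: the function names (`score_presentation`, `count_hallucinations`), the section checklist, and the toy claims are all assumptions made for clarity.

```python
# Hypothetical sketch of the two-axis grading described above.
# All names and data here are illustrative, not the benchmark's real code.

# Assumed structural checklist for the "Presentation" axis.
REQUIRED_SECTIONS = {"abstract", "method", "experiments", "conclusion"}

def score_presentation(generated_sections):
    """Fraction of expected structural sections the rebuilt paper contains."""
    present = REQUIRED_SECTIONS & set(generated_sections)
    return len(present) / len(REQUIRED_SECTIONS)

def count_hallucinations(generated_claims, source_claims):
    """Count claims in the rebuilt paper with no support in the source overview."""
    return sum(1 for claim in generated_claims if claim not in source_claims)

# Toy example: a paper that looks perfectly structured but invents one fact.
sections = ["abstract", "method", "experiments", "conclusion"]
claims = ["uses 5 grams of salt", "tested on 51 papers", "achieves 99% accuracy"]
source = {"uses 5 grams of salt", "tested on 51 papers"}

presentation = score_presentation(sections)            # 1.0 — looks flawless
hallucinations = count_hallucinations(claims, source)  # 1 invented claim
```

The point of separating the two scores is exactly the trade-off the paper reports: a rebuilt paper can score a perfect 1.0 on structure while still smuggling in unsupported claims.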
Why This Matters
This paper is a wake-up call for the scientific community.
- The Risk: If we let AI write papers without checking, we might end up with a flood of beautiful-looking papers that are full of lies.
- The Solution: We need new tools (like the one they built) that don't just check if a paper "sounds good," but actually fact-check every single sentence against the source material.
In short: The paper warns us that in the age of AI, a beautiful lie is more dangerous than a boring truth. We need to stop grading papers just on how they look and start grading them on whether they are actually true.