Re2: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions

To address limitations of existing peer review datasets, such as low diversity and version inconsistency, the paper introduces Re2, the largest consistency-ensured dataset of initial submissions, reviews, and multi-turn rebuttal discussions, drawn from 24 conferences, to support the development of effective LLM assistants for both authors and reviewers.

Daoze Zhang, Zhijian Bao, Sihang Du, Zhiyi Zhao, Kuangling Zhang, Dezheng Bao, Yang Yang

Published 2026-03-16

Imagine the world of scientific research as a massive, bustling Grand Library. Every day, thousands of new books (research papers) are written and dropped off at the front desk, hoping to be added to the permanent collection.

To get into the collection, these books must pass a strict Gatekeeper System called "Peer Review." Here's how it usually works:

  1. The Author submits their book.
  2. The Reviewers (experts who volunteer their time) read it and write a report: "This is great!" or "This needs work."
  3. The Author gets the report, fixes the book, and writes a Rebuttal (a letter saying, "I fixed it! Here's why you were wrong about that one thing").
  4. The Reviewer reads the rebuttal, maybe changes their mind, and gives a final verdict.

The Problem: The Library is Overwhelmed

Right now, the library is drowning. Too many books are being submitted, and there aren't enough Gatekeepers to read them.

  • The Bottleneck: Reviewers are exhausted, leading to slower decisions and lower quality checks.
  • The Cycle of Failure: Many authors submit books that aren't ready yet because they don't have a good way to check their own work before handing it in. They get rejected, fix it, and resubmit the same book multiple times, wasting everyone's time.

Scientists have tried to use AI (Large Language Models) to help. They want the AI to act like a "Pre-Check" for authors or a "Helper" for reviewers. But there was a big problem: The AI was being trained on bad recipes.

The "Bad Recipe" Problem

Imagine you are trying to teach a cooking student how to make a perfect cake.

  • Old Datasets: They showed the AI the cake only after the baker had already fixed the burnt parts and added extra frosting (that is, the revised paper rather than the original submission). The AI learns to judge the finished cake, not the raw batter. When it later faces a genuinely raw cake (a fresh initial submission), it gets confused, because its training data never matched what reviewers actually see on day one.
  • The Missing Step: Most old datasets also ignored the "Rebuttal" phase. They didn't teach the AI how to handle the back-and-forth argument between the author and the reviewer. It was like teaching a lawyer to write a closing statement but never letting them practice cross-examination.

The Solution: Introducing "Re2"

The authors of this paper built Re2, which is like a massive, perfectly organized training kitchen for AI.

Here is what makes Re2 special, using simple analogies:

1. The "Fresh Batter" Rule (Consistency)

In the old days, datasets mixed up "raw batter" (initial submissions) with "finished cakes" (revised versions). Re2 is strict: Every single paper in this dataset is the "Fresh Batter" version.

  • Why it matters: It ensures the AI learns to judge a paper exactly as a human reviewer sees it on day one. There is no confusion about which version is being discussed.
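The "fresh batter" rule above can be sketched as a simple filter: keep a paper version only if it predates the earliest review, so the reviews always refer to what the reviewer actually read. The function and dates below are illustrative assumptions, not the dataset's actual schema or pipeline.

```python
from datetime import date

def is_initial_submission(version_date: date, first_review_date: date) -> bool:
    """A version counts as the initial submission only if it was uploaded
    no later than the earliest review written about it."""
    return version_date <= first_review_date

# A version uploaded before the first review is consistent with the reviews.
print(is_initial_submission(date(2024, 5, 1), date(2024, 6, 15)))  # True
# A version uploaded after the first review is a revision and gets filtered out.
print(is_initial_submission(date(2024, 7, 1), date(2024, 6, 15)))  # False
```

In practice a dataset builder would apply a check like this to every paper, discarding any record where the stored text postdates the reviews it is paired with.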

2. The "Giant Mosaic" (Diversity)

Old datasets were like a puzzle with only a handful of pieces, nearly all from one box (mostly a single conference, ICLR). Re2 is a giant mosaic made of 19,926 papers from 45 different conferences and workshops.

  • Why it matters: The AI learns that "good writing" looks different from one research community and venue to another. It becomes a true generalist, not just an expert in one niche.

3. The "Conversation Simulator" (Multi-turn Rebuttal)

This is the biggest innovation. Re2 doesn't just show the Reviewer's note and the Author's reply as two separate documents. It stitches them together into a continuous, multi-turn conversation, like a chat log.

  • The Analogy: Imagine training a customer service bot. Old datasets gave it a script of "Complaint -> Response." Re2 gives it the entire phone call, including the pauses, the "Wait, I meant this..." moments, and the final agreement.
  • Why it matters: This allows AI to learn how to argue, negotiate, and clarify just like a real human reviewer. It can now act as a dynamic assistant that helps authors refine their work before they even submit it.
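The "chat log" framing above can be made concrete with a small sketch of how one rebuttal thread might be represented as alternating turns. The field names (`paper_id`, `thread`, `role`, `text`) are illustrative assumptions, not the dataset's actual schema.

```python
# One multi-turn rebuttal discussion, stored as an ordered chat log
# rather than as two disconnected documents.
record = {
    "paper_id": "demo-0001",
    "venue": "Example Conference 2024",
    "thread": [
        {"role": "reviewer", "text": "The ablation in Sec. 4 is missing a baseline."},
        {"role": "author",   "text": "We added the baseline; see the new Table 3."},
        {"role": "reviewer", "text": "Thanks, that addresses my concern."},
    ],
}

def alternating_roles(thread):
    """Check that the discussion alternates between reviewer and author,
    which is what makes it usable as multi-turn training data."""
    return all(a["role"] != b["role"] for a, b in zip(thread, thread[1:]))

print(alternating_roles(record["thread"]))  # True
```

Stored this way, each turn carries the full context of everything said before it, which is exactly what a model needs to learn negotiation rather than one-shot responses.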

What Did They Do With It?

They took standard open AI models (like LLaMA and Qwen) and fine-tuned them on this Re2 training data.

  • Before Training: The AI was like a polite but clueless intern who would say "Great job!" to everything just to be nice.
  • After Training: The AI became a sharp, critical editor. It could predict whether a paper would be accepted, estimate its review score, write a review that sounded human, and even engage in a back-and-forth debate about the paper's flaws.
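Fine-tuning on a rebuttal thread typically means flattening it into chat-style messages. The sketch below uses the common OpenAI-style message convention as an assumed target format; the paper's actual preprocessing may differ.

```python
def thread_to_messages(paper_text, thread):
    """Turn one rebuttal thread into chat messages for supervised
    fine-tuning, with the reviewer as the assistant to be imitated."""
    messages = [{"role": "system",
                 "content": "You are a peer reviewer discussing a submission."}]
    messages.append({"role": "user", "content": paper_text})
    for turn in thread:
        # The model learns the reviewer side; the author side is input context.
        role = "assistant" if turn["role"] == "reviewer" else "user"
        messages.append({"role": role, "content": turn["text"]})
    return messages

msgs = thread_to_messages(
    "Title: Demo Paper ...",
    [{"role": "reviewer", "text": "The baselines are weak."},
     {"role": "author",   "text": "We added two stronger baselines."}],
)
print(len(msgs))  # 4
```

The choice of which side plays "assistant" is what decides whether the resulting model acts as a reviewer co-pilot or an author coach.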

The Bottom Line

Re2 is the first time we have a massive, high-quality, "real-world" dataset that captures the entire lifecycle of a scientific paper—from the first draft to the final argument.

By using this dataset, we can build AI tools that:

  1. Help Authors: Act as a "pre-submission coach" to fix mistakes early, so fewer bad papers get submitted.
  2. Help Reviewers: Act as a "co-pilot" to draft reviews and handle rebuttals, reducing the crushing workload on human volunteers.

In short, Re2 is the training manual that finally teaches AI how to be a real, helpful member of the scientific community, rather than just a fancy text generator.
