Imagine you are an author who just submitted a research paper to a big scientific conference. You receive a review back. It's polite, but it's also vague. The reviewer says, "Your experiments need more work," or "The writing could be clearer."
You nod, but you're stuck. How do you fix it? Should you run a new experiment? Which one? Rewrite the whole introduction, or just one paragraph? Without specific guidance, the feedback feels like a weather report ("It might rain") rather than a map ("Bring an umbrella and turn left at Main Street").
This is the problem the paper RBTACT tries to solve.
Here is the story of how they fixed it, using some simple analogies.
1. The Problem: The "Vague Chef"
Currently, we use AI (Large Language Models) to write these reviews. But these AI "chefs" often serve up generic dishes. They say, "Add more salt," but they don't tell you how much salt, where to put it, or what dish you are cooking. The result is a review that sounds nice but doesn't actually help the author improve the paper.
2. The Secret Ingredient: The "Rebuttal"
In the world of academic publishing, after a paper gets rejected or needs changes, the author gets a chance to write a rebuttal. This is their reply to the reviewers.
- The Insight: The authors of this paper realized that the rebuttal is a goldmine of truth.
- If an author says, "You're right, I will add a new experiment in Section 3," that means the reviewer's comment was actionable (it worked!).
- If an author says, "No, you misunderstood, my paper is fine," that means the comment drew a defensive reply (it didn't lead to a fix).
Think of the rebuttal as a feedback loop. It tells us exactly which comments from the "past" actually caused a "change" in the "future."
3. The Solution: RBTACT (The "Rebuttal Teacher")
The team built a new AI system called RBTACT. Instead of just reading thousands of papers and guessing what a good review looks like, they taught the AI using the rebuttals as a teacher.
Here is how they trained it, step-by-step:
Step A: The "Matchmaker" (Building the Dataset)
They took 75,000 pairs of reviews and rebuttals from a real conference (ICLR 2024). They acted like a matchmaker, connecting specific sentences in the review to specific sentences in the rebuttal.
- Review: "Your graph is hard to read."
- Rebuttal: "We have redrawn Figure 2 with larger fonts and a better color scheme."
- Result: The AI learns that "hard to read" + "redrawn with larger fonts" = Success.
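To make the matchmaking idea concrete, here is a toy sketch in Python. This is not the paper's actual pipeline: the phrase lists and the `label_comment` function are invented for illustration, standing in for whatever matching and labeling the authors really did at scale.

```python
# Toy heuristic: label a review comment by how the matched rebuttal
# sentence responds to it. "Actionable" if the author commits to a
# concrete change, "contested" if the author pushes back.
# The marker phrases below are made up for this example.
ACTION_MARKERS = ["we have added", "we added", "we will add",
                  "we have redrawn", "we updated", "we revised", "we fixed"]
PUSHBACK_MARKERS = ["we disagree", "misunderstood", "respectfully",
                    "the paper is correct"]

def label_comment(rebuttal_sentence: str) -> str:
    s = rebuttal_sentence.lower()
    if any(m in s for m in ACTION_MARKERS):
        return "actionable"
    if any(m in s for m in PUSHBACK_MARKERS):
        return "contested"
    return "unclear"

pair = {
    "review": "Your graph is hard to read.",
    "rebuttal": "We have redrawn Figure 2 with larger fonts.",
}
print(label_comment(pair["rebuttal"]))  # actionable
```

Run over 75,000 review–rebuttal pairs, even a crude labeler like this would yield a training signal: which review sentences actually moved the author to act.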
Step B: The "Perspective" Filter
A full review is a messy mix of complaints about math, writing, and graphs. The authors realized it's easier to learn if you focus on one thing at a time. So, they taught the AI to generate reviews based on specific perspectives (like "The Experiments" or "The Writing").
- Analogy: Instead of asking a mechanic to "fix the car" (which is vague), you ask them to "fix the brakes" or "fix the engine." The AI learns to give specific advice for specific parts of the paper.
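A minimal way to picture the perspective split is keyword bucketing. The buckets and keywords below are invented for illustration; the paper's actual perspective definitions may be richer (and likely learned rather than hand-coded).

```python
# Toy sketch: route each review sentence into one or more "perspectives"
# so the model can learn one aspect at a time. Keyword lists are
# illustrative only.
PERSPECTIVES = {
    "experiments": ["experiment", "baseline", "ablation", "dataset"],
    "writing": ["writing", "clarity", "typo", "paragraph"],
    "figures": ["figure", "graph", "plot", "table"],
}

def bucket(sentence: str) -> list[str]:
    s = sentence.lower()
    return [name for name, keywords in PERSPECTIVES.items()
            if any(k in s for k in keywords)]

print(bucket("The ablation experiments are missing a baseline."))
```

The payoff is the same as the mechanic analogy: a model asked only about "the brakes" (one perspective) can give sharper advice than one asked to judge the whole car at once.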
Step C: The "Preference" Training (The Real Magic)
This is the most clever part. They didn't just show the AI the right answers; they showed it comparisons.
- They showed the AI two possible reviews for the same paper.
- Review A: "Your experiments are weak." (Author replies: "We will try to fix this later.") -> Weak result.
- Review B: "Your experiment in Section 4 lacks a control group. Please add a control group using Dataset X." (Author replies: "We added the control group and updated Table 2.") -> Strong result.
- The AI learned: "Oh! Review B is better because it actually got the author to do something."
They used a technique called Direct Preference Optimization (DPO). Imagine a coach telling a player, "Don't just kick the ball; kick it here to score a goal." The AI learned to prioritize comments that lead to concrete actions.
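For the curious, the DPO objective for a single preference pair can be sketched in a few lines. The log-probability values in the example are made up; in real training they come from the language model being tuned and a frozen reference copy of it.

```python
import math

def dpo_loss(chosen_logp: float, rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    chosen_logp / rejected_logp: log-probs of the actionable vs. vague
    review under the model being trained; ref_* are the same quantities
    under the frozen reference model.
    """
    margin = ((chosen_logp - ref_chosen_logp)
              - (rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)): small loss when the model prefers
    # the actionable review more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Model favors the actionable review (Review B) relative to the reference:
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Minimizing this loss nudges the model toward reviews like Review B (which got the author to act) and away from reviews like Review A, without any separate reward model.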
4. The Result: The "GPS" Reviewer
When they tested RBTACT, the results were impressive.
- Old AI: "Your writing is unclear." (Author is confused).
- RBTACT: "In the third paragraph of the Introduction, the sentence about 'neural networks' is ambiguous. Please clarify if you mean 'convolutional' or 'recurrent' networks, and add a citation to Smith et al." (Author knows exactly what to do).
The AI didn't just sound smarter; it became more useful. It gave advice that authors could actually implement, like a GPS giving turn-by-turn directions instead of just saying, "Drive toward the city."
Summary
RBTACT is like a training program for AI reviewers. Instead of guessing what makes a good review, it looks at the "receipts" (the author's rebuttals) to see which comments actually led to changes. By learning from these real-world outcomes, the AI learned to stop giving vague advice and start giving a "to-do list" that authors can actually follow.
The takeaway: If you want to teach an AI to be helpful, don't just show it what to say. Show it what worked in the past.