A Rubric-Supervised Critic from Sparse Real-World Outcomes

This paper introduces a rubric-supervised critic model trained on sparse real-world interaction data using 24 behavioral features, which significantly improves coding agent performance through enhanced reranking, early stopping, and data curation.

Xingyao Wang, Valerie Chen, Heng Ji, Graham Neubig

Published 2026-03-05

Imagine you are teaching a robot assistant how to write code.

The Problem: The "Exam" vs. The "Real Job"

In school (or academic benchmarks), we teach robots by giving them a test. If the robot writes code that passes the test, it gets an "A." If it fails, it gets an "F." This is easy to measure: the test either passes or it doesn't.

But in the real world, coding isn't a multiple-choice exam. It's more like a collaborative art project between a human and the robot.

  • The human says, "Build me a house."
  • The robot builds a shed.
  • The human says, "No, I wanted a garage."
  • The robot builds a garage, but forgets the door.
  • The human fixes the door, and eventually, they live there.

In this real-world scenario, there is no single "pass/fail" button. The success signal is sparse (rare), delayed (you only know it worked weeks later when the code is merged), and noisy (the human might be grumpy even if the code is good, or happy even if the code is messy).

Because we don't have a clear "A" or "F" for every step the robot takes, we can't easily teach it how to get better. We are flying blind.

The Solution: The "Rubric-Supervised Critic"

The authors of this paper propose a solution: The Critic.

Think of the Critic not as a teacher giving a final grade, but as a sharp-eyed film director sitting on the set, watching the robot work in real-time.

1. Breaking the Movie into Scenes (Segments)

Instead of watching the whole movie (the entire conversation) and guessing if it was good, the Critic breaks the interaction into small scenes (called "segments").

  • Scene 1: User asks for a function. Robot writes it.
  • Scene 2: User asks to fix a bug. Robot fixes it.
  • Scene 3: User asks to add a feature. Robot adds it.
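The "scenes" idea above can be sketched as splitting a transcript at each new user request. This is a minimal illustration under a simplifying assumption (a new segment starts at every user turn); the paper's actual segmentation procedure may be more sophisticated.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str

@dataclass
class Segment:
    turns: list = field(default_factory=list)

def segment_conversation(turns):
    """Split a conversation into segments, opening a new segment at
    every user turn (an illustrative assumption, not the paper's rule)."""
    segments = []
    for turn in turns:
        if turn.role == "user" or not segments:
            segments.append(Segment())
        segments[-1].turns.append(turn)
    return segments
```

Each resulting `Segment` then pairs one user request with the agent's response to it, which is the unit the Critic scores.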

2. The "Rubric" (The Checklist)

Here is the magic trick. The Critic doesn't just wait for the final "Pass/Fail" result (which might take days to arrive). Instead, it uses a 24-point checklist called Critic Rubrics.

These rubrics are like a director's notes on behavior, not just the final product. For example, the Critic looks for:

  • "Did the robot misunderstand the user's intent?"
  • "Did the robot skip testing its own work?"
  • "Did the robot get stuck in a loop, making the same mistake over and over?"
  • "Did the user sound frustrated?"
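In code, a rubric like this is just a fixed set of behavioral flags the Critic raises per segment. The sketch below uses the four example items from the text (the paper's full checklist has 24 items, which are not all listed here), and the scoring heuristic is purely illustrative.

```python
# Hypothetical rubric items, named after the examples in the text.
# The real checklist has 24 behavioral features.
RUBRIC_ITEMS = [
    "misunderstood_intent",
    "insufficient_testing",
    "loop_behavior",
    "user_frustration",
]

def score_segment(flags):
    """Collapse per-item boolean flags into a quality score in [0, 1].
    Fewer flagged bad behaviors -> higher score (illustrative heuristic)."""
    bad = sum(bool(flags.get(item, False)) for item in RUBRIC_ITEMS)
    return 1.0 - bad / len(RUBRIC_ITEMS)
```

A segment with no flagged behaviors scores 1.0; each flagged item pulls the score down, giving a dense per-scene signal even before any final outcome is known.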

Why is this cool?
Even if we don't know if the final project was a success, we can look at the behavior in the scene and say, "Hey, the robot skipped testing. That's a bad habit."
This gives us dense feedback. We can critique every single scene, not just the ones where we happen to know the final outcome.

3. The Training: Learning from Sparse Clues

The Critic is trained using two types of signals:

  1. The Sparse Clue (The "Real World" Signal): Sometimes, we know the final result (e.g., "The code was merged into the company's main system"). This is rare (only 4% of the time), but it's the "truth."
  2. The Dense Clue (The "Rubric" Signal): We use the 24-point checklist to critique every single interaction.

The Critic learns to connect the dots: "Ah, when the robot shows 'Insufficient Testing' and 'Misunderstood Intention' on the checklist, it usually leads to a failed project later."

By learning these patterns, the Critic becomes an expert at predicting success, even when it hasn't seen the final result yet.
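The two-signal training recipe can be sketched as a combined loss: a dense term that supervises the rubric predictions on every segment, plus a sparse term that supervises the final-outcome prediction only when a real-world label (like "merged") exists. The exact loss functions and weighting are assumptions here, not the paper's formulation.

```python
import math

def combined_loss(rubric_pred, rubric_true, outcome_pred, outcome_true, w=1.0):
    """Illustrative mix of dense and sparse supervision.

    rubric_pred/rubric_true: per-item rubric scores (dense, always available).
    outcome_pred: predicted success probability in (0, 1).
    outcome_true: 1/0 when the real-world outcome is known, else None
                  (known only ~4% of the time, per the text).
    """
    # Dense term: squared error on each rubric prediction.
    dense = sum((p - t) ** 2 for p, t in zip(rubric_pred, rubric_true))

    # Sparse term: cross-entropy on the outcome, only when labeled.
    sparse = 0.0
    if outcome_true is not None:
        p = min(max(outcome_pred, 1e-7), 1 - 1e-7)  # clamp for log safety
        sparse = -(outcome_true * math.log(p)
                   + (1 - outcome_true) * math.log(1 - p))
    return dense + w * sparse
```

Because the dense term fires on every segment, the model gets gradient signal everywhere; the rare outcome labels then anchor which rubric patterns actually predict real-world success.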

What Can This Critic Do?

Once trained, this Critic becomes a superpower for the robot in three ways:

  1. The "Best of 8" Filter (Inference-Time Scaling):
    Imagine the robot tries to solve a problem 8 different ways. The Critic watches all 8 attempts and says, "Attempt #3 looks promising; Attempt #7 is going to fail because the robot is ignoring instructions."

    • Result: We pick the best attempt immediately. This improved success rates by 15.9% in their tests.
  2. The "Stop Button" (Early Stopping):
    If the Critic sees the robot making a mistake (like "Loop Behavior" or "Risky Actions"), it can say, "Stop! This path is doomed."

    • Result: We stop wasting compute on doomed attempts, cutting computing costs by 83% in their tests.
  3. The "Curator" (Training Data Selection):
    When we want to train the robot further, we don't just feed it random conversations. We ask the Critic: "Which conversations were actually good examples?"

    • Result: The robot learns from the best examples, not just random noise.
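The first two uses above reduce to a few lines once a critic scoring function exists. This sketch assumes `critic` is any callable returning a quality score; the function names, threshold, and scoring are all hypothetical, shown only to make the reranking and early-stopping mechanics concrete.

```python
def pick_best(attempts, critic):
    """Best-of-N reranking: score every completed attempt with the
    critic and keep the highest-scoring one."""
    return max(attempts, key=critic)

def run_with_early_stop(steps, critic_score, threshold=0.3):
    """Early stopping: abandon a rollout once the critic's running
    score for the steps so far drops below a threshold.
    Returns (steps actually executed, whether the rollout finished)."""
    executed = []
    for step in steps:
        executed.append(step)
        if critic_score(executed) < threshold:
            return executed, False  # aborted early; compute saved
    return executed, True           # ran to completion
```

For data curation, the same `critic` score can rank logged conversations so that only the top-scoring ones are kept as training examples.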

The Big Takeaway

The paper solves a major problem: How do you teach a robot when you don't have a clear answer key?

By creating a Critic that watches for bad habits (the Rubrics) in real-time, we can turn messy, real-world interactions into a structured learning experience. It's like having a coach who doesn't just wait for the game to end to say "Good job," but stops the play every 5 seconds to say, "You're holding the ball wrong," ensuring the team improves with every single practice.

In short: They taught a robot to judge its own behavior using a detailed checklist, allowing it to learn from real-world chaos just as well as it learns from perfect textbook tests.
