Imagine you are teaching a robot artist how to dress a virtual mannequin. You give the robot a photo of a person and a photo of a shirt, and you ask it to put the shirt on the person.
Sometimes, the robot does a great job. Sometimes, it makes a mess: the sleeves are the wrong length, the pattern is blurry, or the person's face disappears.
For years, researchers tried to teach robots by showing them the "Perfect Answer." They would say, "Here is the ideal photo of the person wearing the shirt. If your photo looks like this, you get a gold star." This works great for math or coding, where there is only one right answer.
But in the real world (like fashion), there isn't just one "perfect" photo. The shirt could drape loosely or tightly, and the light could fall a hundred different ways. There are a million ways to do it right. So, trying to find the "Perfect Answer" to teach the robot is like trying to teach someone to swim by showing them a single, specific wave. It's too limiting.
This paper introduces a smarter way to teach the robot: Don't ask what's right; ask what's wrong.
The Core Idea: Counting Mistakes Instead of Checking Boxes
The authors call their method Implicit Error Counting (IEC). Here is how it works, using a simple analogy:
1. The Old Way: The "Rubric" (The Checklist)
Imagine a teacher grading a student's essay.
- The Problem: The teacher tries to write a checklist based on a "perfect essay" that doesn't exist. They write: "Must have 5 paragraphs," "Must use the word 'however'."
- The Result: The student writes a beautiful essay with 4 paragraphs and uses "nevertheless" instead. The teacher marks it down because it didn't follow the checklist, even though the essay is great.
- In the Paper: This is called Rubrics as Rewards (RaR). It fails in virtual try-on because the "perfect outfit" varies too much. The checklist becomes too generic or too strict.
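The checklist failure can be sketched in a few lines of Python. (The rubric items and essay "features" below are invented purely for illustration; they are not from the paper.)

```python
# Hypothetical rigid rubric reward: checklist items and essay
# "features" are made up for illustration.
RUBRIC = ["has_5_paragraphs", "uses_however"]

def rubric_reward(features):
    """Return the fraction of checklist items the output satisfies."""
    return sum(item in features for item in RUBRIC) / len(RUBRIC)

# A genuinely good essay that deviates superficially from the
# checklist scores zero -- the rubric can't see past its own items.
great_essay = {"has_4_paragraphs", "uses_nevertheless", "clear_argument"}
print(rubric_reward(great_essay))  # 0.0
```

The point of the sketch: the reward measures conformance to the checklist, not quality, so any good output the checklist's author didn't anticipate is punished.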
2. The Failed New Way: "Explicit Error Counting" (The Loud Critic)
The authors tried a different approach: "Just list every single mistake you see!"
- The Problem: Imagine a critic who is very moody.
- Monday: You wear a red shirt. The critic says, "No mistakes!" (Score: 10/10).
- Tuesday: You wear the exact same red shirt. The critic says, "The red is slightly too bright. That's a mistake." (Score: 8/10).
- The Result: The robot gets confused. It thinks the shirt changed, but it didn't. The critic's mood swings (or "noise") make the robot's training unstable. It's like trying to learn to walk while gravity keeps changing.
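The moody critic is, at bottom, reward noise. A tiny sketch (the quality value and noise scale are invented, not from the paper) shows the same image receiving different scores on different runs:

```python
import random

random.seed(0)  # for reproducibility of this demo

def moody_critic(true_quality, noise=1.0):
    """Explicit error counting: the list of 'mistakes' the critic
    names varies run to run, so the score carries random noise
    unrelated to the image itself."""
    return true_quality + random.gauss(0, noise)

same_image = 8.0  # the image never changes between runs
scores = [moody_critic(same_image) for _ in range(5)]
spread = max(scores) - min(scores)
print(f"same image, but the score spread is {spread:.2f}")
```

The training signal the robot receives differs between Monday and Tuesday even though its output did not, which is exactly what destabilizes learning.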
3. The Winning Way: "Implicit Error Counting" (The Wise Judge)
This is the paper's solution. Instead of asking the critic to list the mistakes, they ask the critic to feel the mistakes and give a score.
- The Analogy: Imagine a judge at a gymnastics competition. They don't need to shout out every deduction: "Half a point off for the left foot, and your arm was 2 degrees off!"
- How it works: The judge looks at the performance, counts the errors in their head, weighs how bad they are (a missing sleeve is a huge error; a tiny wrinkle is a small error), and simply gives a score: "8.5 out of 10."
- Why it wins: The judge is consistent. Even if they describe the error differently in their head, the final score is stable. The robot learns: "Okay, I need to get closer to 10.0."
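As a sketch, the judge's behavior is a weighted error count collapsed into one scalar. (The error types and weights below are invented for illustration; in the paper the judge is a learned model, not a lookup table.)

```python
# Hypothetical error weights: a structural error costs far more
# than a cosmetic one. These numbers are illustrative only.
ERROR_WEIGHTS = {
    "missing_sleeve": 3.0,   # huge structural error
    "blurry_pattern": 1.5,   # moderate fidelity error
    "tiny_wrinkle":   0.2,   # minor cosmetic error
}

def implicit_score(observed_errors, max_score=10.0):
    """Weigh the errors 'in the judge's head' and emit only a score.

    The error list is never exposed to the learner; only the final
    scalar is, which keeps the reward stable even if the judge's
    internal descriptions of an error vary between evaluations.
    """
    penalty = sum(ERROR_WEIGHTS.get(e, 1.0) for e in observed_errors)
    return max(0.0, max_score - penalty)

print(implicit_score([]))  # 10.0
print(implicit_score(["missing_sleeve", "tiny_wrinkle"]))
```

The learner only ever sees "8.5 out of 10"-style numbers, so its objective is simply to push that number toward the maximum.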
The Secret Sauce: Group Calibration
There's one more trick. Sometimes, a task is just harder than others.
- Scenario: It's easier to put a shirt on a mannequin standing still than on a mannequin doing a backflip.
- The Fix: The paper uses Group Calibration. Imagine the judge grades 12 robots at once. Instead of giving everyone an absolute score, the judge says, "Out of this group of 12, Robot A is the best, Robot B is the worst."
- This ensures the robot isn't punished just because the task was hard; it's only punished if it did worse than its peers.
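A minimal sketch of that idea, assuming a simple mean-and-spread normalization within the group (the paper's exact calibration formula may differ):

```python
from statistics import mean, stdev

def calibrate(scores):
    """Turn absolute judge scores into relative advantages within a group.

    Each attempt is rewarded only for beating its peers on the SAME
    task, so a uniformly hard task does not punish every attempt.
    """
    mu = mean(scores)
    sigma = stdev(scores) or 1.0  # avoid dividing by zero if all tie
    return [(s - mu) / sigma for s in scores]

# Twelve attempts at the same hard task: absolute scores are all
# mediocre, but the best attempt still earns a positive advantage.
scores = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 3.5, 5.5, 2.5, 4.0, 3.0]
advantages = calibrate(scores)
print(max(advantages) > 0 and min(advantages) < 0)  # True
```

After calibration, "everyone scored low because the pose was a backflip" washes out: only relative quality within the group survives as a training signal.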
The Results: Why This Matters
The authors tested this on a new, very difficult dataset called MDressBench. They specifically picked pairs of clothes that were totally different (e.g., a short-sleeve shirt vs. a long-sleeve dress) to make the task hard.
- The Outcome: The "Mistake Counter" (IEC) beat every other method.
- The Metaphor: If the other methods were like a student trying to memorize a textbook, the IEC method was like a student who learned by playing the game, failing, and fixing their own mistakes.
- Efficiency: It was also twice as fast because it didn't need to generate a long checklist first; it just gave the score.
Summary
When you can't define what "perfect" looks like, stop trying to find it. Instead, define what "bad" looks like, count the bad things, and use that to guide improvement.
The Paper's Mantra: "When you can't define what an ideal output looks like, define what a bad one looks like, and count."