Imagine you are a master chef trying to teach a robot how to cook. You give the robot a recipe: "Turn this plain steak into a juicy, medium-rare masterpiece with a side of asparagus."
Sometimes, the robot does a great job. Sometimes, it burns the steak to a crisp, or it accidentally turns the asparagus into a pile of rocks. To get better, the robot needs a taste-tester (a reward model) to tell it, "Good job!" or "Try again."
The problem is, most of the taste-testers available today fall into one of three camps:
- Too robotic: They check if the steak is the right temperature but don't care if it tastes good.
- Too biased: They are trained by other robots, so they just copy each other's mistakes.
- Too expensive: The best human taste-testers are rare and slow.
This paper introduces EDITREWARD, a new, super-smart taste-tester specifically designed for image editing. Here is how it works, broken down into simple concepts:
1. The "Taste-Test" Dataset (EDITREWARD-DATA)
Before building the new taste-tester, the authors needed a massive library of "good" and "bad" examples.
- The Old Way: They usually asked random people on the internet to grade images. This is like asking a crowd of people to judge a Michelin-star meal; some might like it, some might hate it, and many might just be guessing. The data is "noisy."
- The New Way: The authors hired expert chefs (trained human annotators) to look at thousands of image edits.
- The Twist: They didn't just give a single score (like "5 out of 10"). They gave two separate scores:
- Did it follow the instructions? (Did it actually change the steak to medium-rare?)
- Is it pretty? (Does it look delicious and realistic, or like a plastic toy?)
- The Result: They created a library of 200,000 carefully graded examples. This is the "Gold Standard" dataset.
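The two-score idea is the key design choice. As a sketch, each entry in the dataset might look something like the record below. The field names and score ranges here are illustrative guesses, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of one EDITREWARD-DATA record.
# Field names and the 1-5 scale are assumptions for illustration.
@dataclass
class EditAnnotation:
    source_image: str         # path to the original image
    instruction: str          # the edit request
    edited_image: str         # path to the edited result
    instruction_score: float  # "Did it follow the instructions?" (1-5)
    aesthetic_score: float    # "Is it pretty?" (1-5)

record = EditAnnotation(
    source_image="steak_raw.png",
    instruction="Cook the steak to medium-rare and add asparagus.",
    edited_image="steak_edit.png",
    instruction_score=5.0,  # nailed the instruction...
    aesthetic_score=3.0,    # ...but looks a little plasticky
)
print(record.instruction_score, record.aesthetic_score)
```

Keeping the two axes separate is what lets the model later reason about trade-offs instead of collapsing everything into one fuzzy number.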
2. The New Taste-Tester (EDITREWARD Model)
Using this high-quality library, they trained a new AI model called EDITREWARD.
- How it thinks: Instead of just guessing, it looks at the original image, the instruction, and the result. It asks itself two questions: "Did the robot listen?" and "Does it look good?"
- Handling Confusion: Sometimes, an image is perfect at following instructions but looks a bit weird, or looks great but missed a detail. Old models get confused by this. EDITREWARD is smart enough to say, "Okay, it's a trade-off," and give a nuanced score. It even learns from "ties" (when two images are equally good) by analyzing why they are good in different ways.
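To make the trade-off and tie-handling ideas concrete, here is a toy sketch. The real EDITREWARD learns how to weigh the two axes; this version just uses a fixed blend weight and a tie margin, both of which are made-up parameters:

```python
def overall_score(instruction_score, aesthetic_score, w_instr=0.5):
    # Hypothetical: blend the two axes with a fixed weight.
    # The actual model learns this trade-off from data.
    return w_instr * instruction_score + (1 - w_instr) * aesthetic_score

def judge(score_a, score_b, tie_margin=0.1):
    # Treat near-equal rewards as a tie instead of forcing a winner,
    # mirroring how EDITREWARD learns from tied comparisons.
    if abs(score_a - score_b) < tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"

# Image A follows instructions perfectly but looks odd;
# image B looks great but missed a detail.
a = overall_score(5.0, 3.0)  # 4.0
b = overall_score(3.0, 5.0)  # 4.0
print(judge(a, b))           # → "tie"
```

An older, single-score model would be forced to pick a winner here; acknowledging the tie is exactly the "nuanced score" behavior described above.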
3. The "Taste-Test" Challenge (EDITREWARD-BENCH)
To prove their new taste-tester is the best, they created a new, harder exam called EDITREWARD-BENCH.
- The Exam: Instead of just comparing two images (A vs. B), they sometimes show three or four images at once and ask the model to rank them perfectly.
- The Score: EDITREWARD beat the current champions (like GPT-4o and GPT-5) on this exam. It aligns much better with what actual humans think is a "good" edit.
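Grading multi-way comparisons is stricter than grading pairs: with four candidates, the model only gets credit if its entire ranking matches the human one. A minimal sketch of such a metric (my illustrative version, not necessarily the benchmark's exact scoring rule):

```python
def perfect_ranking_accuracy(model_scores, human_ranks):
    """Fraction of test cases where ranking all candidates by model
    score exactly reproduces the human ranking (hypothetical metric
    in the spirit of EDITREWARD-BENCH's multi-way comparisons)."""
    correct = 0
    for scores, gold in zip(model_scores, human_ranks):
        # Order candidate indices from best to worst by model score.
        predicted = sorted(range(len(scores)), key=lambda i: -scores[i])
        correct += predicted == gold
    return correct / len(model_scores)

# Two toy test cases, each with three candidate edits.
scores = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.3]]
gold   = [[0, 2, 1],       [1, 0, 2]]  # human order, best first
print(perfect_ranking_accuracy(scores, gold))  # → 0.5
```

Note how the second case fails even though the model picked the right winner: the full ordering has to be correct, which is what makes the exam "harder."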
4. The Real-World Test: Cleaning Up a Messy Kitchen
The most exciting part of the paper is what they did with this new taste-tester.
- The Problem: There is a huge, messy dataset of image edits called "ShareGPT-4o-Image." It has 46,000 examples, but many are garbage (bad edits, wrong instructions). Training a robot on this mess teaches it bad habits.
- The Solution: They used EDITREWARD to act as a filter. It looked at all 46,000 examples and picked out only the top 20,000 high-quality ones.
- The Result: They took a robot (Step1X-Edit) and trained it only on those 20,000 clean examples.
- Before: The robot was okay (Score: 6.4/10).
- After: The robot became amazing (Score: 7.1/10).
- The Lesson: It's better to train on 20,000 perfect examples than 46,000 messy ones. EDITREWARD successfully cleaned the kitchen so the robot could learn properly.
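The filtering step itself is conceptually simple: score every example with the reward model, then keep only the top-scoring ones. Here is a minimal sketch, where `reward_fn` stands in for a call to EDITREWARD (the real pipeline would run the actual model over images):

```python
def filter_top_k(examples, reward_fn, k=20_000):
    # Score every example with the reward model, keep the top k.
    scored = [(reward_fn(ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]

# Toy stand-in: each example carries a precomputed reward;
# real usage would invoke EDITREWARD on (source, instruction, edit).
examples = [{"id": i, "reward": (i * 7) % 10} for i in range(10)]
clean = filter_top_k(examples, lambda ex: ex["reward"], k=4)
print([ex["id"] for ex in clean])  # → [7, 4, 1, 8]
```

With 46,000 examples this is a single scoring pass plus a sort, so the expensive part is the reward model inference, not the selection itself.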
Summary
Think of EDITREWARD as a super-strict, highly trained art critic who has read every book on image editing.
- It learned from 200,000 carefully graded examples.
- It is better at judging art than the current "famous" AI critics.
- Most importantly, it can clean up messy training data, helping open-source image editing tools catch up to the expensive, closed-source giants (like the ones from OpenAI or Google).
The authors are releasing this "critic," the "library of graded art," and the "exam" to the public, hoping to help everyone build better image-editing robots in the future.