Visual-ERM: Reward Modeling for Visual Equivalence

This paper introduces Visual-ERM, a multimodal generative reward model that evaluates vision-to-code tasks by directly assessing fine-grained visual discrepancies in the rendered space. By overcoming the limitations of existing reward signals, it significantly improves reinforcement learning performance and test-time scaling across chart, table, and SVG reconstruction tasks.

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

Published 2026-03-16

Imagine you are an architect who has just finished drawing a beautiful blueprint for a house. You hand this blueprint to a robot builder, and the robot starts constructing the house.

In the past, if you wanted to check if the robot did a good job, you might have asked it, "Did you follow the instructions?" and checked the text of the instructions against the robot's notes. Or, you might have taken a quick photo of the finished house and compared it to a photo of your blueprint, looking for obvious differences like "Is there a roof?"

The Problem:
The old methods were flawed.

  1. The "Text Check" was blind: The robot could write perfect instructions but build a crooked wall. The text looked right, but the house was wrong.
  2. The "Quick Photo Check" was too fuzzy: If the robot built the house with the wrong color bricks or a slightly tilted window, a quick photo comparison might say, "Looks 99% the same!" because the general shape was there. It missed the tiny, critical details that make a house livable.

This paper introduces a new solution called Visual-ERM.

The New Solution: The "Expert Inspector"

Think of Visual-ERM not as a robot, but as a super-smart, hyper-observant building inspector.

Here is how it works, step-by-step:

1. The "Render and Compare" Game

Instead of just reading the robot's notes, Visual-ERM takes the robot's output (the code) and builds the house (renders the image) right in front of its eyes. Then, it places the robot's house right next to the original blueprint.
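Conceptually, this step can be sketched in a few lines of Python (a minimal sketch; `render_code` and the judge-input format are hypothetical stand-ins, not the paper's actual pipeline):

```python
# Minimal sketch of "render and compare" (hypothetical helpers, not the
# paper's actual API): execute the model's code, then hand both images
# to the judge, instead of comparing source text.

def render_code(code: str) -> str:
    # Stand-in renderer: in practice this would run e.g. a chart library
    # or an SVG rasterizer and return an image. Here we just tag the code.
    return f"<rendered:{code}>"

def build_judge_input(reference_image: str, candidate_code: str) -> dict:
    """Pair the reference image with the *rendered* candidate, so the
    reward model judges pixels, not code text."""
    return {
        "reference": reference_image,
        "candidate": render_code(candidate_code),
    }

pair = build_judge_input("<rendered:blueprint>", "draw_house()")
print(pair["candidate"])  # the judge sees the rendered result
```

The key design point is that the comparison happens entirely in image space: the candidate code is only ever seen by the judge after rendering.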

2. The "Detective" Mode

While old inspectors might just say "Good job" or "Bad job," Visual-ERM acts like a detective. It zooms in and finds exactly what is wrong.

  • Old Inspector: "The house looks okay."
  • Visual-ERM: "Wait! The front door is on the wrong side (Structure Error), the paint on the chimney is the wrong shade (Style Error), and the number on the mailbox is a '6' instead of a '9' (Data Error)."

It doesn't just give a score; it gives a detailed report with a severity rating (Minor, Moderate, or Critical).
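The report described above can be imagined as structured data that is then collapsed into a scalar reward (the field names and integer severity weights below are illustrative assumptions, not the paper's actual schema):

```python
# Sketch of an itemized discrepancy report with severity ratings
# (illustrative field names and weights, not the paper's schema).
from dataclasses import dataclass

SEVERITY_POINTS = {"minor": 1, "moderate": 3, "critical": 6}

@dataclass
class VisualDiscrepancy:
    category: str   # "structure" | "style" | "data"
    severity: str   # "minor" | "moderate" | "critical"
    detail: str

def score(report: list[VisualDiscrepancy]) -> float:
    """Collapse the itemized report into a scalar reward in [0, 1]."""
    penalty = sum(SEVERITY_POINTS[d.severity] for d in report)
    return max(0, 10 - penalty) / 10

report = [
    VisualDiscrepancy("structure", "critical", "front door on wrong side"),
    VisualDiscrepancy("style", "minor", "chimney paint is the wrong shade"),
    VisualDiscrepancy("data", "moderate", "mailbox shows a 6 instead of a 9"),
]
print(score(report))  # critical + minor + moderate errors wipe out the reward
```

An empty report scores 1.0; each flaw subtracts points in proportion to its severity, so the model is penalized more for a wrong data value than for a slightly-off shade.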

3. Teaching the Robot (Reinforcement Learning)

This is the magic part. In the past, robots learned by guessing and getting a vague "Good" or "Bad" signal. Now, Visual-ERM acts as a strict but helpful coach.

  • The robot tries to build the house.
  • Visual-ERM inspects it and says, "You got the roof right, but you painted the windows blue instead of green. Fix that."
  • The robot tries again, using that specific feedback.
  • Over time, the robot learns to build houses that are not just "mostly right," but perfectly identical to the blueprint.
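The loop above can be sketched as a toy program (the "robot", the inspector, and the update rule are all hypothetical stand-ins; in the paper, Visual-ERM's reward drives a standard reinforcement-learning pipeline rather than direct edits):

```python
# Toy sketch of the feedback loop: inspect, get itemized feedback, fix,
# repeat until the build matches the blueprint. (Illustrative stand-in
# for RL training, not the paper's actual optimizer.)

TARGET = {"roof": "red", "windows": "green"}  # the "blueprint"

def inspect(house: dict) -> list[str]:
    """The inspector names *which* parts are wrong, not just a score."""
    return [part for part, color in TARGET.items() if house.get(part) != color]

def fix(house: dict, wrong_parts: list[str]) -> dict:
    """The 'policy update': correct exactly the parts the feedback names."""
    return {**house, **{part: TARGET[part] for part in wrong_parts}}

house = {"roof": "red", "windows": "blue"}  # first attempt
for _ in range(3):                          # training iterations
    wrong = inspect(house)
    if not wrong:
        break
    house = fix(house, wrong)

print(inspect(house))  # [] -- the house now matches the blueprint
```

Contrast this with a scalar-only signal: if `inspect` returned just `0.5`, the robot would know it was wrong but not where, and convergence would be far slower.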

Why is this a big deal?

Imagine you are trying to teach a student to draw a map.

  • Before: You told them, "Your map looks 90% like the real one." They didn't know what was wrong. They kept making the same mistakes.
  • Now (Visual-ERM): You say, "The river is too wide, the mountain is in the wrong spot, and you forgot the compass." The student knows exactly what to fix.

The paper shows that when they used this "Expert Inspector" to train AI models:

  • The models got much better at turning charts into code.
  • They got better at reconstructing tables.
  • They got better at turning drawings (SVGs) into code.

The "Reward Hacking" Trap

The paper also mentions a funny problem called "Reward Hacking."
Imagine a student trying to cheat on a test. If the teacher only checks whether the student wrote some words, the student might write gibberish just to get a passing grade.

  • Old AI: The robot would "cheat" by making code that looked right to a simple computer but was actually broken.
  • Visual-ERM: Because it is so good at spotting tiny visual details (like a missing pixel or a wrong color), the robot can't cheat. It has to build the house correctly to pass the inspection.
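A toy contrast shows why judging in rendered space is harder to hack (everything here is illustrative: a text-level reward can be fooled by code that looks similar but renders differently, while a reward computed on the rendered output is not):

```python
# Illustrative contrast: a naive text-similarity reward vs. a
# rendered-space reward (toy functions, not the paper's metrics).

def render(code: str) -> str:
    # Toy renderer: only code that actually calls draw_roof() draws a roof.
    return "roof" if "draw_roof()" in code else "no roof"

def text_reward(code: str, reference_code: str) -> float:
    # Naive: fraction of the reference's tokens also in the code -- hackable.
    a, b = set(code.split()), set(reference_code.split())
    return len(a & b) / len(b)

def visual_reward(code: str, reference_image: str) -> float:
    # Rendered-space check: reward only if the rendered output matches.
    return 1.0 if render(code) == reference_image else 0.0

reference = "house = draw_roof()"
hack = "house = draw_roof ( )"          # looks similar, renders nothing
print(text_reward(hack, reference))     # nonzero: the hack partially fools it
print(visual_reward(hack, "roof"))      # 0.0: the rendered check is not fooled
```

Because the reward is computed from what the code actually produces, superficially plausible but broken code earns nothing.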

Summary

Visual-ERM is a new tool that teaches AI to be a better artist and builder. Instead of just checking if the "idea" is right, it checks if the final picture is perfect. It acts like a strict, detailed, and fair teacher that helps AI learn by showing it exactly where it went wrong, leading to much higher quality results in turning images into code.
