Visual-ERM: Reward Modeling for Visual Equivalence

This paper introduces Visual-ERM, a multimodal generative reward model that evaluates vision-to-code tasks by directly assessing fine-grained visual discrepancies in the rendered space. By overcoming the limitations of existing reward signals, it significantly improves reinforcement learning performance and test-time scaling across chart, table, and SVG reconstruction tasks.

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

Published 2026-03-16

Imagine you are an architect who has just finished drawing a beautiful blueprint for a house. You hand this blueprint to a robot builder, and the robot starts constructing the house.

In the past, if you wanted to check if the robot did a good job, you might have asked it, "Did you follow the instructions?" and checked the text of the instructions against the robot's notes. Or, you might have taken a quick photo of the finished house and compared it to a photo of your blueprint, looking for obvious differences like "Is there a roof?"

The Problem:
The old methods were flawed.

  1. The "Text Check" was blind: The robot could write perfect instructions but build a crooked wall. The text looked right, but the house was wrong.
  2. The "Quick Photo Check" was too fuzzy: If the robot built the house with the wrong color bricks or a slightly tilted window, a quick photo comparison might say, "Looks 99% the same!" because the general shape was there. It missed the tiny, critical details that make a house livable.

This paper introduces a new solution called Visual-ERM.

The New Solution: The "Expert Inspector"

Think of Visual-ERM not as a robot, but as a super-smart, hyper-observant building inspector.

Here is how it works, step-by-step:

1. The "Render and Compare" Game

Instead of just reading the robot's notes, Visual-ERM takes the robot's output (the code) and builds the house (renders the image) right in front of its eyes. Then, it places the robot's house right next to the original blueprint.
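Conceptually, this step can be sketched in a few lines of Python (a minimal sketch; `render_code` and the judge-input format are hypothetical stand-ins, not the paper's actual pipeline):

```python
# Minimal sketch of "render and compare" (hypothetical helpers, not the
# paper's actual API): execute the model's code, then hand both images
# to the judge, instead of comparing source text.

def render_code(code: str) -> str:
    # Stand-in renderer: in practice this would run e.g. a chart library
    # or an SVG rasterizer and return an image. Here we just tag the code.
    return f"<rendered:{code}>"

def build_judge_input(reference_image: str, candidate_code: str) -> dict:
    """Pair the reference image with the *rendered* candidate, so the
    reward model judges pixels, not code text."""
    return {
        "reference": reference_image,
        "candidate": render_code(candidate_code),
    }

pair = build_judge_input("<rendered:blueprint>", "draw_house()")
print(pair["candidate"])  # the judge sees the rendered result
```

The key design point is that the comparison happens entirely in image space: the candidate code is only ever seen by the judge after rendering.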

2. The "Detective" Mode

While old inspectors might just say "Good job" or "Bad job," Visual-ERM acts like a detective. It zooms in and finds exactly what is wrong.

  • Old Inspector: "The house looks okay."
  • Visual-ERM: "Wait! The front door is on the wrong side (Structure Error), the paint on the chimney is the wrong shade (Style Error), and the number on the mailbox is a '6' instead of a '9' (Data Error)."

It doesn't just give a score; it gives a detailed report with a severity rating (Minor, Moderate, or Critical).
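The report described above can be imagined as structured data that is then collapsed into a scalar reward (the field names and integer severity weights below are illustrative assumptions, not the paper's actual schema):

```python
# Sketch of an itemized discrepancy report with severity ratings
# (illustrative field names and weights, not the paper's schema).
from dataclasses import dataclass

SEVERITY_POINTS = {"minor": 1, "moderate": 3, "critical": 6}

@dataclass
class VisualDiscrepancy:
    category: str   # "structure" | "style" | "data"
    severity: str   # "minor" | "moderate" | "critical"
    detail: str

def score(report: list[VisualDiscrepancy]) -> float:
    """Collapse the itemized report into a scalar reward in [0, 1]."""
    penalty = sum(SEVERITY_POINTS[d.severity] for d in report)
    return max(0, 10 - penalty) / 10

report = [
    VisualDiscrepancy("structure", "critical", "front door on wrong side"),
    VisualDiscrepancy("style", "minor", "chimney paint is the wrong shade"),
    VisualDiscrepancy("data", "moderate", "mailbox shows a 6 instead of a 9"),
]
print(score(report))  # critical + minor + moderate errors wipe out the reward
```

An empty report scores 1.0; each flaw subtracts points in proportion to its severity, so the model is penalized more for a wrong data value than for a slightly-off shade.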

3. Teaching the Robot (Reinforcement Learning)

This is the magic part. In the past, robots learned by guessing and getting a vague "Good" or "Bad" signal. Now, Visual-ERM acts as a strict but helpful coach.

  • The robot tries to build the house.
  • Visual-ERM inspects it and says, "You got the roof right, but you painted the windows blue instead of green. Fix that."
  • The robot tries again, using that specific feedback.
  • Over time, the robot learns to build houses that are not just "mostly right," but perfectly identical to the blueprint.
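The loop above can be sketched as a toy program (the "robot", the inspector, and the update rule are all hypothetical stand-ins; in the paper, Visual-ERM's reward drives a standard reinforcement-learning pipeline rather than direct edits):

```python
# Toy sketch of the feedback loop: inspect, get itemized feedback, fix,
# repeat until the build matches the blueprint. (Illustrative stand-in
# for RL training, not the paper's actual optimizer.)

TARGET = {"roof": "red", "windows": "green"}  # the "blueprint"

def inspect(house: dict) -> list[str]:
    """The inspector names *which* parts are wrong, not just a score."""
    return [part for part, color in TARGET.items() if house.get(part) != color]

def fix(house: dict, wrong_parts: list[str]) -> dict:
    """The 'policy update': correct exactly the parts the feedback names."""
    return {**house, **{part: TARGET[part] for part in wrong_parts}}

house = {"roof": "red", "windows": "blue"}  # first attempt
for _ in range(3):                          # training iterations
    wrong = inspect(house)
    if not wrong:
        break
    house = fix(house, wrong)

print(inspect(house))  # [] -- the house now matches the blueprint
```

Contrast this with a scalar-only signal: if `inspect` returned just `0.5`, the robot would know it was wrong but not where, and convergence would be far slower.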

Why is this a big deal?

Imagine you are trying to teach a student to draw a map.

  • Before: You told them, "Your map looks 90% like the real one." They didn't know what was wrong. They kept making the same mistakes.
  • Now (Visual-ERM): You say, "The river is too wide, the mountain is in the wrong spot, and you forgot the compass." The student knows exactly what to fix.

The paper shows that when they used this "Expert Inspector" to train AI models:

  • The models got much better at turning charts into code.
  • They got better at reconstructing tables.
  • They got better at turning drawings (SVGs) into code.

The "Reward Hacking" Trap

The paper also mentions a funny problem called "Reward Hacking."
Imagine a student trying to cheat on a test. If the teacher only checks whether the student wrote some words, the student might write gibberish just to get a passing grade.

  • Old AI: The robot would "cheat" by making code that looked right to a simple computer but was actually broken.
  • Visual-ERM: Because it is so good at spotting tiny visual details (like a missing pixel or a wrong color), the robot can't cheat. It has to build the house correctly to pass the inspection.
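A toy contrast shows why judging in rendered space is harder to hack (everything here is illustrative: a text-level reward can be fooled by code that looks similar but renders differently, while a reward computed on the rendered output is not):

```python
# Illustrative contrast: a naive text-similarity reward vs. a
# rendered-space reward (toy functions, not the paper's metrics).

def render(code: str) -> str:
    # Toy renderer: only code that actually calls draw_roof() draws a roof.
    return "roof" if "draw_roof()" in code else "no roof"

def text_reward(code: str, reference_code: str) -> float:
    # Naive: fraction of the reference's tokens also in the code -- hackable.
    a, b = set(code.split()), set(reference_code.split())
    return len(a & b) / len(b)

def visual_reward(code: str, reference_image: str) -> float:
    # Rendered-space check: reward only if the rendered output matches.
    return 1.0 if render(code) == reference_image else 0.0

reference = "house = draw_roof()"
hack = "house = draw_roof ( )"          # looks similar, renders nothing
print(text_reward(hack, reference))     # nonzero: the hack partially fools it
print(visual_reward(hack, "roof"))      # 0.0: the rendered check is not fooled
```

Because the reward is computed from what the code actually produces, superficially plausible but broken code earns nothing.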

Summary

Visual-ERM is a new tool that teaches AI to be a better artist and builder. Instead of just checking if the "idea" is right, it checks if the final picture is perfect. It acts like a strict, detailed, and fair teacher that helps AI learn by showing it exactly where it went wrong, leading to much higher quality results in turning images into code.
