Imagine you are hiring a team of expert math tutors to solve a complex geometry problem based on a diagram. You ask them to write down their solution step-by-step so you can check their work.
In the past, the "judge" (an AI called a Process Reward Model) would just read the tutor's steps and give them a score. But here was the problem: The judge was bad at looking at the picture.
If the tutor made a mistake reading the diagram (e.g., "The circle has a radius of 5" when it doesn't), the judge might think the tutor was being clever and give them a high score. Or, if the tutor was right but the judge thought the radius was 3, the judge would unfairly punish the tutor. The judge couldn't tell the difference between a logic error (bad math) and a perception error (bad vision).
This paper introduces a new system called EVPV (Explicit Visual Premise Verification). Think of it as adding a specialized "Fact-Checker" assistant to the judging process.
Here is how it works, broken down into simple analogies:
1. The Problem: The "Blind Judge"
Imagine a game show where a contestant solves a puzzle based on a picture.
- The Old Way: The host (the AI Judge) reads the contestant's answer. If the contestant says, "The red ball is on the left," and the host thinks it's on the right, the host might mark the answer wrong. Or, if the contestant hallucinates a ball that isn't there, the host might accidentally agree because they are also confused by the picture.
- The Result: Good logic gets punished, and bad logic gets rewarded. The system is unreliable.
2. The Solution: The "Fact-Checker" (EVPV)
The authors created a system that separates seeing from thinking.
Step A: The "Checklist" (The Policy)
Before the tutor (the AI solving the problem) writes their math, they are forced to fill out a Visual Checklist.
- Analogy: Before solving the puzzle, the contestant must say: "I am looking at a red ball on the left, and a blue square on the right."
- This forces the AI to explicitly state, "I am basing my next math step on this specific visual fact."
Step B: The "Independent Auditor" (The Constraint Extractor)
While the contestant is writing their checklist, a separate, independent robot (the Constraint Extractor) looks at the picture and creates a Master Fact Sheet.
- Analogy: A second robot scans the image and writes down: "Fact: Red ball is at (x=10, y=5). Fact: Blue square is at (x=20, y=5)."
- Crucially, this robot doesn't care about the math; it only cares about what is actually in the picture.
Step C: The "Match-Up" (Verification)
Now, the system compares the Checklist against the Master Fact Sheet.
- Analogy: The host checks: "Did the contestant say the ball is on the left? Yes. Does the Master Fact Sheet say the ball is on the left? Yes." -> Match!
- Analogy: "Did the contestant say the ball is on the right? Yes. Does the Master Fact Sheet say it's on the left? No." -> Mismatch!
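The match-up above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function name, the dictionary format, and the match-fraction score are all assumptions made here for clarity.

```python
# Illustrative sketch of the "match-up" step: compare the solver's stated
# visual premises (the Checklist) against the independently extracted
# Master Fact Sheet. All names and formats here are hypothetical.

def verify_premises(checklist: dict, fact_sheet: dict) -> float:
    """Return the fraction of stated premises that agree with the fact sheet."""
    if not checklist:
        return 1.0  # nothing was claimed, so nothing can be contradicted
    matches = sum(
        1 for key, claimed in checklist.items()
        if fact_sheet.get(key) == claimed
    )
    return matches / len(checklist)

fact_sheet = {"red_ball": "left", "blue_square": "right"}

print(verify_premises({"red_ball": "left"}, fact_sheet))   # 1.0 -> Match!
print(verify_premises({"red_ball": "right"}, fact_sheet))  # 0.0 -> Mismatch!
```

The key design point, whatever the real implementation looks like, is that the fact sheet comes from a separate extraction pass, so the comparison is a simple lookup rather than another round of image reasoning.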
3. The "Traffic Light" System (Reliability Gating)
This is the magic part. The system uses the result of the Match-Up to decide how much to trust the math.
- Green Light (High Reliability): The checklist matches the facts perfectly. The system says, "Okay, the vision is clear. Now, let's judge the math logic strictly." If the math is wrong, it gets a bad score. If it's right, it gets a good score.
- Red Light (Low Reliability): The checklist contradicts the facts (e.g., the AI claimed to see a "cylindrical hole" that doesn't exist). The system says, "Wait, the vision is broken! I cannot trust the math that follows this."
- Instead of giving a harsh "Wrong" score (which might be unfair if the math was actually correct but built on a faulty observation), the system dampens the score. It essentially says, "I'm not sure if this is right or wrong, because the starting point was a hallucination."
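The traffic-light idea can be captured as a simple blend: trust the strict logic score in proportion to the visual reliability, and fall back toward a neutral "not sure" score when reliability is low. This is a hedged sketch; the neutral value and the linear blend are assumptions for illustration, not the paper's exact formula.

```python
# Hypothetical sketch of reliability gating. When the visual premises are
# unreliable, the step's reward is pulled toward a neutral midpoint rather
# than trusting the strict logic score. NEUTRAL and the linear blend are
# illustrative choices, not the paper's actual numbers.

NEUTRAL = 0.5  # the "I'm not sure" score

def gated_reward(logic_score: float, reliability: float) -> float:
    """Blend the strict logic score with a neutral score, weighted by reliability."""
    return reliability * logic_score + (1.0 - reliability) * NEUTRAL

print(gated_reward(0.9, 1.0))  # green light: clear vision, trust the math -> 0.9
print(gated_reward(0.9, 0.0))  # red light: vision broken, dampen to neutral -> 0.5
print(gated_reward(0.1, 1.0))  # clear vision, bad math: punish strictly -> 0.1
```

Note how the dampening only kicks in when reliability drops: with a green light, good math still wins and bad math still loses, exactly as described above.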
Why is this a big deal?
- It stops the "Blind Judge" from making mistakes. It prevents the system from punishing a student for a math error when they actually just misread the picture.
- It stops "Hallucination Rewards." It prevents the system from rewarding a student who makes up facts (like a "cylindrical hole") just because the math following it sounds smart.
- It's Fast. Unlike other methods that require the AI to stop and use a tool to check the picture every single step (which is slow and expensive), this system checks the facts once at the beginning and then uses a simple "traffic light" to adjust the scores as it goes.
The Bottom Line
Think of EVPV as a quality control manager who realizes that you can't judge a recipe if the chef is using the wrong ingredients.
If the chef says, "I'm adding sugar," but the manager sees the chef grabbing salt, the manager doesn't just say "Good job" or "Bad job." The manager says, "Stop! You are using the wrong ingredient. I can't judge your cooking until you fix the ingredients."
By fixing the "ingredients" (the visual facts) first, the system ensures that the final score reflects true logic, not just a lucky guess or a visual mistake. This makes AI much more reliable when solving complex problems that involve both pictures and math.