From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models

This paper addresses a deployment risk of generative OCR in vision-language models, the mismatch between semantic plausibility and visual verifiability, by proposing a model-agnostic Geometric Risk Controller that ensures reliable transcriptions through multi-view consensus and structural screening.

Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Chen Dai

Published 2026-03-23
📖 4 min read · ☕ Coffee break read

Imagine you have a very smart, well-read robot assistant (a Vision-Language Model) that is incredibly good at looking at pictures and describing what it sees. You ask it to read a sign in a photo, and it usually gets it right.

However, there's a catch. Because this robot is trained to be a "creative writer," it sometimes gets too confident. If the text in the photo is blurry or hard to read, the robot might just guess based on what sounds right, rather than what is actually there. It might say "The bank is open" when the sign actually says "The bank is closed," or it might invent a whole paragraph of text that isn't in the picture at all.

In the world of AI, this is called hallucination. For a creative writing bot, that's fine. But for a tool meant to read documents, receipts, or street signs, making things up is dangerous.

This paper introduces a new system called the Geometric Risk Controller (GRC). Think of it as a strict quality-control manager that stands between the creative robot and the user.

Here is how it works, using simple analogies:

1. The Problem: The "Confident Guessing" Robot

Imagine you are trying to read a smudged receipt.

  • The Old Way: You ask the robot once. It looks at the smudge, thinks, "Hmm, that looks like an 'O' followed by a 'K' because 'O' and 'K' are common," and confidently types "OK." It's wrong, but it sounds plausible.
  • The Risk: The robot prioritizes sounding right over being right.

2. The Solution: The "Panel of Judges" (Multi-View Probing)

Instead of asking the robot just once, the GRC asks it five times, but with a twist.

  • The Analogy: Imagine you are trying to identify a blurry face in a crowd. Instead of looking at one photo, you take five photos of the same person, but you shift the camera slightly left, right, zoom in a tiny bit, and zoom out a tiny bit.
  • The Process: The GRC takes the original image and creates 5 slightly different "views" of it (cropping it differently, shifting it slightly). It asks the robot to read the text from all 5 views.
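The multi-view step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes views are produced by small crop/shift perturbations and represents each view as a crop box `(left, top, right, bottom)` in pixels, which you would pass to your image library of choice before querying the model again.

```python
def view_boxes(width: int, height: int, shift: int = 4, n_views: int = 5):
    """Return crop boxes (left, top, right, bottom) for slightly shifted
    views of a width x height image. Offsets are illustrative choices."""
    offsets = [(0, 0), (shift, 0), (-shift, 0), (0, shift), (0, -shift)][:n_views]
    boxes = []
    for dx, dy in offsets:
        # Shift the crop window by (dx, dy), clamped to the image bounds.
        boxes.append((max(0, dx), max(0, dy),
                      width + min(0, dx), height + min(0, dy)))
    return boxes
```

Each box is then read by the model independently, yielding one candidate transcription per view.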

3. The "Reality Check" (Structural Screening)

Before the robot even tries to guess the words, the GRC checks the geometry.

  • The Analogy: If the sign in the photo is only 2 inches wide, and the robot tries to write a sentence that is 20 inches long, the GRC immediately stops it. It's like a bouncer at a club checking IDs: "You don't fit in this space; you can't get in."
  • This filters out obvious nonsense, like the robot inventing long paragraphs that don't fit the picture.
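The geometry check can be hedged as a simple length-versus-width test. The threshold and the per-character width below are made-up illustrative numbers, not values from the paper; the idea is just that a transcription whose estimated rendered width wildly exceeds the detected text region is rejected outright.

```python
def passes_screen(text: str, box_width_px: float,
                  avg_char_width_px: float = 12.0, slack: float = 1.5) -> bool:
    """Reject transcriptions that could not plausibly fit the text region.

    Estimates the rendered width of `text` and compares it against the
    region width, allowing a `slack` factor for narrow fonts or tight crops.
    """
    est_width = len(text) * avg_char_width_px
    return est_width <= slack * box_width_px
```

So "OPEN" in a 60-pixel-wide box passes, while a 50-character invented sentence in the same box is filtered out before it ever reaches the user.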

4. The "Consensus Vote" (Cross-View Agreement)

This is the most important part. The GRC looks at the 5 answers the robot gave.

  • Scenario A (The Good Case): The robot says "OPEN" in all 5 views. The GRC says, "Great, everyone agrees. This is safe to show the user."
  • Scenario B (The Bad Case): In 3 views, the robot says "OPEN," but in the other 2, it says "OPFN" and "OPEM." The GRC sees this disagreement. It thinks, "Wait, the robot is confused. It's not sure. I shouldn't show this to the user."
  • The Result: Instead of showing a wrong answer, the GRC says, "I abstain." It admits, "I don't know, and it's better to say nothing than to lie."

5. The "Dial" (Operating Points)

The system has a dial (a consensus threshold, called m) that the user can turn.

  • Loose Setting: "Let's be lenient. If 2 out of 5 judges agree, we'll show the answer." (High coverage, slightly more risk).
  • Strict Setting: "Let's be super strict. We need 5 out of 5 judges to agree perfectly." (Lower coverage, but almost zero risk of lying).

Why This Matters

Previously, if an AI model got 95% accuracy on a test, we thought it was safe to use. But that 5% failure rate might be the most dangerous part (e.g., reading a "Stop" sign as "Go").

This paper shows that by adding this Quality Control Layer, we can:

  1. Catch the lies: Stop the robot from making up text.
  2. Know when to quit: Have the robot say "I don't know" instead of guessing.
  3. Keep the good stuff: Still read most of the text correctly.

In short: The paper teaches us that for AI to be truly reliable in the real world, we shouldn't just make the AI smarter. We need to build a safety net that checks its work, forces it to agree with itself, and stops it from showing us answers when it's just guessing.
