Imagine you are a detective trying to solve a complex mystery based on a stack of old, blurry documents, a few confusing charts, and a series of photos.
In the world of Artificial Intelligence, current "smart" models (like the ones you chat with) are often like confident detectives who guess. They look at the evidence, make a quick guess about what a blurry letter says, write it down, and then build their entire theory on that guess. If they got that first letter wrong, their whole theory collapses, but they won't admit it—they'll just confidently explain why their wrong guess makes sense. This is called "hallucination."
The paper you shared introduces a new system called Proof-of-Perception (PoP). Think of PoP not as a single detective, but as a highly organized, cautious investigation team with a strict set of rules.
Here is how it works, broken down into simple concepts:
1. The "Safety Net" (Conformal Sets)
Instead of the detective saying, "I am 100% sure that letter is an 'A'," PoP says, "Based on the evidence, this letter is likely an 'A', but it could also be a '4' or an 'H'."
- The Analogy: Imagine a fishing net. A normal AI casts a single hook and hopes it catches the right fish. PoP casts a net. It catches a small group of possible answers (a "set").
- The Guarantee: The system has a mathematical promise (a "certificate") that says, "We are 90% sure the correct answer is inside this net." And if the net has to stretch very wide to keep that promise, the system knows it's in trouble before it makes a final claim.
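Under the hood, this "net" is what statisticians call a conformal prediction set. The paper's exact recipe isn't reproduced here, but the standard "split conformal" version of the idea fits in a few lines. Everything below is a toy illustration: the calibration scores are random numbers and the OCR labels and probabilities are invented.

```python
import numpy as np

def conformal_set(cal_scores, test_probs, alpha=0.1):
    """Build a set of labels that contains the true answer with
    probability >= 1 - alpha (here 90%), given nonconformity scores
    from a held-out calibration set (score = 1 - p(true label))."""
    n = len(cal_scores)
    # Quantile level with the usual finite-sample correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, min(q_level, 1.0), method="higher")
    # Keep every label whose nonconformity score clears the threshold.
    return [label for label, p in enumerate(test_probs) if 1 - p <= qhat]

# Toy OCR example: the model thinks a blurry glyph is probably "A",
# but "4" and "H" are close behind.
rng = np.random.default_rng(0)
cal_scores = rng.uniform(0, 1, size=500)   # made-up calibration scores
labels = ["A", "4", "H", "B"]
test_probs = [0.55, 0.25, 0.15, 0.05]
pred_set = [labels[i] for i in conformal_set(cal_scores, test_probs)]
print(pred_set)  # a small "net" of plausible readings, not one guess
```

The key design point: the quantile is computed once on held-out data, so the 90% guarantee holds no matter how badly calibrated the model's raw probabilities are.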
2. The "Step-by-Step" Map (The Graph)
Instead of rushing to the final answer, PoP breaks the problem down into a map of small tasks.
- Step 1: Read the text (OCR).
- Step 2: Find the specific object in the picture.
- Step 3: Read the numbers on the chart.
- Step 4: Combine these facts to answer the question.
At every single step, the team checks their "net." If the net for Step 1 (reading the text) is shaky, they don't move on to the later steps. They stay right there and fix Step 1.
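In code, the "stay right there and fix Step 1" rule might look something like the hypothetical sketch below. The step names, set sizes, and retry limit are all illustrative, not taken from the paper:

```python
# Each step returns a set of candidate answers (its "net").
# The pipeline only advances when the net is tight enough.

def run_pipeline(steps, max_set_size=2, max_retries=3):
    """steps: list of (name, fn) pairs where fn() -> set of candidates."""
    facts = {}
    for name, fn in steps:
        for attempt in range(max_retries):
            candidates = fn()
            if len(candidates) <= max_set_size:
                facts[name] = candidates
                break  # net is tight enough: move to the next step
        else:
            # The net never tightened: abstain instead of guessing.
            return {"status": "abstain", "stuck_at": name, "facts": facts}
    return {"status": "ok", "facts": facts}

# Toy steps: OCR is shaky on the first read, then narrows on a re-read.
ocr_attempts = iter([{"A", "4", "H"}, {"A", "4"}])
steps = [
    ("ocr",   lambda: next(ocr_attempts)),
    ("chart", lambda: {42}),
]
result = run_pipeline(steps)
print(result)
```

Notice that a failure doesn't produce a wrong answer: it produces an explicit "stuck at step X" report, which is exactly the "admit it" behavior described above.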
3. The "Budget Manager" (The Controller)
This is the smartest part. Imagine you have a limited amount of money to spend on this investigation. You can't call every expert in the world; you have to be efficient.
- The Normal Way: Most AI systems either stop too early (saving money but getting the answer wrong) or keep asking questions forever (getting the right answer but wasting time/money).
- The PoP Way: The "Manager" looks at the safety nets.
- Scenario A: The net is tight and confident. "Great, we know this part. Let's move on." (Saves money).
- Scenario B: The net is loose and wobbly. "Uh oh, we aren't sure about this chart number. Let's spend extra money to get a higher-resolution photo and try again." (Spends money only when necessary).
This ensures the system is efficient. It doesn't waste energy on things it already understands, but it spends extra effort exactly where it is confused.
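A toy version of this manager, purely illustrative (the step names, set sizes, and one-unit "cost" per re-read are invented, not from the paper), could look like:

```python
# Spend extra compute only on steps whose net is still wide,
# most-uncertain first, until the budget runs out.

def allocate(set_sizes, budget, tight=2):
    """set_sizes: {step name: current conformal-set size}.
    Returns (extra spend per step, budget left over)."""
    spend = {name: 0 for name in set_sizes}
    # Confident steps (small nets) cost nothing extra.
    shaky = sorted((s for s in set_sizes if set_sizes[s] > tight),
                   key=set_sizes.get, reverse=True)  # widest net first
    for name in shaky:
        if budget == 0:
            break
        cost = 1              # e.g. one higher-resolution re-read
        spend[name] += cost
        budget -= cost
    return spend, budget

sizes = {"ocr": 5, "detect": 1, "chart": 3}  # net size per step
spend, left = allocate(sizes, budget=1)
print(spend, left)  # only the shakiest step gets the extra money
```

With a budget of 1, only "ocr" (the widest net) gets a re-read; "detect" is already confident and costs nothing.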
4. The "Devil's Advocate" (Self-Play)
To make sure the team is ready for anything, the system trains itself by playing a game against a "villain" version of itself.
- The villain tries to trick the team by blurring the text, changing the fonts, or adding distracting objects to the photos.
- The team learns to spot these tricks and adjust their "nets" to be wider when things look weird. This makes them very robust in the real world.
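One simple way to picture this widen-the-net-under-attack behavior (again a hypothetical sketch, not the paper's actual training procedure): corrupt the calibration data, check whether the 90% promise still holds, and stretch the net until it does.

```python
import random

def coverage(qhat, scores):
    """Fraction of true answers whose score falls inside the net."""
    return sum(s <= qhat for s in scores) / len(scores)

random.seed(0)
clean_scores = [random.uniform(0, 0.5) for _ in range(1000)]
qhat = sorted(clean_scores)[int(0.9 * len(clean_scores))]  # ~90% on clean data

# Villain round: blur and noise push the nonconformity scores up.
attacked_scores = [s + random.uniform(0, 0.3) for s in clean_scores]
while coverage(qhat, attacked_scores) < 0.9:
    qhat += 0.01  # widen the net a little and re-check
print(round(qhat, 2))  # a wider net that survives the villain's tricks
```

The end state is exactly the behavior described above: when things look weird, the net gets wider instead of the answer getting more confident.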
Why Does This Matter?
- No More "Confident Wrongness": If the system isn't sure, it admits it by showing a range of possibilities or asking for more help, rather than lying.
- Evidence-Based: Every answer comes with a "receipt" showing exactly which part of the image or text it used to find the answer. You can verify the work.
- Cost-Effective: It uses computer power smarter, saving money and time by only digging deeper when absolutely necessary.
In a nutshell:
Proof-of-Perception turns AI from a confident guesser into a careful, evidence-checking accountant. It doesn't just give you an answer; it gives you a verified receipt, a safety net, and a plan to spend its energy only where it counts.