Here is an explanation of the paper "Traceable Evidence Enhanced Visual Grounded Reasoning" using simple language and everyday analogies.
The Big Picture: "Thinking with Images" vs. "Guessing with Words"
Imagine you are taking a test where you have to look at a very crowded, messy room and answer a tricky question about it.
- Old AI Models are like students who are great at memorizing textbooks but terrible at looking at the room. When asked, "What color is the tiny blue button on the left shoe of the person in the back row?", they might guess "Blue" because they've seen that phrase in their training data, even if the button is actually red or the person isn't there. They are hallucinating based on text patterns.
- The New Goal (OpenAI-o3 style): We want AI that can actually look at the room, point to the specific shoe, zoom in on the button, and then answer. This is called "Visual Grounded Reasoning" or "Thinking with Images."
The problem? Until now, we didn't have a good way to test if the AI was actually looking or just guessing.
Part 1: The New Test (TreeBench)
The Problem: Existing tests were too easy or didn't check how the AI got the answer. It was like grading a math test only on the final number, without checking if the student actually did the work or just copied the answer from the back of the book.
The Solution: TreeBench (The "Detective's Notebook")
The authors created a new, super-hard test called TreeBench. Think of it as a "Detective's Exam" for AI.
- The Scene: They use photos of incredibly busy, cluttered scenes (like a busy street or a crowded market) with thousands of tiny objects.
- The Task: The AI has to find a specific, tiny detail (like a "pink shoe" or a "broken bottle") hidden in the mess.
- The Twist (Traceable Evidence): The AI isn't allowed to just say "Pink." It must draw a box around the object it found before it gives the answer.
- Analogy: Imagine a detective solving a crime. They can't just say "The butler did it." They have to point to the specific fingerprint on the gun and say, "I found this here, so the butler did it."
- The Difficulty: Even the strongest AI models (like OpenAI-o3) struggled on this test, scoring below 60%. They often pointed at the wrong object or missed the tiny detail entirely.
Why it matters: This test proves that current AI is still bad at "looking" closely and connecting what it sees to what it says.
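To make the "draw a box first" idea concrete, here is a minimal sketch of how an evidence-aware grader could work. It assumes boxes are given as `(x1, y1, x2, y2)` pixel coordinates; the `iou_threshold` of 0.5 and the function names are illustrative choices, not TreeBench's exact scoring protocol.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grade(pred_answer, pred_box, gt_answer, gt_box, iou_threshold=0.5):
    """A prediction passes only if the answer is right AND the evidence
    box overlaps the ground truth enough (threshold is illustrative)."""
    return pred_answer == gt_answer and iou(pred_box, gt_box) >= iou_threshold
```

The key design point: a correct answer with a wrong box fails, so the model cannot get credit for "guessing with words" alone.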
Part 2: The New Training Method (TreeVGR)
The Problem: How do we teach an AI to stop guessing and start looking? Previous methods taught the AI to just get the right answer. If the AI guessed the right answer but pointed at the wrong object, it still got a "Good Job!" sticker. This reinforced bad habits.
The Solution: TreeVGR (The "Strict Coach")
The authors built a new training system called TreeVGR. Think of this as a strict coach who doesn't care if you get the answer right unless you show your work.
- The "Cold Start" (Learning the Rules): First, they taught the AI how to draw boxes and write down its reasoning in the right format, just like a student learning how to structure an essay before being graded on content.
- The "Reinforcement Learning" (The Reward System): This is the magic part. The AI plays a game where it gets points (rewards) for two things:
- Accuracy: Did it get the right answer?
- The "Box Score" (IoU, short for Intersection over Union): Did it draw the box in the exact right spot? IoU measures how much the predicted box overlaps the true one, from 0 (no overlap) to 1 (perfect match).
- Analogy: Imagine a game of "Hot and Cold." If the AI draws a box around the whole room, it gets zero points. If it draws a box around the specific tiny button, it gets a huge reward. If it draws a box around the wrong button, it gets punished.
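The reward system above can be sketched in a few lines. This is a simplified illustration, not the paper's exact reward function: the equal weighting of the two terms and the function names are assumptions made for clarity.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes: 0 = no overlap, 1 = exact match."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def reward(pred_answer, pred_box, gt_answer, gt_box):
    """Illustrative dual reward: one term for answer accuracy,
    one for how well the evidence box matches (weights assumed)."""
    accuracy = 1.0 if pred_answer == gt_answer else 0.0
    localization = iou(pred_box, gt_box)  # "hot and cold" signal
    return accuracy + localization
```

Note how the IoU term captures the "Hot and Cold" analogy: a box around the whole room has tiny overlap with the tiny button's box, so it earns almost nothing, while a tight box earns nearly the full localization reward.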
The Result:
By forcing the AI to be precise with its "boxes" (evidence), the AI learned to actually look at the image before answering.
- Before: The AI was like a student who memorized the answer key.
- After (TreeVGR): The AI is like a student who actually studied the textbook and can explain why the answer is correct.
The Takeaway
This paper introduces two things that change the game:
- TreeBench: A "hard mode" exam that forces AI to prove it can see tiny details in messy scenes, not just guess based on words.
- TreeVGR: A new way to train AI that acts like a strict teacher, rewarding the AI only when it points to the right evidence before giving an answer.
The Bottom Line: To make AI truly "smart" about the visual world, we can't just ask it questions; we have to force it to show us its work. If it can't point to the evidence, it hasn't really understood the picture.