VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

This paper introduces VisBrowse-Bench, a new benchmark and agent workflow designed to evaluate the visual reasoning capabilities of multimodal browsing agents through visual-native search, revealing that even state-of-the-art models struggle to achieve high accuracy on these tasks.

Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Y
Published 2026-03-18

Imagine you are hiring a detective to solve a mystery. In the past, you'd give the detective a stack of text files and ask them to find the answer. But the real world isn't just text; it's full of pictures, maps, and visual clues.

This paper introduces a new, tougher test for AI detectives called VisBrowse-Bench. Here is the breakdown in simple terms:

1. The Problem: The "Lazy Detective"

Current AI models (the detectives) are great at reading text, but they often cheat when it comes to pictures.

  • The Old Way: If you showed an AI a picture of a building and asked, "What was this before it was a museum?", the AI would just feed the photo into a "reverse image search" tool and read the Wikipedia title that pops up. It never actually looked at the building; it just used a tool to find the name.
  • The Flaw: Once the AI gets the name (e.g., "Sapporo Beer Museum"), it stops looking at pictures. It switches to a text-only search to find the rest of the answer. It treats the visual world like a textbook, ignoring the fact that real websites are full of images, charts, and visual layouts that you have to see to understand.

The Analogy: It's like asking a student to solve a math problem by looking at a graph, but the student just reads the title of the graph and then closes their eyes to do the math in their head. They miss the actual data.

2. The Solution: A New "Visual-Native" Test

The authors created VisBrowse-Bench, a new exam designed to force the AI to actually use its eyes throughout the whole process.

  • The Setup: They created 169 tricky questions. Each question starts with a picture, but the answer isn't in the picture's caption.
  • The Trap: To get the answer, the AI has to:
    1. Look at the first picture to find a clue.
    2. Go to a website.
    3. Find a new picture on that website that holds the next clue.
    4. Look at that new picture to find a specific detail (like the color of a tie or the number on a jersey).
    5. Go to another website, find another picture, and so on.

The Analogy: Imagine a scavenger hunt where you can't just read the clues; you have to find specific objects hidden in photos on different websites. If you close your eyes and just read the text, you fail. You have to keep your "eyes" open the whole time.

3. The New "Toolbelt"

To help the AI, the researchers gave it a special digital toolbelt with five tools:

  1. Text Search: Like Googling words.
  2. Image Search: Like finding pictures based on a description.
  3. Reverse Image Search: Like uploading a photo to find where it came from.
  4. Image Crop: Like using a magnifying glass to zoom in on a tiny part of a photo.
  5. Webpage Visit: Like opening a link to read the page.

The goal was to see if the AI could pick the right tool at the right time and switch between looking at text and looking at pictures seamlessly.
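
To make this concrete, here is a minimal sketch of what such a toolbelt could look like if it were exposed to an agent as plain Python callables. The names, signatures, and the SearchResult type are illustrative assumptions for explanation, not the paper's actual interface:

```python
# Illustrative sketch only: the five-tool "toolbelt" written as Python stubs.
# Every name and signature here is an assumption, not the benchmark's real API.
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def text_search(query: str) -> list[SearchResult]:
    """Tool 1: search the web with keywords (like Googling words)."""
    raise NotImplementedError("stub: plug in a real search API")


def image_search(query: str) -> list[SearchResult]:
    """Tool 2: find images that match a text description."""
    raise NotImplementedError


def reverse_image_search(image: bytes) -> list[SearchResult]:
    """Tool 3: upload a photo and find pages where it (or something similar) appears."""
    raise NotImplementedError


def image_crop(image: bytes, box: tuple[int, int, int, int]) -> bytes:
    """Tool 4: the magnifying glass; crop to (left, top, right, bottom) to zoom in on a detail."""
    raise NotImplementedError


def webpage_visit(url: str) -> str:
    """Tool 5: open a link and return the page content, text and images alike."""
    raise NotImplementedError
```

The tools themselves are not the hard part. What the benchmark measures is the switching: a question can only be solved if the agent keeps alternating between the text tools and the visual ones instead of settling into a text-only routine.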

4. The Results: The AI is Still Struggling

The researchers tested the smartest AI models in the world (like Claude, GPT, and Gemini) on this new test.

  • The Score: Even the best AI (Claude-4.6-Opus) only got about 47.6% of the answers right. That's less than half, which would be a failing grade on most exams.
  • The Surprise: A specialized "Deep Research" model (o3) only got 41.1%.
  • Why? The AI models are still too comfortable with text. When they get stuck, they often try to guess or rely on what they "remembered" from their training data instead of going back to look at the pictures again. They struggle to connect the dots between a picture on Page A and a picture on Page B.

5. A Real-Life Example from the Paper

  • The Task: "In this picture, a person is holding a magic wand. Find the movie poster for the first film in this series. Who is the character below her, and what race are they?"
  • The AI's Journey (written out as a tool-call sketch after this list):
    1. Look at the first image → Realize it's Hermione from Harry Potter.
    2. Search for the first movie poster → Find the poster.
    3. Look at the poster → See a giant character below her.
    4. Search for that character → Realize it's Hagrid.
    5. Search for Hagrid's race → Answer: "Half-giant."
  • The Failure: Many AIs got stuck at step 3 or 4. They found the movie but couldn't "see" the character below Hermione in the poster, or they couldn't link the visual of the character to the text "Hagrid."
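
Reusing the illustrative toolbelt sketched above, the successful journey can be written down as a short tool-call trace. This is a hand-made illustration of the reasoning chain, not the agent's actual output; the queries, URLs, and crop box are invented for the example:

```python
# Hand-written illustration of the Hermione/Hagrid walkthrough as tool calls.
# The arguments are invented; only the *shape* of the trace matters:
# the agent has to alternate between visual steps and text steps.
ideal_trace = [
    # Step 1: the agent's own vision recognises the starting image
    #         (Hermione from Harry Potter), so no tool call is needed yet.
    ("image_search", "Harry Potter and the Philosopher's Stone poster"),      # Step 2
    ("webpage_visit", "<url of the poster result chosen above>"),
    ("image_crop", "<poster image>, box around the large figure below her"),  # Step 3
    ("reverse_image_search", "<the cropped figure>"),                         # Step 4: Hagrid
    ("text_search", "Rubeus Hagrid race"),                                    # Step 5
]
final_answer = "Half-giant"
```

The reported failures correspond to dropping out of this trace at the visual steps: never zooming in on the figure in the poster, or never linking that cropped figure back to the name "Hagrid."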

The Big Takeaway

This paper is a wake-up call. We are building AI agents that can browse the web, but we are testing them like they are just reading books. The real internet is a visual place. To build truly smart AI, we need to stop letting them "cheat" by just reading text and force them to learn how to look, zoom, and reason with pictures just as humans do.

The authors have made their test and their code public so other scientists can try to build better "visual detectives."
