VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

This paper introduces VisBrowse-Bench, a new benchmark and agent workflow designed to evaluate the visual reasoning capabilities of multimodal browsing agents through visual-native search, revealing that even state-of-the-art models struggle to achieve high accuracy on these tasks.

Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Y
Published 2026-03-18

Imagine you are hiring a detective to solve a mystery. In the past, you'd give the detective a stack of text files and ask them to find the answer. But the real world isn't just text; it's full of pictures, maps, and visual clues.

This paper introduces a new, tougher test for AI detectives called VisBrowse-Bench. Here is the breakdown in simple terms:

1. The Problem: The "Lazy Detective"

Current AI models (the detectives) are great at reading text, but they often cheat when it comes to pictures.

  • The Old Way: If you showed an AI a picture of a building and asked, "What was this before it was a museum?", the AI would just feed the photo into a "reverse image search" tool and read the Wikipedia title that pops up. It never actually looked at the building; it just used a tool to find the name.
  • The Flaw: Once the AI gets the name (e.g., "Sapporo Beer Museum"), it stops looking at pictures. It switches to a text-only search to find the rest of the answer. It treats the visual world like a textbook, ignoring the fact that real websites are full of images, charts, and visual layouts that you have to see to understand.

The Analogy: It's like asking a student to solve a math problem by looking at a graph, but the student just reads the title of the graph and then closes their eyes to do the math in their head. They miss the actual data.

2. The Solution: A New "Visual-Native" Test

The authors created VisBrowse-Bench, a new exam designed to force the AI to actually use its eyes throughout the whole process.

  • The Setup: They created 169 tricky questions. Each question starts with a picture, but the answer isn't in the picture's caption.
  • The Trap: To get the answer, the AI has to:
    1. Look at the first picture to find a clue.
    2. Go to a website.
    3. Find a new picture on that website that holds the next clue.
    4. Look at that new picture to find a specific detail (like the color of a tie or the number on a jersey).
    5. Go to another website, find another picture, and so on.

The Analogy: Imagine a scavenger hunt where you can't just read the clues; you have to find specific objects hidden in photos on different websites. If you close your eyes and just read the text, you fail. You have to keep your "eyes" open the whole time.

3. The New "Toolbelt"

To help the AI, the researchers gave it a special digital toolbelt with five tools:

  1. Text Search: Like Googling words.
  2. Image Search: Like finding pictures based on a description.
  3. Reverse Image Search: Like uploading a photo to find where it came from.
  4. Image Crop: Like using a magnifying glass to zoom in on a tiny part of a photo.
  5. Webpage Visit: Like opening a link to read the page.

The goal was to see if the AI could pick the right tool at the right time and switch between looking at text and looking at pictures seamlessly.
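
To make this concrete, here is a minimal sketch of what such a toolbelt could look like if it were exposed to an agent as plain Python callables. The names, signatures, and the SearchResult type are illustrative assumptions for explanation, not the paper's actual interface:

```python
# Illustrative sketch only: the five-tool "toolbelt" written as Python stubs.
# Every name and signature here is an assumption, not the benchmark's real API.
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def text_search(query: str) -> list[SearchResult]:
    """Tool 1: search the web with keywords (like Googling words)."""
    raise NotImplementedError("stub: plug in a real search API")


def image_search(query: str) -> list[SearchResult]:
    """Tool 2: find images that match a text description."""
    raise NotImplementedError


def reverse_image_search(image: bytes) -> list[SearchResult]:
    """Tool 3: upload a photo and find pages where it (or something similar) appears."""
    raise NotImplementedError


def image_crop(image: bytes, box: tuple[int, int, int, int]) -> bytes:
    """Tool 4: the magnifying glass; crop to (left, top, right, bottom) to zoom in on a detail."""
    raise NotImplementedError


def webpage_visit(url: str) -> str:
    """Tool 5: open a link and return the page content, text and images alike."""
    raise NotImplementedError
```

The tools themselves are not the hard part. What the benchmark measures is the switching: a question can only be solved if the agent keeps alternating between the text tools and the visual ones instead of settling into a text-only routine.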

4. The Results: The AI is Still Struggling

The researchers tested the smartest AI models in the world (like Claude, GPT, and Gemini) on this new test.

  • The Score: Even the best AI (Claude-4.6-Opus) only got about 47.6% of the answers right. That's less than half, which would be a failing grade on most exams.
  • The Surprise: A specialized "Deep Research" model (o3) only got 41.1%.
  • Why? The AI models are still too comfortable with text. When they get stuck, they often try to guess or rely on what they "remembered" from their training data instead of going back to look at the pictures again. They struggle to connect the dots between a picture on Page A and a picture on Page B.

5. A Real-Life Example from the Paper

  • The Task: "In this picture, a person is holding a magic wand. Find the movie poster for the first film in this series. Who is the character below her, and what race are they?"
  • The AI's Journey (written out as a tool-call sketch after this list):
    1. Look at the first image → Realize it's Hermione from Harry Potter.
    2. Search for the first movie poster → Find the poster.
    3. Look at the poster → See a giant character below her.
    4. Search for that character → Realize it's Hagrid.
    5. Search for Hagrid's race → Answer: "Half-giant."
  • The Failure: Many AIs got stuck at step 3 or 4. They found the movie but couldn't "see" the character below Hermione in the poster, or they couldn't link the visual of the character to the text "Hagrid."
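
Reusing the illustrative toolbelt sketched above, the successful journey can be written down as a short tool-call trace. This is a hand-made illustration of the reasoning chain, not the agent's actual output; the queries, URLs, and crop box are invented for the example:

```python
# Hand-written illustration of the Hermione/Hagrid walkthrough as tool calls.
# The arguments are invented; only the *shape* of the trace matters:
# the agent has to alternate between visual steps and text steps.
ideal_trace = [
    # Step 1: the agent's own vision recognises the starting image
    #         (Hermione from Harry Potter), so no tool call is needed yet.
    ("image_search", "Harry Potter and the Philosopher's Stone poster"),      # Step 2
    ("webpage_visit", "<url of the poster result chosen above>"),
    ("image_crop", "<poster image>, box around the large figure below her"),  # Step 3
    ("reverse_image_search", "<the cropped figure>"),                         # Step 4: Hagrid
    ("text_search", "Rubeus Hagrid race"),                                    # Step 5
]
final_answer = "Half-giant"
```

The reported failures correspond to dropping out of this trace at the visual steps: never zooming in on the figure in the poster, or never linking that cropped figure back to the name "Hagrid."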

The Big Takeaway

This paper is a wake-up call. We are building AI agents that can browse the web, but we are testing them like they are just reading books. The real internet is a visual place. To build truly smart AI, we need to stop letting them "cheat" by just reading text and force them to learn how to look, zoom, and reason with pictures just as humans do.

The authors have made their test and their code public so other scientists can try to build better "visual detectives."
