VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

Imagine you have a very smart, well-read librarian named VSearcher.

In the past, this librarian was like a walking encyclopedia. They knew everything written in books up to a certain date. If you asked them, "Who won the World Cup in 1998?" they could answer instantly. But if you asked, "What's the weather in Tokyo right now?" or "Show me a picture of that weird bird I just saw," they would be stuck. They couldn't leave the library, they couldn't use the internet, and they couldn't see the world outside their books.

VSearcher is the upgrade that turns this static librarian into a super-sleuth detective who can actually go out into the real world, use tools, and solve complex mysteries.

Here is how the paper explains this transformation, broken down into simple steps:

1. The Problem: The "Static" Librarian

Most current AI models are like that encyclopedia librarian. They are great at reading and talking, but they are "blind" to the real world. They can't look at a photo you take and say, "Oh, that's a rare orchid!" and then search for it. They also can't browse the web to find the latest news. They are stuck with what they memorized during training.

2. The Solution: Teaching the Detective to Hunt

The authors created VSearcher, a model that doesn't just "know" things; it knows how to find things. It can:

Read text (like a normal search).
Look at images (like a reverse image search).
Visit websites (like clicking a link and reading the page).
Do all of this in a long chain of steps (e.g., "Find the bird in the photo" $\rightarrow$ "Search for its name" $\rightarrow$ "Find its habitat" $\rightarrow$ "Check if it's endangered").

3. How They Trained It: The "Simulated Training Camp"

You can't just tell a robot to "go learn." You have to teach it. The paper describes a three-step training process that sounds like a video game level design:

Step A: Building the "Obstacle Course" (Data Synthesis)

To teach the detective, you need hard puzzles. The authors built a machine that automatically creates super-hard riddles.

The Analogy: Imagine taking a simple question like "Who is the President?" and slowly turning it into a mystery.
- Round 1: "Who is the President of the country that won the 1998 World Cup?"
- Round 2: "Who is the President of the country whose capital city has a statue of a man who invented the lightbulb?"
- Round 3 (The Multimodal Twist): They take a photo of a specific, obscure object and ask, "Who is the President of the country where this object (shown in the image) is a national symbol?"
They create thousands of these puzzles, making sure they are so hard that the AI must use the internet to solve them.

Step B: The "Shadowing" Phase (Rejection Sampling)

Now, they need a teacher. They used a very powerful, expensive AI (like a "Grandmaster Detective") to solve these puzzles first.

The Analogy: The Grandmaster solves the puzzle step-by-step. If the Grandmaster gets the answer wrong, that attempt is thrown in the trash. If they get it right, the AI student (VSearcher) studies that perfect solution.
This teaches VSearcher the habit of using tools correctly before it tries to learn on its own.

Step C: The "Real-World Gym" (Reinforcement Learning)

This is the secret sauce. The AI is now sent onto the live web, with real search and browsing tools, to practice on its own.

The Analogy: Imagine the AI is playing a game where it gets a point only if it finds the correct answer. If it guesses wrong or gets stuck, it gets zero points.
It tries, fails, tries again, and eventually learns: "Hey, when I see a weird picture, I should use the 'Image Search' tool first, not just guess." Over many tries, it becomes a master at navigating the web.

4. The Result: The New Champion

The authors tested VSearcher against other smart AIs and even some expensive, proprietary models (like the ones from big tech companies).

The Outcome: VSearcher beat the other open-source models and, on some benchmarks, the proprietary ones too. It solved complex, multi-step visual and text puzzles that stumped the others.
Why? Because it wasn't just memorizing facts; it learned the skill of searching, just like a human detective learns to follow clues.

Summary Metaphor

Think of other AI models as Tourists with a guidebook. They can tell you about the Eiffel Tower because they read about it.
VSearcher is the Local Guide who has a map, a camera, and a phone. If you show them a photo of a strange street sign, they don't just guess; they take a picture, search for the language, find the street name, look up the history of that building, and tell you exactly where you are.

The paper's results suggest that by training AI to act and search in the real world, rather than just thinking in a vacuum, we can build agents that are truly helpful for complex, real-life problems.

1. Problem Statement

Large Language Models (LLMs) have evolved into autonomous agents that use tools (e.g., search, browsing) to augment their static knowledge, but most current research focuses on text-only agents. These agents struggle with real-world scenarios that require visual perception. Conversely, Multimodal Large Language Models (MLLMs) possess strong perceptual abilities but are typically limited to static knowledge: they cannot access up-to-date web information or perform long-horizon, multi-turn tool use.

The Core Challenge: How to transform a static MLLM into an autonomous agent capable of long-horizon, multi-turn reasoning that dynamically integrates text search, image search, and web browsing to solve complex, real-world multimodal queries.

2. Methodology

The authors propose VSearcher, a comprehensive post-training framework that converts a base MLLM into a multimodal search agent. The pipeline consists of three main stages:

A. Iterative Injection-based Data Synthesis

To train an agent for complex tasks, high-quality, difficult training data is required. The authors propose a fully automated pipeline to generate large-scale multimodal QA pairs:

Seed Selection: Rare entities are selected from Wikidata using SPARQL rules (low "sitelinks" for rarity, high "statements" for information density).
Initial QA Generation: A simple text-only QA pair is generated from the entity's Wikipedia content.
Text Information Injection (Iterative): The question is progressively complicated over multiple rounds (1 for Easy, 3 for Medium, 5 for Hard). In each round, an entity is selected, its specific (rare) Wikipedia information is extracted, and the original question is transformed by hiding the entity and replacing it with the extracted text. This forces the model to search for the entity rather than relying on static knowledge.
Image Injection: A critical entity in the final question is selected, its Wikipedia image is retrieved, and the text mention is replaced with a phrase like "shown in the image." This ensures the image is essential for solving the problem.
Filtering: A rigorous filtering strategy removes samples that can be answered by weaker models without tools, those where the image is too simple, or where the answer is accidentally revealed in the text.
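The injection loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Entity` type, the `inject_text` and `inject_image` helpers, and the toy entities are all hypothetical, and plain string replacement stands in for what would in practice be a model-driven rewriting step. Only the round counts (1, 3, 5) come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Rounds of text injection per difficulty level, as described above.
ROUNDS = {"easy": 1, "medium": 3, "hard": 5}


@dataclass
class Entity:
    name: str         # surface form that appears in the question
    kind: str         # coarse type, used only to phrase the image reference
    description: str  # rare fact that identifies the entity without naming it
    next_entity: Optional["Entity"] = None  # entity mentioned in the description


def inject_text(question: str, entity: Entity) -> str:
    """Hide the entity name behind its description (stand-in for an LLM rewrite)."""
    return question.replace(entity.name, entity.description)


def inject_image(question: str, entity: Entity) -> str:
    """Replace the text mention with a pointer to the accompanying image."""
    return question.replace(entity.name, f"the {entity.kind} shown in the image")


def synthesize(question: str, seed: Entity, difficulty: str) -> str:
    entity: Optional[Entity] = seed
    for _ in range(ROUNDS[difficulty]):
        if entity is None or entity.name not in question:
            break  # nothing left to hide
        question = inject_text(question, entity)
        entity = entity.next_entity
    # After the text rounds, hide one remaining entity behind the image.
    if entity is not None and entity.name in question:
        question = inject_image(question, entity)
    return question


# Toy, made-up entities for illustration only.
bridge = Entity("Example Bridge", "bridge", "the crossing named in a made-up gazetteer")
engineer = Entity("Ada Example", "person",
                  "the engineer who designed Example Bridge", bridge)

result = synthesize("In which year was Ada Example born?", engineer, "easy")
print(result)
# In which year was the engineer who designed the bridge shown in the image born?
```

Each text round removes one name that a model could have memorized and replaces it with a clue that must be resolved by searching. The final image step makes the picture indispensable, because the bridge is no longer named anywhere in the text.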

B. Rejection Sampling Fine-Tuning (RFT)

This stage "cold-starts" the base model with multi-turn tool-use capabilities:

Teacher Model: A powerful proprietary model (Gemini-3-Pro-Thinking) is used to generate trajectories (ReAct loops: Reasoning $\to$ Action $\to$ Observation) for the synthesized tasks.
Rejection Sampling: Trajectories ending in incorrect answers are discarded. Only high-quality, correct trajectories are retained.
Supervised Fine-Tuning (SFT): The base MLLM is fine-tuned on these high-quality trajectories to learn the initial behavior of calling tools (Image Search, Text Search, Visit) and reasoning iteratively.
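The rejection step above amounts to a filter over teacher trajectories. The sketch below is a hedged illustration: the `Step` and `Trajectory` types, the tool-name strings, and the exact-match `is_correct` check are assumptions made for the example. The source does not specify how correctness is checked at this stage, so a real pipeline may well use a stronger judge.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    thought: str      # reasoning text produced before acting
    action: str       # hypothetical tool name, e.g. "image_search", "text_search", "visit"
    observation: str  # what the tool returned


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""


def is_correct(predicted: str, reference: str) -> bool:
    """Toy check: case-insensitive exact match (an assumption, not the paper's judge)."""
    return predicted.strip().lower() == reference.strip().lower()


def rejection_sample(trajectories: List[Trajectory],
                     references: Dict[str, str]) -> List[Trajectory]:
    """Keep only trajectories whose final answer matches the reference answer."""
    return [t for t in trajectories
            if is_correct(t.final_answer, references[t.question])]


# Two teacher attempts at the same made-up question; only the first is kept.
references = {"q1": "Paris"}
attempts = [
    Trajectory("q1", [Step("look it up", "text_search", "...")], "paris "),
    Trajectory("q1", [Step("guess", "text_search", "...")], "Lyon"),
]
kept = rejection_sample(attempts, references)
print(len(kept))  # 1
```

The surviving trajectories, including their intermediate thoughts, actions, and observations, would then serve as the supervised fine-tuning targets.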

C. Reinforcement Learning (RL)

To generalize the agent's capabilities in real-world environments:

Algorithm: The authors use Group Relative Policy Optimization (GRPO).
Environment: Training occurs in real-world web environments using actual APIs (Google Custom Search, Google Vision Web Detection, JINA for page summarization).
Reward Function: The reward is binary ($1$ for a correct final answer, $0$ for an incorrect one), verified by an LLM-as-a-Judge.
Constraints: Strict format checking is enforced during rollouts to ensure valid tool-calling syntax. The process encourages the agent to adaptively select tools and navigate complex web sources to find the answer.
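To make the interaction between the binary reward and GRPO concrete, the sketch below computes group-relative advantages in the standard GRPO form: each rollout's reward is normalized by the mean and standard deviation of its group, where a group is a set of rollouts for the same question. The function names and the `eps` value are assumptions for illustration, and the source does not state which GRPO variant or hyperparameters the authors use.

```python
import statistics
from typing import List


def binary_reward(judged_correct: bool) -> float:
    """1.0 if the judge accepts the final answer, else 0.0."""
    return 1.0 if judged_correct else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the group's mean and (population) standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one question: two judged correct, two judged incorrect.
rewards = [binary_reward(ok) for ok in (True, False, False, True)]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# If every rollout in a group gets the same reward, all advantages are zero,
# so that group contributes no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

The second case suggests why task difficulty matters with a binary reward: a question that every rollout solves, or that none solves, produces zero advantages for the whole group.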

3. Key Contributions

Iterative Injection-based Data Synthesis: A novel, fully automated pipeline to generate large-scale, high-difficulty multimodal browsing tasks that require deep reasoning and tool use, overcoming the scarcity of such data.
Comprehensive Post-Training Pipeline: A unified framework combining Rejection Sampling Fine-Tuning (to instill initial tool-use skills from a teacher) and Reinforcement Learning (to generalize these skills in real web environments).
MM-SearchExam Benchmark: A new, highly challenging benchmark curated from the synthesis pipeline (283 tasks) designed to evaluate long-horizon multimodal search. It is difficult enough that recent proprietary models (e.g., GPT-5, Gemini-3-Pro) achieve low accuracy.
State-of-the-Art Performance: VSearcher demonstrates superior performance compared to existing open-source agentic models and even surpasses several proprietary models on multimodal web search tasks.

4. Experimental Results

The authors evaluated VSearcher across five benchmarks: MMSearch, BrowseComp-VL, MM-BrowseComp, SimpleVQA, and MM-SearchExam.

Performance: VSearcher outperforms strong open-source baselines (e.g., Qwen3-VL, InternVL3.5) and recent agentic models (e.g., MMSearch-R1, DeepEyesV2).
Comparison with Proprietary Models: Notably, VSearcher surpasses GPT-5 and Gemini-3-Pro on specific benchmarks like MMSearch and BrowseComp-VL.
Ablation Studies:
- RFT vs. RL: The model shows progressive improvement: Base Model $\to$ RFT (significant gain in tool-use capability) $\to$ RL (further gains in accuracy and adaptability).
- Tool Usage Analysis: During RL training, the agent learns to rely heavily on Text Search for information gathering and increasingly uses the Visit tool to verify intermediate conclusions, while Image Search is used judiciously for the initial visual input.
- Data Quality: The synthesized data distribution closely matches the complexity of the evaluation benchmarks, validating the effectiveness of the Iterative Injection pipeline.

5. Significance

Bridging the Gap: VSearcher successfully bridges the gap between static multimodal perception and dynamic, real-world information retrieval, enabling MLLMs to act as true autonomous agents in web environments.
Scalable Data Generation: The proposed data synthesis method offers a scalable solution to the lack of high-quality, long-horizon multimodal training data, which has been a bottleneck in the field.
Real-World Applicability: Because training and evaluation take place in real web environments with actual APIs, the results are more indicative of real-world utility than results obtained in simulated environments.
Future Direction: The work highlights that Reinforcement Learning, when combined with high-quality synthetic data and real-world tool integration, is a powerful paradigm for advancing multimodal agentic capabilities beyond simple text-based reasoning.