MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

The paper introduces MM-DeepResearch, a multimodal deep research agent that tackles three challenges: data scarcity, trajectory generation, and training cost. It does so with Hyper-Search for data synthesis, DR-TTS for specialized tool optimization and trajectory planning, and an offline search engine for cost-effective reinforcement learning.

Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang

Published 2026-03-03

Imagine you have a brilliant, well-read librarian named MM-DeepResearch. This librarian is incredibly smart and can see pictures, read books, and understand complex stories. However, like any human, they have a limit: they only know what they've read in their training books up to a certain date. If you ask them about a breaking news story, a specific building's architect, or a new gadget released yesterday, they might guess or hallucinate because that information isn't in their "memory."

The paper introduces a way to turn this librarian into a Super-Researcher who doesn't just rely on memory but knows exactly how to go out, find the right books, cross-reference them, and synthesize a perfect answer.

Here is how they did it, explained through three simple analogies:

1. The Problem: The "Empty Library" and the "Expensive Phone Call"

The researchers noticed three big problems when trying to teach AI to do deep research:

  • No Practice Questions: There weren't enough "hard" test questions that required looking up information in both pictures and text. Most existing questions were too easy.
  • No Good Maps: The AI didn't know the best "path" to take. Should it look at a picture first? Then a text? Then another picture? It was getting lost.
  • Too Expensive: To teach the AI, you usually have to let it use real internet search engines (like Google) while it learns. But every search costs money. If you let an AI search thousands of times to learn, it costs thousands of dollars. It's like trying to teach a kid to drive by letting them burn through a tank of gas every day.

2. The Solution: Three Magic Tools

To fix this, the team built three special tools to train their AI without spending a fortune.

Tool A: Hyper-Search (The "Web of Connections")

The Analogy: Imagine a giant spiderweb where every node is a piece of information (a photo, a paragraph of text, a video).
How it works: Instead of just grabbing a random picture and asking a question, the researchers built a system that creates a hypergraph (a super-web).

  • They start with a picture (e.g., a photo of a castle).
  • The system automatically finds related text (history of the castle), related images (other angles), and related facts (who built it).
  • It then generates a tricky question that requires jumping across this web to answer. You can't answer it by just looking at the original photo; you have to follow the "threads" of the web.
  • Result: They created a massive library of "hard" practice questions that force the AI to learn how to hunt for information.
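The "web of connections" idea can be sketched in a few lines of code. This is an illustrative toy, not the paper's actual pipeline: nodes hold pieces of information (images or text), and a hyperedge groups the nodes that must all be visited to answer one question. The class and method names here are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    modality: str   # "image" or "text"
    content: str

class HyperGraph:
    """Toy hypergraph: hyperedges link groups of related nodes."""

    def __init__(self):
        self.nodes = {}       # node_id -> Node
        self.hyperedges = []  # each edge is a frozenset of node ids

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_hyperedge(self, node_ids):
        self.hyperedges.append(frozenset(node_ids))

    def multi_hop_question(self, edge_index):
        """Compose a question that can only be answered by visiting
        every node on one hyperedge, i.e. jumping across the web."""
        edge = self.hyperedges[edge_index]
        hops = sorted(self.nodes[i].content for i in edge)
        return "Combine these clues to answer: " + " -> ".join(hops)

# The castle example from the text: a photo plus two related facts,
# tied together by one hyperedge.
g = HyperGraph()
g.add_node(Node("img1", "image", "photo of a castle"))
g.add_node(Node("txt1", "text", "history of the castle"))
g.add_node(Node("txt2", "text", "the architect who built it"))
g.add_hyperedge({"img1", "txt1", "txt2"})
question = g.multi_hop_question(0)
```

A real system would generate natural-language questions from the linked facts; the point here is only that the question template forces a traversal of all three nodes, so no single node suffices.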

Tool B: DR-TTS (The "Specialized Training Camp")

The Analogy: Imagine trying to teach a student to be a master detective. If you throw them into a complex case immediately, they might fail. Instead, you train them on specific skills first.
How it works:

  • Decompose: They broke the "detective work" down. One expert learns only how to search for images. Another learns only how to search for text. Another learns how to ask a smart question to an AI expert.
  • Recompose: Once these "mini-experts" are good at their specific jobs, they are put back together. They work as a team to solve a complex case using a "Tree Search" method (like trying different paths in a maze to find the exit).
  • Result: This creates a perfect "map" (a trajectory) showing the AI exactly how to solve a problem step-by-step, using the right tool at the right time.
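The "recompose" step can be illustrated with a tiny tree search over tool orderings. This is a hedged sketch under made-up assumptions, not DR-TTS itself: the scorer below is a stand-in (in the paper, a trajectory would be judged by whether it actually answers the question), and here it simply rewards looking at the image before reading the text, as in the castle example, plus using diverse tools.

```python
TOOLS = ["image_search", "text_search", "ask_expert"]

def score(trajectory):
    """Stand-in scorer: reward image-before-text order and tool diversity."""
    s = 0
    if "image_search" in trajectory and "text_search" in trajectory:
        if trajectory.index("image_search") < trajectory.index("text_search"):
            s += 2  # identify the building first, then find its history
    s += len(set(trajectory))  # reward using distinct tools
    return s

def tree_search(depth, trajectory=()):
    """Try every tool choice at every step (every path in the maze)
    and return the best-scoring trajectory."""
    if depth == 0:
        return score(trajectory), list(trajectory)
    best = (float("-inf"), [])
    for tool in TOOLS:
        best = max(best, tree_search(depth - 1, trajectory + (tool,)))
    return best

best_score, best_path = tree_search(depth=3)
```

Exhaustive search is fine at depth 3 (27 paths); real trajectory search prunes or samples, but the output is the same kind of object: a step-by-step "map" of which tool to call when.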

Tool C: The Offline Search Engine (The "Simulated World")

The Analogy: Instead of letting a pilot fly a real plane to learn how to land (which is dangerous and expensive), you put them in a flight simulator.
How it works:

  • The researchers built a giant, offline database of the internet (a "simulated web") containing millions of pre-collected images and text snippets.
  • When the AI needs to "search," it searches this local database instead of the real, expensive internet.
  • Result: The AI can practice searching millions of times for free and instantly. It learns the behavior of searching without the cost.
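The "simulated web" reduces, in essence, to a local index queried in place of a paid search API. The sketch below is a minimal illustration under assumed names and a toy corpus (the paper's engine indexes millions of pre-collected images and snippets; this one indexes three sentences by keyword).

```python
from collections import defaultdict

class OfflineSearchEngine:
    """Toy local search: keyword inverted index over a fixed corpus."""

    def __init__(self, documents):
        self.documents = documents
        self.index = defaultdict(set)  # word -> set of doc ids
        for doc_id, text in documents.items():
            for word in text.lower().split():
                self.index[word].add(doc_id)

    def search(self, query, top_k=3):
        """Rank documents by how many query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for doc_id in self.index.get(word, ()):
                scores[doc_id] += 1
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [(d, self.documents[d]) for d in ranked[:top_k]]

# The museum/opening-hours example from the text, in miniature.
corpus = {
    "doc1": "Neuschwanstein castle opening hours and tickets",
    "doc2": "history of the castle and its architect",
    "doc3": "best hiking trails in Bavaria",
}
engine = OfflineSearchEngine(corpus)
hits = engine.search("castle opening hours")
```

Every call to `search` is free and instant, which is the whole trick: the agent can roll out thousands of training episodes against this local index before it ever touches a real, metered search API.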

3. The Result: The Super-Researcher

By combining these three tools, they trained MM-DeepResearch.

  • It learns to plan: It doesn't just guess; it thinks, "I need to find the architect. First, I'll use the image tool to identify the building. Then, I'll use the text tool to find the history."
  • It handles complexity: It can look at a picture of a museum, realize it needs to know the opening hours, search for the museum's name, find the website, and extract the hours—all in one go.
  • It beats the competition: When tested on hard benchmarks (like identifying facts about art, sports, or history), this new AI outperformed other models, even those that were trained using the expensive, real-internet method.

The Big Takeaway

This paper is like a blueprint for building a self-driving car for information.
Instead of paying a fortune to drive the car on real highways to teach it how to navigate, the researchers built:

  1. A map generator (Hyper-Search) to create difficult routes.
  2. A driving school (DR-TTS) to teach specific skills like braking and turning.
  3. A simulator (Offline Engine) to let the car practice for free.

The result is a smart agent that can do deep, complex research on its own, saving money and time while getting smarter.