MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

The paper introduces MM-DeepResearch, a multimodal deep research agent that tackles three challenges: data scarcity, trajectory generation, and training cost. It does so with Hyper-Search for data synthesis, DR-TTS for specialized tool optimization and trajectory planning, and an offline search engine for cost-effective reinforcement learning.

Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang

Published 2026-03-03

Imagine you have a brilliant, well-read librarian named MM-DeepResearch. This librarian is incredibly smart and can see pictures, read books, and understand complex stories. However, like any human, they have a limit: they only know what they've read in their training books up to a certain date. If you ask them about a breaking news story, a specific building's architect, or a new gadget released yesterday, they might guess or hallucinate because that information isn't in their "memory."

The paper introduces a way to turn this librarian into a Super-Researcher who doesn't just rely on memory but knows exactly how to go out, find the right books, cross-reference them, and synthesize a perfect answer.

Here is how they did it, explained through three simple analogies:

1. The Problem: The "Empty Library" and the "Expensive Phone Call"

The researchers noticed three big problems when trying to teach AI to do deep research:

  • No Practice Questions: There weren't enough "hard" test questions that required looking up information in both pictures and text. Most existing questions were too easy.
  • No Good Maps: The AI didn't know the best "path" to take. Should it look at a picture first? Then a text? Then another picture? It was getting lost.
  • Too Expensive: To teach the AI, you usually have to let it use real internet search engines (like Google) while it learns. But every search costs money. If you let an AI search thousands of times to learn, it costs thousands of dollars. It's like trying to teach a kid to drive by letting them burn through a tank of gas every day.

2. The Solution: Three Magic Tools

To fix this, the team built three special tools to train their AI without spending a fortune.

Tool A: Hyper-Search (The "Web of Connections")

The Analogy: Imagine a giant spiderweb where every node is a piece of information (a photo, a paragraph of text, a video).
How it works: Instead of just grabbing a random picture and asking a question, the researchers built a system that creates a hypergraph (a super-web).

  • They start with a picture (e.g., a photo of a castle).
  • The system automatically finds related text (history of the castle), related images (other angles), and related facts (who built it).
  • It then generates a tricky question that requires jumping across this web to answer. You can't answer it by just looking at the original photo; you have to follow the "threads" of the web.
  • Result: They created a massive library of "hard" practice questions that force the AI to learn how to hunt for information.
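The "web of connections" idea can be sketched in a few lines of code. This is an illustrative toy, not the paper's actual pipeline: nodes hold pieces of information (images or text), and a hyperedge groups the nodes that must all be visited to answer one question. The class and method names here are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    modality: str   # "image" or "text"
    content: str

class HyperGraph:
    """Toy hypergraph: hyperedges link groups of related nodes."""

    def __init__(self):
        self.nodes = {}       # node_id -> Node
        self.hyperedges = []  # each edge is a frozenset of node ids

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_hyperedge(self, node_ids):
        self.hyperedges.append(frozenset(node_ids))

    def multi_hop_question(self, edge_index):
        """Compose a question that can only be answered by visiting
        every node on one hyperedge, i.e. jumping across the web."""
        edge = self.hyperedges[edge_index]
        hops = sorted(self.nodes[i].content for i in edge)
        return "Combine these clues to answer: " + " -> ".join(hops)

# The castle example from the text: a photo plus two related facts,
# tied together by one hyperedge.
g = HyperGraph()
g.add_node(Node("img1", "image", "photo of a castle"))
g.add_node(Node("txt1", "text", "history of the castle"))
g.add_node(Node("txt2", "text", "the architect who built it"))
g.add_hyperedge({"img1", "txt1", "txt2"})
question = g.multi_hop_question(0)
```

A real system would generate natural-language questions from the linked facts; the point here is only that the question template forces a traversal of all three nodes, so no single node suffices.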

Tool B: DR-TTS (The "Specialized Training Camp")

The Analogy: Imagine trying to teach a student to be a master detective. If you throw them into a complex case immediately, they might fail. Instead, you train them on specific skills first.
How it works:

  • Decompose: They broke the "detective work" down. One expert learns only how to search for images. Another learns only how to search for text. Another learns how to ask a smart question to an AI expert.
  • Recompose: Once these "mini-experts" are good at their specific jobs, they are put back together. They work as a team to solve a complex case using a "Tree Search" method (like trying different paths in a maze to find the exit).
  • Result: This creates a perfect "map" (a trajectory) showing the AI exactly how to solve a problem step-by-step, using the right tool at the right time.
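The "recompose" step can be illustrated with a tiny tree search over tool orderings. This is a hedged sketch under made-up assumptions, not DR-TTS itself: the scorer below is a stand-in (in the paper, a trajectory would be judged by whether it actually answers the question), and here it simply rewards looking at the image before reading the text, as in the castle example, plus using diverse tools.

```python
TOOLS = ["image_search", "text_search", "ask_expert"]

def score(trajectory):
    """Stand-in scorer: reward image-before-text order and tool diversity."""
    s = 0
    if "image_search" in trajectory and "text_search" in trajectory:
        if trajectory.index("image_search") < trajectory.index("text_search"):
            s += 2  # identify the building first, then find its history
    s += len(set(trajectory))  # reward using distinct tools
    return s

def tree_search(depth, trajectory=()):
    """Try every tool choice at every step (every path in the maze)
    and return the best-scoring trajectory."""
    if depth == 0:
        return score(trajectory), list(trajectory)
    best = (float("-inf"), [])
    for tool in TOOLS:
        best = max(best, tree_search(depth - 1, trajectory + (tool,)))
    return best

best_score, best_path = tree_search(depth=3)
```

Exhaustive search is fine at depth 3 (27 paths); real trajectory search prunes or samples, but the output is the same kind of object: a step-by-step "map" of which tool to call when.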

Tool C: The Offline Search Engine (The "Simulated World")

The Analogy: Instead of letting a pilot fly a real plane to learn how to land (which is dangerous and expensive), you put them in a flight simulator.
How it works:

  • The researchers built a giant, offline database of the internet (a "simulated web") containing millions of pre-collected images and text snippets.
  • When the AI needs to "search," it searches this local database instead of the real, expensive internet.
  • Result: The AI can practice searching millions of times for free and instantly. It learns the behavior of searching without the cost.
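The "simulated web" reduces, in essence, to a local index queried in place of a paid search API. The sketch below is a minimal illustration under assumed names and a toy corpus (the paper's engine indexes millions of pre-collected images and snippets; this one indexes three sentences by keyword).

```python
from collections import defaultdict

class OfflineSearchEngine:
    """Toy local search: keyword inverted index over a fixed corpus."""

    def __init__(self, documents):
        self.documents = documents
        self.index = defaultdict(set)  # word -> set of doc ids
        for doc_id, text in documents.items():
            for word in text.lower().split():
                self.index[word].add(doc_id)

    def search(self, query, top_k=3):
        """Rank documents by how many query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for doc_id in self.index.get(word, ()):
                scores[doc_id] += 1
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return [(d, self.documents[d]) for d in ranked[:top_k]]

# The museum/opening-hours example from the text, in miniature.
corpus = {
    "doc1": "Neuschwanstein castle opening hours and tickets",
    "doc2": "history of the castle and its architect",
    "doc3": "best hiking trails in Bavaria",
}
engine = OfflineSearchEngine(corpus)
hits = engine.search("castle opening hours")
```

Every call to `search` is free and instant, which is the whole trick: the agent can roll out thousands of training episodes against this local index before it ever touches a real, metered search API.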

3. The Result: The Super-Researcher

By combining these three tools, they trained MM-DeepResearch.

  • It learns to plan: It doesn't just guess; it thinks, "I need to find the architect. First, I'll use the image tool to identify the building. Then, I'll use the text tool to find the history."
  • It handles complexity: It can look at a picture of a museum, realize it needs to know the opening hours, search for the museum's name, find the website, and extract the hours—all in one go.
  • It beats the competition: When tested on hard benchmarks (like identifying facts about art, sports, or history), this new AI outperformed other models, even those that were trained using the expensive, real-internet method.

The Big Takeaway

This paper is like a blueprint for building a self-driving car for information.
Instead of paying a fortune to drive the car on real highways to teach it how to navigate, the researchers built:

  1. A map generator (Hyper-Search) to create difficult routes.
  2. A driving school (DR-TTS) to teach specific skills like braking and turning.
  3. A simulator (Offline Engine) to let the car practice for free.

The result is a smart agent that can do deep, complex research on its own, saving money and time while getting smarter.