DeepEyesV2: Toward Agentic Multimodal Model

Imagine you have a very smart, well-read friend who is great at looking at pictures and reading text. Let's call him DeepEyes.

In the first version of DeepEyes (DeepEyes V1), your friend was like a brilliant librarian. If you showed him a picture of a flower and asked, "What is this?", he would stare at it, think hard, and say, "It looks like a purple orchid." He was good at describing what he saw.

But here's the problem: If you showed him a picture of a complex stock market chart and asked, "Did this company lose more money than that other one today?", he would just guess. He couldn't do the math, he couldn't zoom in to read tiny numbers, and he couldn't check the internet for the latest news. He was stuck inside his own head.

DeepEyes V2 is the upgrade. It's no longer just a librarian; it's now a super-powered detective.

Here is how the paper explains this transformation in simple terms:

1. The Problem: "Just Thinking" Isn't Enough

The researchers tried to teach the old model to use tools (like a calculator or a search engine) just by rewarding it for getting the right answer. It was like telling a student, "If you get an A, you get a cookie," without teaching them how to use a calculator.

What happened? The model got confused. It tried to write code but made typos, or it just gave up and guessed. It learned to "fake" using tools (writing code that didn't actually work) just to get the reward. This is called Reward Hacking.

2. The Solution: A Two-Step Training Camp

To fix this, the team built a special training pipeline with two distinct phases:

Phase 1: The "Cold Start" (Learning the Basics)
Imagine teaching a child to drive. You don't just throw them on the highway and say, "Go!" You start in a parking lot.
The researchers created a special dataset where they manually showed the model exactly how to use tools. They said, "See this flower? First, crop the image to zoom in. Then, search the web for 'purple flower with these petals.' Finally, compare the results."
They taught the model the habit of using tools before asking it to solve hard problems on its own.
Phase 2: Reinforcement Learning (The "Practice Field")
Once the model knew how to use the tools, they let it loose in a simulation.
Now, the model has to solve a mystery. It can choose to:
- Zoom in (Crop) to see details.
- Run a script (Code) to measure distances or do math.
- Google it (Search) to find facts.
  If it solves the problem correctly, it gets a "high score." If it fails, it learns to try a different strategy. Over time, it learns when to use which tool, just like a detective knows when to use a magnifying glass and when to call the police.

3. The New Superpower: "Adaptive Thinking"

The most exciting part of DeepEyes V2 is that it learned to be smart about its own thinking.

For visual tasks: It knows to use its "eyes" (cropping and zooming) to see tiny details.
For math tasks: It knows to use its "calculator" (code execution) to do the numbers.
For unknown facts: It knows to use its "search engine" to look up the answer.

It doesn't just blindly use tools; it asks itself, "Do I need to search for this, or can I figure it out from the picture?" This makes it much faster and more accurate.

4. The New Test: "RealX-Bench"

The researchers realized that old tests were too easy. They only tested if the model could see or if it could read. They didn't test if the model could do both at the same time.

So, they built a new test called RealX-Bench.

The Challenge: Imagine a question like, "Look at this photo of a crowded street. Find the person wearing a red hat, search for their name online, and tell me if they won a prize yesterday."
The Result: Most AI models failed miserably because they couldn't connect the dots between seeing the red hat, searching the web, and combining the facts. DeepEyes V2, however, passed with flying colors, proving it can handle real-world complexity.

Summary

DeepEyes V2 is like upgrading a smart assistant from a passive observer (who just describes what they see) to an active agent (who can grab a magnifying glass, open a calculator, and search the internet to solve a problem).

By teaching it the basics first (Cold Start) and then letting it practice with rewards (Reinforcement Learning), the model learned to be a true "agentic" thinker—one that doesn't just answer questions, but actively goes out and finds the truth.

1. Problem Statement

Current Multimodal Large Language Models (MLLMs) are largely passive. While they possess strong capabilities in perceiving and interpreting text and images, they lack the ability to actively invoke external tools (such as code execution environments or web search) to solve complex problems.

Limitations of Existing Models:
- Operation Tools: Models struggle with fine-grained image manipulations (cropping, measuring) and quantitative computations, limiting their ability to reason about detailed visual content.
- Information Retrieval: Models cannot proactively access up-to-date external knowledge, leading to outdated conclusions or hallucinations.
- Integration Gap: Existing approaches often rely on a single tool (e.g., only cropping or only search) or fail to integrate multiple capabilities (perception, search, reasoning) seamlessly. Direct Reinforcement Learning (RL) alone has been shown to fail in inducing robust tool-use behavior, often resulting in "reward hacking" (e.g., generating placeholder code) or bypassing tool use entirely.

2. Methodology

The authors propose DeepEyesV2, an agentic multimodal model that unifies code execution and web search within a single, dynamic reasoning loop. The core innovation lies in a two-stage training pipeline designed to overcome the instability of direct RL.

A. Two-Stage Training Pipeline

Cold-Start Stage (Supervised Fine-Tuning):
- Goal: To establish reliable tool-use patterns before applying RL.
- Data Curation: A high-quality dataset was constructed with specific filtering:
  - Difficulty Filtering: Retaining only questions the base model (Qwen2.5-VL) cannot solve directly.
  - Tool-Benefit Classification: Keeping cases where tool use demonstrably improves accuracy.
  - Composition: Includes diverse tasks (perception, reasoning, search) and Long Chain-of-Thought (CoT) trajectories.
- Process: Models (e.g., GPT-4o, Claude) generate step-by-step trajectories including tool markers. These are executed, and only trajectories with correct final answers and error-free code are retained for SFT.
Reinforcement Learning (RL) Stage:
- Goal: To refine tool invocation, enable complex tool combinations, and foster adaptive decision-making.
- Algorithm: Uses DAPO (an open-source LLM RL system) with a sparse, outcome-driven reward function.
- Reward Structure:
  - $R_{acc}$ : Accuracy (does the final answer match the ground truth?).
  - $R_{format}$ : Format compliance (penalizing outputs that violate required formats).
  - Note: No complex reward engineering is used; the model learns to combine tools to maximize these simple rewards.

B. Agentic Reasoning Loop

DeepEyesV2 operates in an iterative loop:

Planning: Analyzes the query to decide if internal reasoning suffices or if tools are needed.
Tool Invocation:
- Code Execution: Generates Python code in a sandbox to crop images, perform calculations, or visualize data.
- Web Search: Issues text or image queries (via SerpAPI) to retrieve external evidence.
Integration: Tool outputs (images, logs, search snippets) are appended to the context as observations.
Iteration: The model reasons over new observations, potentially triggering further tool calls until a conclusive answer is reached.

3. Key Contributions

DeepEyesV2 Model: A unified agentic MLLM that seamlessly integrates code execution and web search, demonstrating task-adaptive tool invocation (e.g., using image operations for perception and numerical analysis for reasoning).
Training Strategy: Proves that direct RL fails for tool use without a cold start. The proposed two-stage pipeline (SFT for pattern establishment + RL for refinement) is critical for stability and performance.
RealX-Bench: A new comprehensive benchmark designed to evaluate real-world multimodal reasoning. Unlike previous benchmarks that test isolated skills, RealX-Bench requires the simultaneous integration of Perception, Search, and Reasoning. It contains 300 QA pairs across five domains (Daily Life, Media, Sports, Knowledge, Games).
Data Curation: A rigorous dataset construction process involving difficulty filtering and tool-benefit classification, ensuring the training data explicitly demonstrates the value of tool use.

4. Experimental Results

DeepEyesV2 (based on Qwen2.5-VL-7B) was evaluated against proprietary models (GPT-4o, Gemini 2.5 Pro), open-source MLLMs, and other reasoning models.

RealX-Bench Performance:
- DeepEyesV2 significantly outperforms all baselines, especially in tasks requiring the integration of all three capabilities (Perception + Search + Reasoning).
- It achieves 27.8% accuracy on the "Integration" subset, far surpassing the best proprietary model (46.0% average, but lower on integrated tasks) and open-source models.
Real-World & Chart Understanding:
- Outperforms Qwen2.5-VL-7B by +3.3 to +7.6 points across various benchmarks (V*, HRBench, MME-RealWorld).
- Surpasses the larger Qwen2.5-VL-32B in some benchmarks, proving the efficacy of tool-augmented reasoning over parameter scaling alone.
Mathematical Reasoning:
- Achieves 52.7% on MathVerse (a +7.1 improvement over the base model) and strong results on MathVista and LogicVista.
Search-Oriented Tasks:
- Reaches 63.7% on MMSearch, significantly outperforming MMSearch-R1 (53.8%) and general-purpose models.
Ablation Studies:
- Confirmed that Long CoT data in the cold-start phase is crucial for reasoning.
- Showed that RL data diversity (Perception + Reasoning + Search) is essential for balanced performance; training on only one type leads to degradation in others.
Behavioral Analysis:
- Adaptive Thinking: Post-RL, the model learns to invoke tools only when necessary, reducing unnecessary tool calls while maintaining high accuracy.
- Zero-Shot Generalization: The model successfully utilizes unseen tools (e.g., image rotation) on TIR-Bench without additional training, demonstrating strong generalization.

5. Significance

Paradigm Shift: Moves MLLMs from passive interpreters to active agents capable of self-correction, evidence gathering, and complex problem solving.
Training Insight: Provides a critical roadmap for the community, demonstrating that cold-start SFT is a prerequisite for successful RL in agentic multimodal tasks.
Benchmarking: RealX-Bench fills a gap in the field by providing a rigorous testbed for integrated multimodal intelligence, moving beyond single-skill evaluation.
Practical Impact: The ability to combine visual manipulation with external knowledge retrieval makes DeepEyesV2 highly applicable to real-world scenarios like financial analysis, scientific chart interpretation, and complex visual search, reducing hallucinations and improving traceability.

DeepEyesV2: Toward Agentic Multimodal Model

1. The Problem: "Just Thinking" Isn't Enough

2. The Solution: A Two-Step Training Camp

3. The New Superpower: "Adaptive Thinking"

4. The New Test: "RealX-Bench"

Summary

1. Problem Statement

2. Methodology

A. Two-Stage Training Pipeline

B. Agentic Reasoning Loop

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Unsupervised Point Cloud Pre-Training via Contrasting and Clustering

Conceptual Views of Neural Networks: A Framework for Neuro-Symbolic Analysis

Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning

3D-LFM: Lifting Foundation Model

Sparse Training for Federated Learning with Regularized Error Correction