V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

The paper proposes V-Retrver, an evidence-driven agentic framework that enhances universal multimodal retrieval by enabling MLLMs to actively verify fine-grained visual evidence through interleaved reasoning and targeted tool use, achieving significant accuracy improvements via a specialized curriculum-based training strategy.

Dongyang Chen, Chaoyang Wang, Dezhao Su, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Kan

Published 2026-02-26

Imagine you are a detective trying to find a specific suspect in a lineup of 100 people based on a vague description: "The guy is wearing a blue shirt, but it's a lighter blue than the one in the photo, and he has extra buttons."

The Old Way (Current AI Models):
Most current AI models act like a detective who is blindfolded. They are handed a blurry, compressed photo of the suspect and a list of descriptions. They have to guess who matches based only on that blurry photo and their memory.

  • The Problem: If two suspects look very similar from a distance (both wear blue shirts), the AI gets confused. It starts guessing or "hallucinating," saying, "Oh, this guy must be the one because he looks kinda blue," even if he's actually wearing dark navy. It can't look closer because it doesn't have the ability to zoom in or ask for a better view.

The New Way (V-Retrver):
The paper introduces V-Retrver, which is like giving that detective a pair of high-powered binoculars and a magnifying glass, and telling them: "Don't just guess. Go look at the details yourself."

Here is how V-Retrver works, broken down into simple concepts:

1. The "Active Investigator" (Agentic Reasoning)

Instead of just staring at the screen and guessing, V-Retrver is an active agent. It doesn't just process the image once; it interacts with it.

  • The Analogy: Imagine you are looking at a map to find a hidden treasure. A normal AI looks at the whole map once and picks a spot. V-Retrver is like a treasure hunter who says, "Wait, that area looks foggy. Let me zoom in on that specific rock formation to see if there's an 'X'."
  • The Action: If the AI sees two candidates that look similar, it can select them to compare side-by-side or crop/zoom into a specific part of the image (like the buttons on a shirt or the pattern on a pillow) to get the truth.
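The zoom-and-compare idea above can be sketched in a few lines. This is an illustrative mock, not the paper's actual tool API: the tool names (`crop_zoom`, `compare`) and the tiny 2D "pixel grid" images are assumptions made for the example.

```python
# Illustrative sketch of an agent's visual tools (hypothetical names/API).

def crop_zoom(image, box):
    """Return the sub-region of a 2D pixel grid given (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def compare(patch_a, patch_b):
    """Side-by-side check: fraction of matching pixels between equal-size patches."""
    total = sum(len(row) for row in patch_a)
    same = sum(a == b for ra, rb in zip(patch_a, patch_b) for a, b in zip(ra, rb))
    return same / total

# Two candidates that look nearly identical except in one small region
# (think: the buttons on the shirt).
cand_a = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
cand_b = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]

# Globally the images agree on 8 of 9 pixels -- easy to confuse from a distance.
print(compare(cand_a, cand_b))  # 0.8888888888888888

# Zooming into the region that matters reveals they fully disagree.
print(compare(crop_zoom(cand_a, (1, 1, 2, 2)),
              crop_zoom(cand_b, (1, 1, 2, 2))))  # 0.0
```

The point of the sketch: a coarse, whole-image score says "almost identical," while a targeted crop gives the agent decisive evidence.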

2. The "Stop and Check" Habit (Multimodal Interleaved Reasoning)

The paper calls this "Multimodal Interleaved Reasoning." In plain English, it means the AI alternates between thinking and looking.

  • The Process:
    1. Think: "Candidate A has a white sofa, but the query asked for a mottled (spotted) pillow."
    2. Look: "I'm not sure if the pillow is spotted from this distance. Let me use my Zoom Tool to check the texture."
    3. Think (Again): "Ah, I see! The pillow is actually smooth, not spotted. Candidate A is out."
    4. Repeat: It keeps doing this loop until it has enough evidence to decide confidently.
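The think/look loop above can be sketched as a two-pass filter. Everything here is a simplified stand-in: `quick_score` plays the role of the model's coarse first impression, and `close_inspection` plays the role of the zoom-tool check; neither is the paper's actual implementation.

```python
# Sketch of interleaved reasoning: coarse "think" pass, then a fine "look" pass.
# Functions and the dict-based candidate format are illustrative assumptions.

def quick_score(candidate, query):
    """Think: does the candidate even contain the queried object?"""
    return query["object"] in candidate["objects"]

def close_inspection(candidate, query):
    """Look: zoom in and verify the fine-grained attribute (e.g. texture)."""
    return candidate["objects"].get(query["object"]) == query["attribute"]

def interleaved_retrieve(candidates, query):
    survivors = [c for c in candidates if quick_score(c, query)]      # Think
    return [c for c in survivors if close_inspection(c, query)]       # Look, think again

query = {"object": "pillow", "attribute": "spotted"}
candidates = [
    {"id": "A", "objects": {"pillow": "smooth", "sofa": "white"}},  # plausible but wrong
    {"id": "B", "objects": {"pillow": "spotted"}},                  # the true match
    {"id": "C", "objects": {"lamp": "brass"}},                      # irrelevant
]
print([c["id"] for c in interleaved_retrieve(candidates, query)])  # ['B']
```

Candidate A survives the coarse pass (it has a pillow) but is eliminated on close inspection, which is exactly the failure mode the "Stop and Check" habit is meant to catch.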

3. The "Training Camp" (Curriculum Learning)

You can't just give a detective binoculars and expect them to be perfect immediately. They need training. The authors trained V-Retrver in three stages, like a martial arts dojo:

  • Stage 1 (The Basics): They taught the AI how to hold the binoculars (how to use the tools) and how to speak in a structured way (Chain-of-Thought).
  • Stage 2 (The Filter): They let the AI practice, but if it made a silly mistake or used the tools unnecessarily, they said, "No, try again." They only kept the "good" attempts to teach the AI.
  • Stage 3 (The Reward System): They gave the AI a reward system. If it found the right answer and used the tools efficiently (not zooming in 50 times when once was enough), it got a high score. If it wasted time, it got a penalty. This taught the AI to be smart and efficient.
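The Stage 3 incentive can be sketched as a reward that prizes correctness but charges a small fee per tool call. The specific weights (1.0 for a correct answer, 0.1 per tool call) are illustrative assumptions, not the paper's actual reward function.

```python
# Sketch of a Stage-3-style reward: correctness dominates, with a small
# per-tool-call penalty so the agent learns to verify efficiently.
# Weights are illustrative, not taken from the paper.

def reward(is_correct: bool, num_tool_calls: int,
           tool_penalty: float = 0.1) -> float:
    base = 1.0 if is_correct else 0.0
    # Round to keep the toy example's floating-point output tidy.
    return round(base - tool_penalty * num_tool_calls, 3)

print(reward(True, 1))   # 0.9  -- right answer, one efficient check
print(reward(True, 8))   # 0.2  -- right answer, but wasteful tool use
print(reward(False, 3))  # -0.3 -- wrong answer and wasted effort
```

Under this scoring, "zoom once, decisively" beats "zoom 50 times," which is the behavior the reward system is described as teaching.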

Why Does This Matter?

In the real world, we often need to find things that are very similar but have tiny differences.

  • Shopping: "I want this exact dress, but in a smaller size."
  • Medical: "Find this specific type of skin lesion, but ignore the redness caused by sunburn."
  • Safety: "Find the car with the broken headlight, not the one with the dented bumper."

Old AI models often fail here because they rely on "static" images. V-Retrver succeeds because it treats retrieval like a conversation with the image, asking for proof before making a decision.

The Result

The paper shows that this "Active Detective" approach is much better. It found the right answers 23% more often than previous methods. It didn't just get lucky; it actually verified the evidence, making it much more reliable for complex, real-world tasks.

In short: V-Retrver is an AI that stopped guessing and started investigating. It uses tools to look closer, verify details, and only then makes a decision, just like a human expert would.
