Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
To address the limitations of existing benchmarks in evaluating the visual and textual search capabilities of multimodal large language models, this paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated dataset of 2,000 instances designed for realistic assessment. Alongside the benchmark, it proposes a multi-round cropped-search workflow in which the model iteratively crops image regions and issues search queries over several rounds, substantially improving visual retrieval performance.
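To make the multi-round cropped-search idea concrete, the following is a minimal sketch of such a loop. The abstract does not specify the actual interface, so every name here (`propose_action`, `visual_search`, `Crop`, the stopping rule, and all placeholder return values) is an illustrative assumption, not VDR-Bench's real API: the model alternates between proposing a crop of the input image, searching with that cropped region, and folding the retrieved evidence into the next round until it decides to answer.

```python
# Minimal sketch of a multi-round cropped-search loop. All function names,
# data structures, and the stopping rule below are hypothetical stand-ins
# for the paper's actual workflow, which the abstract does not detail.
from dataclasses import dataclass, field


@dataclass
class Crop:
    """A rectangular region of the input image (pixel coordinates)."""
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class State:
    """Question plus evidence accumulated across search rounds."""
    question: str
    evidence: list = field(default_factory=list)


def propose_action(state: State):
    """Hypothetical MLLM call: decide to crop-and-search or to answer."""
    if len(state.evidence) < 3:  # placeholder stopping rule
        return "search", Crop(0, 0, 224, 224)  # placeholder region
    return "answer", None


def visual_search(crop: Crop) -> list:
    """Hypothetical retrieval tool: query a search engine with the crop."""
    return [f"result for region {crop}"]  # placeholder results


def answer(state: State) -> str:
    """Hypothetical final MLLM call conditioned on all gathered evidence."""
    return f"answer to {state.question!r} using {len(state.evidence)} snippets"


def multi_round_cropped_search(question: str, max_rounds: int = 5) -> str:
    """Run crop -> search -> update rounds until the model elects to answer."""
    state = State(question)
    for _ in range(max_rounds):
        action, crop = propose_action(state)
        if action == "answer":
            break
        # Retrieved results feed the next round's crop proposal.
        state.evidence.extend(visual_search(crop))
    return answer(state)


if __name__ == "__main__":
    print(multi_round_cropped_search("Which landmark appears in the photo?"))
```

The key design point the sketch illustrates is that cropping before searching narrows each query to a salient region rather than the full image, and the multi-round structure lets earlier retrieval results guide where the model looks next.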