Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Vision-DeepResearch introduces a novel multimodal deep-research paradigm that leverages multi-turn, multi-entity, and multi-scale visual and textual search, trained via cold-start supervision and reinforcement learning, to substantially outperform both existing models and strong closed-source foundation models on complex, noise-heavy real-world questions.

Wenxuan Huang, Yu Zeng, Qiuchen Wang + 13 more · 2026-03-03 · 🤖 cs.AI

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

This paper introduces CaCoVID, a reinforcement learning-based token compression framework for video large language models that optimizes token selection by explicitly maximizing their contribution to correct predictions rather than relying on attention scores, thereby significantly reducing computational overhead while maintaining performance.

Yinchao Ma, Qiang Zhou, Zhibin Wang + 4 more · 2026-03-03 · 🤖 cs.AI

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

To address the limitations of existing benchmarks in evaluating multimodal large language models' visual and textual search capabilities, this paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated dataset of 2,000 instances designed for realistic assessment, alongside a proposed multi-round cropped-search workflow that effectively enhances visual retrieval performance.

Yu Zeng, Wenxuan Huang, Zhen Fang + 14 more · 2026-03-03 · 💬 cs.CL

Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

This paper benchmarks five state-of-the-art image-to-3D foundation models on medical and natural datasets, revealing that while all struggle with severe depth ambiguity in single-slice reconstruction, SAM 3D best preserves topological similarity to medical shapes, ultimately demonstrating that reliable medical 3D inference requires domain-specific adaptation beyond current zero-shot capabilities.

Yan Luo, Advaith Ravishankar, Serena Liu + 2 more · 2026-03-03 · 💻 cs

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a novel multi-modal video generation framework that overcomes latency and temporal-stability challenges through a fourfold design — no, rather through four components: multi-teacher training, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement — enabling swift, sustained, high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more · 2026-03-03 · 💻 cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized multi-modal large language model featuring a novel Sparse Topo-Pack Attention mechanism and the clinically validated HepatoPathoVQA dataset, which achieves state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images by addressing resolution constraints and feature-aggregation inefficiencies.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more · 2026-03-03 · 💻 cs