Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

To address the limitations of existing benchmarks in evaluating multimodal large language models' visual and textual search capabilities, this paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated dataset of 2,000 instances designed for realistic assessment, alongside a proposed multi-round cropped-search workflow that effectively enhances visual retrieval performance.

Yu Zeng, Wenxuan Huang, Zhen Fang + 14 more · 2026-03-03 · cs.CL

Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

This paper benchmarks five state-of-the-art image-to-3D foundation models on medical and natural datasets, revealing that while all struggle with severe depth ambiguity in single-slice reconstruction, SAM 3D best preserves topological similarity to medical shapes, ultimately demonstrating that reliable medical 3D inference requires domain-specific adaptation beyond current zero-shot capabilities.

Yan Luo, Advaith Ravishankar, Serena Liu + 2 more · 2026-03-03 · cs

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a novel multi-modal video generation framework that overcomes latency and temporal stability challenges through a fourfold design involving multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement to enable swift, sustained, and high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more · 2026-03-03 · cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized Multi-modal Large Language Model featuring a novel Sparse Topo-Pack Attention mechanism and the clinically validated HepatoPathoVQA dataset, to achieve state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images by effectively addressing resolution constraints and feature aggregation inefficiencies.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more · 2026-03-03 · cs

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of image-to-video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates against state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more · 2026-03-03 · cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework that enables photorealistic, controllable driving scene generation with arbitrary trajectories and vehicles by combining editable Gaussian-Mesh representations and noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency while introducing the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more · 2026-03-03 · cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion, enabling a unified, physics-consistent pipeline for joint shadow generation and relighting; a bridge-matching generative model trained on a newly curated large-scale benchmark addresses common artifacts such as floating shadows.

Shan Wang, Peixia Li, Chenchen Xu + 4 more · 2026-03-03 · cs