Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

To address the limitations of existing benchmarks in evaluating multimodal large language models' visual and textual search capabilities, this paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated dataset of 2,000 instances designed for realistic assessment, alongside a proposed multi-round cropped-search workflow that effectively enhances visual retrieval performance.

Yu Zeng, Wenxuan Huang, Zhen Fang + 14 more · 2026-03-03 · cs.CL

Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

This paper benchmarks five state-of-the-art image-to-3D foundation models on medical and natural datasets, revealing that while all struggle with severe depth ambiguity in single-slice reconstruction, SAM 3D best preserves topological similarity to medical shapes, ultimately demonstrating that reliable medical 3D inference requires domain-specific adaptation beyond current zero-shot capabilities.

Yan Luo, Advaith Ravishankar, Serena Liu + 2 more · 2026-03-03 · cs

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

EchoTorrent is a novel multi-modal video generation framework that overcomes latency and temporal stability challenges through a fourfold design involving multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement to enable swift, sustained, and high-fidelity streaming inference with precise audio-lip synchronization.

Rang Meng, Yingjie Yin, Yuming Li + 1 more · 2026-03-03 · cs

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

The paper introduces Hepato-LLaVA, a specialized Multi-modal Large Language Model featuring a novel Sparse Topo-Pack Attention mechanism and the clinically validated HepatoPathoVQA dataset, to achieve state-of-the-art performance in hepatocellular carcinoma diagnosis and captioning on gigapixel whole slide images by effectively addressing resolution constraints and feature aggregation inefficiencies.

Yuxuan Yang, Zhonghao Yan, Yi Zhang + 6 more · 2026-03-03 · cs

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

This paper introduces Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that exploits the visual instruction-following capabilities of image-to-video models by disguising malicious text prompts as benign visual cues in reference images, achieving high attack success rates against state-of-the-art commercial systems.

Bowen Zheng, Yongli Xiang, Ziming Hong + 4 more · 2026-03-03 · cs

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

HorizonForge is a unified framework that enables photorealistic, controllable driving scene generation with arbitrary trajectories and vehicles by combining editable Gaussian-Mesh representations and noise-aware video diffusion, significantly outperforming existing methods in fidelity and consistency while introducing the HorizonSuite benchmark for standardized evaluation.

Yifan Wang, Francesco Pittaluga, Zaid Tasneem + 3 more · 2026-03-03 · cs

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion, enabling a unified, physics-consistent pipeline for joint shadow generation and relighting; a bridge-matching generative model trained on a newly curated large-scale benchmark addresses common artifacts such as floating shadows.

Shan Wang, Peixia Li, Chenchen Xu + 4 more · 2026-03-03 · cs