cs.CV papers | Gist.Science

Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

The paper proposes SemVID, a training-free token pruning framework for Video Temporal Grounding that maintains high accuracy and efficiency by allocating token budgets based on query relevance and inter-frame variation while preserving critical evidence and cross-frame connectivity through the strategic selection of object, motion, and context tokens.

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan | 2026-03-09 | 💻 cs

Gabor Primitives for Accelerated Cardiac Cine MRI Reconstruction

This paper proposes a cardiac cine MRI reconstruction method using Gabor primitives, which combine Gaussian envelopes with complex exponentials to enable flexible k-space coverage and a low-rank spatiotemporal decomposition, achieving superior performance over compressed sensing, Gaussian primitives, and implicit neural representations while offering physically interpretable parameters.

Wenqi Huang, Veronika Spieker, Nil Stolt-Ansó, Natascha Niessen, Maik Dannecker, Sevgi Gokce Kafali, Sila Kurugol, Julia A. Schnabel, Daniel Rueckert | 2026-03-09 | 💻 cs
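
The summary above describes a Gabor primitive as a Gaussian envelope combined with a complex exponential. A minimal sketch of that building block, assuming an isotropic 2D Gaussian and a single carrier frequency (the function name, parameters, and parameterization are illustrative and not taken from the paper):

```python
import cmath
import math

def gabor_primitive(x, y, center, sigma, freq, amplitude=1.0):
    """Evaluate one 2D Gabor primitive at (x, y).

    Hypothetical parameterization: isotropic Gaussian envelope of width
    `sigma` centered at `center`, modulated by a complex exponential
    with spatial frequency `freq` (cycles per unit length).
    """
    dx, dy = x - center[0], y - center[1]
    envelope = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
    carrier = cmath.exp(2j * math.pi * (freq[0] * dx + freq[1] * dy))
    return amplitude * envelope * carrier

# At the center the envelope and the carrier both equal 1.
center_value = gabor_primitive(16, 16, (16, 16), 4.0, (0.1, 0.0))

# Away from the center the magnitude follows the Gaussian envelope only.
offset_value = gabor_primitive(20, 16, (16, 16), 4.0, (0.1, 0.0))
```

Modulating a Gaussian by a complex exponential shifts its Fourier transform to the carrier frequency, which is one way to read the summary's claim of flexible k-space coverage.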

OWL: A Novel Approach to Machine Perception During Motion

This paper introduces OWL, a novel analytical function that enables real-time, scaled 3D scene reconstruction and camera heading estimation from raw visual motion cues alone, without requiring prior knowledge of the environment or camera motion, thereby bridging theoretical perception concepts with practical applications in robotics and autonomous navigation.

Daniel Raviv, Juan D. Yepes | 2026-03-09 | 💻 cs

Longitudinal Lesion Inpainting in Brain MRI via 3D Region Aware Diffusion

This paper introduces a novel pseudo-3D longitudinal inpainting framework based on Denoising Diffusion Probabilistic Models and Region-Aware Diffusion that significantly outperforms state-of-the-art baselines in perceptual fidelity, temporal stability, and processing speed for removing evolving lesions from brain MRI scans.

Zahra Karimaghaloo, Dumitru Fetco, Haz-Edine Assemlal, Hassan Rivaz, Douglas L. Arnold | 2026-03-09 | 🤖 cs.AI

MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

The paper introduces MultiHaystack, a new benchmark comprising over 46,000 multimodal documents, images, and videos to evaluate the critical gap between retrieval and reasoning in multimodal large language models, revealing that current systems struggle significantly when required to locate evidence within large-scale, heterogeneous corpora rather than being provided with it directly.

Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng | 2026-03-09 | 💻 cs

Interpretable Perception and Reasoning for Audiovisual Geolocation

This paper introduces a novel framework for interpretable audiovisual geolocation that combines a high-quality global video benchmark (AVG) with a three-stage process of decomposing audio into semantic "acoustic atoms," multimodal reasoning via GRPO-finetuned MLLMs, and Riemannian Flow Matching to achieve significantly higher precision than unimodal baselines.

Yiyang Su, Xiaoming Liu | 2026-03-09 | 💻 cs

Any to Full: Prompting Depth Anything for Depth Completion in One Stage

Any2Full is a one-stage, domain-general framework that reformulates depth completion as a scale-prompting adaptation of pretrained monocular depth estimation models via a Scale-Aware Prompt Encoder, achieving superior robustness and efficiency by eliminating the computational overhead and distortions of traditional two-stage alignment methods.

Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang | 2026-03-09 | 💻 cs

Interpretable Motion Artifact Detection in Structural Brain MRI

This paper proposes a lightweight, interpretable framework that extends the Discriminative Histogram of Gradient Magnitude to 3D space, combining slice-level and volume-level features with a minimal-parameter classifier to achieve robust and accurate detection of motion artifacts in structural brain MRI across diverse acquisition sites.

Naveetha Nithianandam, Prabhjot Kaur, Anil Kumar Sao | 2026-03-09 | 💻 cs

Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

This paper introduces an automated, human-free pipeline using self-supervised Vision Transformers to convert the ImageNet training set into a high-quality multi-label dataset, which significantly improves both in-domain classification accuracy and downstream transfer performance compared to traditional single-label supervision.

Junyu Chen, Md Yousuf Harun, Christopher Kanan | 2026-03-09 | 💻 cs

From Phase Grounding to Intelligent Surgical Narratives

This paper proposes a CLIP-based multi-modal framework that automatically generates structured surgical timelines and narratives by aligning video frames with textual gesture descriptions, thereby eliminating the need for time-consuming manual annotation or vague post-operative reports.

Ethan Peterson, Huixin Zhan | 2026-03-09 | 💻 cs

Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression

Uni-LVC is a unified learned video compression framework that integrates intra and inter coding into a single model by conditioning inter-coding on temporal cues via a cross-attention module and a reliability-aware classifier, thereby achieving superior rate-distortion performance across low-delay and random-access scenarios while maintaining computational efficiency.

Yichi Zhang, Ruoyu Yang, Fengqing Zhu | 2026-03-09 | 💻 cs

Full Dynamic Range Sky-Modelling For Image Based Lighting

This paper introduces Icarus, a deep learning-based all-weather sky model that overcomes the limitations of existing methods in handling full dynamic range and class-imbalanced solar regions to generate photorealistic, user-controllable environment maps for accurate Image-Based Lighting.

Ian J. Maquignaz | 2026-03-09 | 🤖 cs.LG

Bridging Domains through Subspace-Aware Model Merging

This paper introduces SCORE, a novel model merging method that resolves singular subspace conflicts between domain-specific models by projecting them into a shared orthogonal basis, thereby significantly improving generalization to unseen domains compared to existing approaches.

Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila | 2026-03-09 | 🤖 cs.AI

Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

This paper introduces LayerBind, a training-free and plug-and-play method for Diffusion Transformers that achieves precise regional and occlusion control in text-to-image generation by modeling distinct object instances as separate layers during early denoising stages and fusing them through a semantic nursing mechanism.

Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu | 2026-03-09 | 💻 cs

Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

The paper introduces BM25-V, a two-stage image retrieval framework that leverages sparse visual-word activations from a Sparse Auto-Encoder combined with Okapi BM25 scoring to achieve near-dense accuracy with high interpretability and computational efficiency.

Donghoon Han, Eunhwan Park, Seunghyeon Seo | 2026-03-09 | 🤖 cs.AI
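
The summary above pairs sparse visual-word activations with Okapi BM25 scoring. As a rough illustration of the scoring half only, the sketch below applies the standard BM25 formula to images represented as bags of visual-word IDs. How BM25-V derives those IDs and term frequencies from the Sparse Auto-Encoder is not stated in the summary, so the integer bags, the function name, and the `k1`/`b` defaults here are assumptions:

```python
import math
from collections import Counter

def bm25_scores(query_words, docs, k1=1.2, b=0.75):
    """Score each document (a list of visual-word IDs) against a query.

    Standard Okapi BM25 with the non-negative IDF variant
    ln((N - n + 0.5) / (n + 0.5) + 1). Assumes at least one
    non-empty document.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization term shared by all words in this document.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for w in set(query_words):
            if tf[w] == 0:
                continue
            idf = math.log((n_docs - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5) + 1.0)
            score += idf * tf[w] * (k1 + 1.0) / (tf[w] + norm)
        scores.append(score)
    return scores

# Hypothetical visual-word bags for three images and one query image.
docs = [[3, 17, 17, 42], [5, 8, 13], [17, 99]]
scores = bm25_scores([17, 42], docs)
```

Because each score is a sum over shared visual words, the contribution of every word can be inspected individually, which fits the interpretability the summary mentions.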

Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction

This paper introduces a spectral diagnostic framework to reveal that preserving spectral structure, rather than merely enhancing spatial details, is the critical factor for achieving high-quality 3D reconstruction in 2D-to-3D pipelines, demonstrating that structural spectral consistency is the strongest predictor of novel view synthesis performance.

Ling Xiao, Yuliang Xiu, Yue Chen, Guoming Wang, Toshihiko Yamasaki | 2026-03-09 | 💻 cs

EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

This paper introduces EventGeM, a state-of-the-art, real-time Visual Place Recognition system for event cameras that fuses global ViT features, local MaxViT keypoints, and depth-based structural similarity to achieve superior localization accuracy across diverse lighting conditions and benchmark datasets.

Adam D. Hines, Gokul B. Nair, Nicolás Marticorena, Michael Milford, Tobias Fischer | 2026-03-09 | 💻 cs

Training-free Latent Inter-Frame Pruning with Attention Recovery

This paper introduces LIPAR, a training-free framework that accelerates video generation by pruning redundant latent patches and recovering attention values to maintain quality, thereby achieving a 1.45x throughput increase without compromising visual fidelity.

Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu | 2026-03-09 | 💻 cs

Margin and Consistency Supervision for Calibrated and Robust Vision Models

This paper introduces Margin and Consistency Supervision (MaCS), an architecture-agnostic regularization framework that combines a hinge-squared margin penalty and a consistency regularizer to simultaneously enhance the calibration, robustness, and generalization of deep vision models without requiring additional data or architectural changes.

Salim Khazem | 2026-03-09 | 🤖 cs.AI
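
The summary above mentions a hinge-squared margin penalty. One common way to write such a penalty is sketched here, assuming the margin is the gap between the true-class logit and the largest competing logit. The paper's exact definition is not given in the summary, and the consistency regularizer is left out because the summary does not specify it; the function name and the default `margin` are illustrative:

```python
def hinge_squared_margin(logits, label, margin=1.0):
    """Squared hinge penalty on the logit margin of one sample.

    Assumed form: max(0, margin - (z_true - max_other_z)) ** 2,
    which is zero once the true class leads by at least `margin`.
    Requires at least two classes.
    """
    correct = logits[label]
    runner_up = max(v for i, v in enumerate(logits) if i != label)
    gap = correct - runner_up
    return max(0.0, margin - gap) ** 2

# A confidently separated sample incurs no penalty.
easy = hinge_squared_margin([3.0, 0.5, 1.0], label=0)

# A misclassified sample is penalized quadratically in the shortfall.
hard = hinge_squared_margin([0.0, 2.0], label=0)
```

Squaring the hinge keeps the penalty smooth at the margin boundary and makes it grow faster for samples that are far on the wrong side.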

Architectural Unification for Polarimetric Imaging Across Multiple Degradations

This paper proposes a unified, single-stage architectural framework that jointly processes image and Stokes domains to achieve state-of-the-art performance in recovering polarimetric parameters from various degraded observations, including low-light noise, motion blur, and mosaicing artifacts, while ensuring physical consistency and avoiding error accumulation.

Chu Zhou, Yufei Han, Junda Liao, Linrui Dai, Wangze Xu, Art Subpa-Asa, Heng Guo, Boxin Shi, Imari Sato | 2026-03-09 | 💻 cs
