cs.CV papers | Gist.Science

Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

This paper proposes VeilGen, an unsupervised generative model that learns latent transmission and glare maps to synthesize realistic veiling glare datasets, and DeVeiler, a restoration network that leverages these maps to effectively remove veiling glare from simplified optical systems.

Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang | 2026-03-09 | 🔬 physics.optics

UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

This paper introduces the Unified Attention-Mamba (UAM) backbone, a flexible architecture that seamlessly integrates Attention and Mamba modules without manual tuning, achieving state-of-the-art performance in both tumor cell classification and image segmentation tasks on public benchmarks.

Taixi Chen, Jingyun Chen, Nancy Guo | 2026-03-09 | 💻 cs

EgoCogNav: Cognition-aware Human Egocentric Navigation

The paper introduces EgoCogNav, a multimodal framework that predicts perceived path uncertainty to jointly forecast egocentric trajectories and head motion, supported by the new Cognition-aware Egocentric Navigation (CEN) dataset to better model human cognitive factors in navigation.

Zhiwen Qiu, Ziang Liu, Wenqian Niu, Tapomayukh Bhattacharjee, Saleh Kalantari | 2026-03-09 | 🤖 cs.LG

SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

SyncMV4D is a framework that overcomes the limitations of single-view and data-hungry 3D methods by pairing a Multi-view Joint Diffusion model with a Diffusion Points Aligner. Through a closed-loop coupling of visual appearance and dynamic geometry, it simultaneously generates synchronized, realistic multi-view hand-object interaction videos and globally aligned 4D metric motions.

Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu | 2026-03-09 | 💻 cs

Reversible Inversion for Training-Free Exemplar-guided Image Editing

This paper introduces ReInversion, a training-free exemplar-guided image editing method that employs a two-stage reversible denoising process and a Mask-Guided Selective Denoising strategy to achieve state-of-the-art performance with minimal computational overhead.

Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song | 2026-03-09 | 💻 cs

A method for tissue-mask supported whole-body image registration in the UK Biobank

This paper presents a sex-stratified whole-body MR image registration method for the UK Biobank that leverages subcutaneous adipose tissue and muscle masks to significantly outperform existing intensity-based and deep learning approaches in anatomical alignment and correlation analysis accuracy.

Yasemin Utkueri, Elin Lundström, Håkan Ahlström, Johan Öfverstedt, Joel Kullberg | 2026-03-09 | 💻 cs

UniTS: Unified Spatio-Temporal Generative Model for Remote Sensing

This paper introduces UniTS, a unified spatio-temporal generative model based on flow matching and diffusion transformers that integrates tasks like cloud removal, change detection, and forecasting into a single framework, significantly outperforming specialized models under challenging conditions.

Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia | 2026-03-09 | 💻 cs

Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

This paper proposes a point cloud-based framework for event-driven human pose estimation that leverages spatiotemporal properties through novel temporal slicing and sequencing modules alongside an edge-enhanced representation, achieving improved accuracy and efficiency on the DHP19 dataset without converting event streams into dense frames.

Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Pengfei Ye, Haodong Chen, Yuk Ying Chung, Qiang Qu | 2026-03-09 | 🤖 cs.AI

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

DFIR-DETR is a transformer-based small object detector that addresses key limitations of standard architectures with three components: Dynamic Content-Feature Aggregation for adaptive attention, a norm-preserving Dynamic Feature Pyramid Network for detail recovery, and a Frequency-domain Iterative Refinement module that preserves high-frequency boundaries. It achieves state-of-the-art performance on the NEU-DET and VisDrone benchmarks with high efficiency.

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li | 2026-03-09 | 🤖 cs.LG

Fast-BEV++: Fast by Algorithm, Deployable by Design

Fast-BEV++ is a vision-only Bird's-Eye-View perception framework that resolves the trade-off between accuracy and deployment efficiency by employing a hardware-oriented, kernel-free architecture to achieve a new state-of-the-art 0.488 NDS on nuScenes while delivering real-time inference at over 134 FPS.

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao | 2026-03-09 | 💻 cs

Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts

This paper addresses the degradation of existing subset-based visual explanation methods under out-of-distribution conditions by introducing a training-free framework that integrates layer-wise uncertainty estimation with submodular optimization to generate robust, diverse, and informative attributions.

Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan | 2026-03-09 | 🤖 cs.LG

Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement

Photo3D is a framework that advances photorealistic 3D generation by leveraging GPT-4o-Image data within a structure-aligned multi-view synthesis pipeline to create detail-enhanced datasets, thereby enabling realistic texture refinement while preserving geometric consistency across diverse 3D-native generators.

Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang | 2026-03-09 | 💻 cs

Modular Neural Image Signal Processing

This paper introduces a modular, fully learning-based neural image signal processing (ISP) framework that offers unprecedented control over intermediate rendering stages to enhance scalability, generalization, and flexibility, enabling a user-interactive photo-editing tool capable of unlimited post-editable re-rendering with competitive performance across multiple test sets.

Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown | 2026-03-09 | 💻 cs

A Novel Patch-Based TDA Approach for Computed Tomography Imaging

This paper introduces a novel patch-based Topological Data Analysis approach for 3D CT imaging that significantly outperforms traditional 3D cubical complex methods and radiomic features in both classification accuracy and computational efficiency, accompanied by the release of a Python package to facilitate its adoption.

Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Mohammad Hamghalam, Natalie Gangai, Richard K. G. Do, Alice C. Wei, Amber L. Simpson | 2026-03-09 | 🤖 cs.LG

Towards Scalable Pre-training of Visual Tokenizers for Generation

This paper introduces VTP, a unified pre-training framework that optimizes visual tokenizers through joint image-text contrastive, self-supervised, and reconstruction losses to shift the latent space focus from low-level pixel accuracy to high-level semantics, thereby solving the "pre-training scaling problem" and enabling significantly improved, compute-efficient generative performance.

Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang | 2026-03-09 | 💻 cs

CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

This paper demonstrates that Cross-Attention over Self-Attention (CASA) is a highly competitive and efficient alternative to token insertion for vision-language models, offering near-constant memory costs and low latency that make it particularly suitable for long multi-image conversations and real-time video applications.

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez | 2026-03-09 | 🤖 cs.AI

Pretraining Frame Preservation for Lightweight Autoregressive Video History Embedding

This paper introduces a lightweight, pretrained history encoder that efficiently compresses long video histories into short embeddings using a frame query objective, enabling content-consistent autoregressive video generation under limited compute and memory constraints.

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala | 2026-03-09 | 💻 cs

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

This paper introduces Spatial4D-Bench, a large-scale, multi-task benchmark comprising approximately 40,000 question-answer pairs across 18 tasks and six cognitive categories, designed to comprehensively evaluate and reveal the current limitations of Multimodal Large Language Models in achieving human-level 4D spatial intelligence.

Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu | 2026-03-09 | 💻 cs

Bayesian Monocular Depth Refinement via Neural Radiance Fields

The paper proposes MDENeRF, an iterative Bayesian framework that refines smooth monocular depth estimates by fusing them with high-frequency geometric details and uncertainty derived from Neural Radiance Fields, thereby enhancing scene understanding for applications like autonomous navigation.

Arun Muthukkumar | 2026-03-09 | 🤖 cs.LG

FlyPose: Towards Robust Human Pose Estimation From Aerial Views

The paper introduces FlyPose, a lightweight, real-time human pose estimation pipeline optimized for aerial UAV views that achieves significant accuracy improvements across multiple datasets and is successfully deployed on-board a quadrotor, accompanied by the release of a new challenging dataset called FlyPose-104.

Hassaan Farooq, Marvin Brenner, Peter Stütz | 2026-03-09 | 💻 cs
