cs.CV papers | Gist.Science

Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

This paper proposes two novel fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), which effectively integrate heterogeneous thermal and visual sensor data to significantly enhance unmanned aerial vehicle detection performance across diverse perspectives and resolutions.

Ishrat Jahan, Molla E Majid, M Murugappan, Muhammad E. H. Chowdhury, N. B. Prakash, Saad Bin Abul Kashem, Balamurugan Balusamy, Amith Khandakar | 2026-03-10 | 💻 cs

Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA

The paper introduces Video2LoRA, a lightweight and scalable framework that utilizes a hypernetwork to generate personalized LoRA weights from reference videos, enabling efficient, zero-shot semantic-controlled video generation with strong generalization and minimal storage requirements.

Zexi Wu, Qinghe Wang, Jing Dai, Baolu Li, Yiming Zhang, Yue Ma, Xu Jia, Hongming Xu | 2026-03-10 | 💻 cs

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

The paper proposes SAVE, a speech-aware video representation learning method that enhances video-text retrieval by introducing a dedicated speech branch and soft-ALBEF for early vision-audio alignment, achieving state-of-the-art performance across five benchmarks.

Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li | 2026-03-10 | 💻 cs

SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation

SRNeRV is a novel, parameter-efficient neural video representation framework that leverages scale self-similarity through a hybrid recursive sharing scheme to significantly reduce model size while achieving superior rate-distortion performance compared to traditional stacked multi-scale architectures.

Jia Wang, Jun Zhu, Xinfeng Zhang | 2026-03-10 | 💻 cs

GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model

GarmentPainter is an efficient framework that synthesizes high-fidelity, 3D-consistent garment textures in UV space by leveraging UV position maps for structural guidance and a type selection module for character-based control, all integrated into a standard diffusion model without architectural modifications.

Jinbo Wu, Xiaobo Gao, Xing Liu, Chen Zhao, Jialun Liu | 2026-03-10 | 💻 cs

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

This study benchmarks state-of-the-art deep learning models, including CNNs, Vision Transformers, and foundation models, on the MICCAI 2024 UWF4DR dataset of ultra-widefield images. The models are evaluated in both spatial and frequency domains on three tasks: image quality assessment, referable diabetic retinopathy detection, and diabetic macular edema identification. The study demonstrates that feature-level fusion and frequency-domain representations yield robust and explainable results.

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez | 2026-03-10 | 💻 cs

SiMO: Single-Modality-Operable Multimodal Collaborative Perception

SiMO introduces a novel collaborative perception framework that utilizes Length-Adaptive Multi-Modal Fusion (LAMMA) and a "Pretrain-Align-Fuse-RD" training strategy to overcome sensor failures and semantic mismatches, ensuring robust performance across all individual modalities while maintaining effective multimodal integration.

Jiageng Wen, Shengjie Zhao, Bing Li, Jiafeng Huang, Kenan Ye, Hao Deng | 2026-03-10 | 💻 cs

Topologically Stable Hough Transform

This paper proposes a topologically stable variant of the Hough transform for line detection in point clouds, which replaces the traditional discretized voting scheme with a continuous score function and utilizes persistent homology to identify candidate lines via an efficient algorithm.

Stefan Huber, Kristóf Huszár, Michael Kerber, Martin Uray | 2026-03-10 | 💻 cs
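
For context on the entry above: the traditional discretized voting scheme that the paper replaces can be sketched as follows. This is a minimal illustration of the classic Hough transform only, not the paper's continuous score function or its persistent-homology step. The function names, the bin counts, and the `rho_step` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def hough_votes(points, n_theta=180, rho_step=1.0):
    """Classic discretized Hough voting for lines in normal form.

    A line is parameterized as x*cos(theta) + y*sin(theta) = rho.
    Each point votes once per theta bin, for the rho bin it falls into.
    """
    votes = Counter()
    for x, y in points:
        for i in range(n_theta):
            theta = math.pi * i / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(i, round(rho / rho_step))] += 1
    return votes


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, vote_count) for the bin with the most votes."""
    votes = hough_votes(points, n_theta, rho_step)
    (i, j), count = votes.most_common(1)[0]
    return math.pi * i / n_theta, j * rho_step, count


# Ten points on the horizontal line y = 2.
theta, rho, count = strongest_line([(x, 2.0) for x in range(0, 100, 10)])
```

The detected line depends on the chosen bin sizes (`n_theta`, `rho_step`): points from the same line can be split across neighbouring bins or merged with unrelated points. That sensitivity to discretization is the kind of instability a continuous score function would avoid.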

DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving

This paper introduces DynamicVGGT, a unified feed-forward framework that extends static 3D perception to dynamic 4D scene reconstruction for autonomous driving by jointly predicting current and future point maps, utilizing a Motion-aware Temporal Attention module for temporal coherence, and employing a Dynamic 3D Gaussian Splatting Head to explicitly model point motion and refine geometry.

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, Jia Cai, Xiangyang Xue | 2026-03-10 | 💻 cs

WaDi: Weight Direction-aware Distillation for One-step Image Synthesis

The paper proposes WaDi, a novel one-step image synthesis framework that leverages the insight that weight direction changes are more critical than norm changes during distillation, introducing the parameter-efficient LoRaD adapter to achieve state-of-the-art performance with only 10% of trainable parameters.

Lei Wang, Yang Cheng, Senmao Li, Ge Wu, Yaxing Wang, Jian Yang | 2026-03-10 | 💻 cs
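
For context on the entry above: the distinction between a change in weight norm and a change in weight direction can be made concrete with a small sketch. The summary does not say how the paper measures these quantities, so the function below is a generic, assumed decomposition (relative change in Euclidean norm, and angle between the flattened weight vectors). It is not the paper's metric or the LoRaD adapter.

```python
import math


def norm_and_direction_change(w_before, w_after):
    """Compare two flattened weight vectors.

    Returns (relative_norm_change, angle_in_radians), where the angle
    measures how far the direction of the weights has rotated.
    """
    norm_before = math.sqrt(sum(v * v for v in w_before))
    norm_after = math.sqrt(sum(v * v for v in w_after))
    dot = sum(a * b for a, b in zip(w_before, w_after))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_before * norm_after)))
    return (norm_after - norm_before) / norm_before, math.acos(cosine)


# Pure rescaling: the norm doubles, the direction is unchanged.
scaled = norm_and_direction_change([3.0, 4.0], [6.0, 8.0])

# Pure rotation: the norm is unchanged, the direction turns by 90 degrees.
rotated = norm_and_direction_change([1.0, 0.0], [0.0, 1.0])
```

Here `scaled` has a relative norm change of 1.0 and zero angle, while `rotated` has zero norm change and an angle of pi/2, which separates the two kinds of change the summary contrasts.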

Event-based Motion & Appearance Fusion for 6D Object Pose Tracking

This paper proposes a learning-free method for 6D object pose tracking that fuses event-based optical flow for high-speed pose propagation with a template-based correction strategy, demonstrating superior performance over state-of-the-art algorithms in highly dynamic scenarios where traditional RGB-D cameras struggle with motion blur and low frame rates.

Zhichao Li, Chiara Bartolozzi, Lorenzo Natale, Arren Glover | 2026-03-10 | 💻 cs

Prototype-Guided Concept Erasure in Diffusion Models

This paper introduces a prototype-guided approach that leverages the intrinsic embedding geometry of diffusion models to identify and cluster concept representations, enabling the reliable erasure of broad, multi-faceted concepts while preserving overall image quality.

Yuze Cai, Jiahao Lu, Hongxiang Shi, Yichao Zhou, Hong Lu | 2026-03-10 | 💻 cs

OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations

The paper proposes OSCAR, a label-free method that utilizes coupled latent spaces and neural implicit representations to accurately reconstruct complete 3D vertebral anatomy from partial ultrasound images by implicitly modeling acoustic shadowing and signal transmission, achieving an 80% improvement in HD95 score over state-of-the-art techniques.

Magdalena Wysocki, Kadir Burak Buldu, Miruna-Alexandra Gafencu, Mohammad Farid Azampour, Nassir Navab | 2026-03-10 | 💻 cs

Novel Semantic Prompting for Zero-Shot Action Recognition

The paper introduces SP-CLIP, a lightweight zero-shot action recognition framework that significantly improves performance on fine-grained and compositional actions by augmenting frozen vision-language models with structured, multi-level semantic prompts without requiring any additional parameter training or visual encoder modifications.

Salman Iqbal, Waheed Rehman | 2026-03-10 | 💻 cs

Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

This paper proposes a retrieval-augmented framework for text-to-CT generation that leverages a 3D vision-language encoder to retrieve semantically related clinical cases and their anatomical annotations as structural proxies, thereby enhancing image fidelity and spatial controllability in a realistic inference setting without requiring ground-truth annotations.

Daniele Molino, Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi | 2026-03-10 | 💻 cs

Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

This paper introduces a concept-guided fine-tuning framework that enhances Vision Transformer robustness against distribution shifts by automatically generating and aligning model attention with fine-grained semantic concepts rather than spurious background correlations.

Yehonatan Elisha, Oren Barkan, Noam Koenigstein | 2026-03-10 | 🤖 cs.LG

HDR-NSFF: High Dynamic Range Neural Scene Flow Fields

HDR-NSFF is a novel framework that reconstructs dynamic high dynamic range (HDR) radiance fields from monocular videos by shifting from 2D pixel alignment to 4D spatio-temporal modeling, utilizing exposure-invariant motion estimation and generative priors to achieve state-of-the-art view synthesis while introducing the first real-world HDR-GoPro dataset for evaluation.

Shin Dong-Yeon, Kim Jun-Seong, Kwon Byung-Ki, Tae-Hyun Oh | 2026-03-10 | 💻 cs

SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

This paper introduces SlowBA, a novel backdoor attack against VLM-based GUI agents that utilizes a two-stage reward-level injection strategy and realistic pop-up triggers to induce excessive reasoning chains, thereby significantly increasing response latency while maintaining task accuracy and evading existing defenses.

Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu | 2026-03-10 | 💬 cs.CL

Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

This paper presents a large-scale comparative study using the Epic ReduAct dataset and over 3,000 human participants. It shows that humans rely on sparse, semantically critical cues for egocentric action recognition, whereas state-of-the-art AI models depend on contextual and low-level features and degrade more gradually, revealing fundamental divergences in how humans and machines process spatial and spatiotemporal information.

Sadegh Rahmaniboldaji, Filip Rybansky, Quoc C. Vuong, Anya C. Hurlbert, Frank Guerin, Andrew Gilbert | 2026-03-10 | 💻 cs

Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology

This paper introduces a label-free framework for evaluating Multiple Instance Learning heatmaps in histopathology, demonstrating through a large-scale benchmark that perturbation, LRP, and integrated gradients outperform attention-based methods, thereby enabling more reliable model validation and biological discovery.

Mina Jamshidi Idaji, Julius Hense, Tom Neuhäuser, Augustin Krause, Yanqing Luo, Oliver Eberle, Thomas Schnake, Laure Ciernik, Farnoush Rezaei Jafari, Reza Vahidimajd, Jonas Dippel, Christoph Walz, Frederick Klauschen, Andreas Mock, Klaus-Robert Müller | 2026-03-10 | 🤖 cs.LG
