SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

The paper proposes SurgFed, a language-guided multi-task federated learning framework that utilizes Language-guided Channel Selection and Language-guided Hyper Aggregation to overcome tissue and task diversity challenges, thereby improving surgical video segmentation and depth estimation across heterogeneous clinical environments.

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming Jin · 2026-03-11 · 💻 cs

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

This paper investigates the reliability of Vision-Language Models (VLMs) in autonomous driving by exposing their tendencies toward response inconsistency and weak temporal reasoning, and subsequently proposes the FutureVQA benchmark and a self-supervised chain-of-thought tuning method to enhance grounded future scene reasoning without requiring temporal labels.

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani · 2026-03-11 · 💻 cs

DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation

This paper proposes DCAU-Net, a novel medical image segmentation framework that combines Differential Cross Attention to efficiently model long-range dependencies while reducing computational complexity, and a Channel-Spatial Feature Fusion strategy to adaptively integrate semantic and spatial details, thereby achieving enhanced segmentation accuracy and robustness.

Yanxin Li, Hui Wan, Libin Lan · 2026-03-11 · 💻 cs

Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts

This study demonstrates that the longitudinal progression of radiologic pleuroparenchymal fibroelastosis (PPFE), quantified via automated analysis of low-dose CT scans, independently predicts increased mortality and adverse respiratory outcomes in large lung cancer screening cohorts.

Shahab Aslani, Mehran Azimbagirad, Daryl Cheng, Daisuke Yamada, Ryoko Egashira, Adam Szmul, Justine Chan-Fook, Robert Chapman, Alfred Chung Pui So, Shanshan Wang, John McCabe, Tianqi Yang, Jose M Brenes, Eyjolfur Gudmundsson, The SUMMIT Consortium, Susan M. Astley, Daniel C. Alexander, Sam M. Janes, Joseph Jacob · 2026-03-11 · 🧬 q-bio

A comprehensive study of time-of-flight non-line-of-sight imaging

This paper presents a comprehensive study of Time-of-Flight non-line-of-sight imaging methods by unifying their theoretical formulations and hardware implementations to establish a common framework for analysis and demonstrate that, under equal constraints, existing techniques share similar performance limitations despite method-specific differences.

Julio Marco, Adrian Jarabo, Ji Hyun Nam, Alberto Tosi, Diego Gutierrez, Andreas Velten · 2026-03-11 · 💻 cs

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

The paper introduces GeoSolver, a framework that enhances remote sensing reasoning by leveraging a large-scale process supervision dataset (Geo-PRM-2M) and a novel Process-Aware Tree-GRPO algorithm to train a token-level reward model (GeoPRM), thereby enabling verifiable, step-by-step reasoning and robust test-time scaling for both specialized and general-purpose Vision-Language Models.

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang · 2026-03-11 · 💻 cs

GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning

The paper introduces GeoAlignCLIP, a unified framework that enhances fine-grained vision-language alignment in remote sensing by leveraging multi-granular semantic learning and intra-modal consistency, supported by a newly constructed hierarchical dataset (RSFG-100k) to outperform existing methods on diverse benchmarks.

Xiao Yang, Ronghao Fu, Zhuoran Duan, Zhiwen Lin, Xueyan Liu, Bo Yang · 2026-03-11 · 💻 cs

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

This paper introduces the Panorama-Language Modeling (PLM) paradigm and the PanoVQA dataset to enable holistic $360^\circ$ vision-language reasoning in adverse omni-scenes, demonstrating that a unified panoramic approach yields superior understanding compared to stitching multiple narrow-field-of-view inputs.

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen · 2026-03-11 · 💻 cs

A saccade-inspired approach to image classification using vision transformer attention maps

This paper proposes a saccade-inspired image classification method that leverages DINO's Vision Transformer attention maps to selectively focus processing on task-relevant regions, achieving performance comparable to or better than full-image analysis while offering a biologically plausible approach to efficient visual processing.

Matthis Dallain, Laurent Rodriguez, Laurent Udo Perrinet, Benoît Miramond · 2026-03-11 · 💻 cs
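The core idea behind the saccade-style pipeline — using a Vision Transformer's attention map to pick a few "fixation" regions for focused processing — can be sketched generically. This is a minimal illustration, not the paper's implementation: the attention grid is synthetic (in the paper it would come from DINO's CLS-token attention), and the `select_fixations` helper and its parameters are assumptions for demonstration.

```python
import numpy as np

def select_fixations(attn, k=3, patch=16):
    """Pick the k highest-attention patches as saccade targets.

    attn  : (H, W) grid of attention weights over image patches
            (in practice produced by a ViT such as DINO; synthetic here).
    patch : patch size in pixels, used to map grid cells to pixel centers.
    Returns a list of (row_px, col_px) fixation centers, strongest first.
    """
    flat = attn.ravel()
    top = np.argsort(flat)[::-1][:k]                # indices of strongest patches
    rows, cols = np.unravel_index(top, attn.shape)  # back to grid coordinates
    # Convert patch-grid coordinates to pixel-space centers.
    return [(int(r * patch + patch // 2), int(c * patch + patch // 2))
            for r, c in zip(rows, cols)]

# Toy 14x14 attention grid (a 224px image with 16px patches) with two peaks.
attn = np.zeros((14, 14))
attn[3, 5] = 0.9
attn[10, 2] = 0.7
fixations = select_fixations(attn, k=2)  # → [(56, 88), (168, 40)]
```

A classifier would then crop windows around these centers and process only those regions, rather than the full image.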

OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty

This paper presents OTPL-VIO, a robust stereo visual-inertial odometry system that enhances performance in low-texture and illumination-challenging environments by employing a training-free deep descriptor with entropy-regularized optimal transport for line association and introducing adaptive uncertainty weighting to stabilize estimation.

Zikun Chen, Wentao Zhao, Yihe Niu, Tianchen Deng, Jingchuan Wang · 2026-03-11 · 💻 cs
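Entropy-regularized optimal transport, which OTPL-VIO uses for line association, is commonly solved with Sinkhorn iterations. Below is a minimal generic sketch: the cost matrix, uniform marginals, and argmax-based matching are illustrative assumptions, not the paper's exact formulation (which builds costs from deep line descriptors).

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost : (n, m) pairwise descriptor-distance matrix between two line sets.
    eps  : entropy regularization strength (smaller -> sharper plan).
    Returns the (n, m) transport plan, whose rows/columns approximately
    match the uniform source/target marginals.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)                          # Gibbs kernel
    a = np.full(n, 1.0 / n)                          # uniform source marginal
    b = np.full(m, 1.0 / m)                          # uniform target marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                         # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: two lines per frame; low cost on the diagonal means
# line 0 should match line 0, line 1 should match line 1.
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
plan = sinkhorn(cost)
matches = plan.argmax(axis=1)  # → [0, 1]
```

Thresholding the plan's entries (instead of a hard argmax) gives a natural way to reject ambiguous associations, which is one reason OT-based matching is attractive in low-texture scenes.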