cs.CV papers | Gist.Science

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

This paper investigates the reliability of Vision-Language Models (VLMs) in autonomous driving by exposing their tendencies toward response inconsistency and weak temporal reasoning, and subsequently proposes the FutureVQA benchmark and a self-supervised chain-of-thought tuning method to enhance grounded future scene reasoning without requiring temporal labels.

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani | Wed, 11 Ma | cs

Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation

The paper presents Context-Nav, a training-free framework for text-goal instance navigation that combines caption-driven frontier ranking for global exploration with viewpoint-aware 3D spatial verification to accurately disambiguate target objects in cluttered environments, achieving state-of-the-art performance on InstanceNav and CoIN-Bench.

Won Shik Jang, Ue-Hwan Kim | Wed, 11 Ma | cs

SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

The paper proposes SurgFed, a language-guided multi-task federated learning framework that utilizes Language-guided Channel Selection and Language-guided Hyper Aggregation to overcome tissue and task diversity challenges, thereby improving surgical video segmentation and depth estimation across heterogeneous clinical environments.

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming Jin | Wed, 11 Ma | cs

Streaming Autoregressive Video Generation via Diagonal Distillation

This paper introduces Diagonal Distillation, an asymmetric autoregressive framework that leverages temporal context and implicit optical flow to enable high-fidelity, real-time streaming video generation with a 277.3x speedup while mitigating error accumulation and motion incoherence.

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu | Wed, 11 Ma | cs

Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

This paper proposes a novel component-aware, self-refining framework that combines a Self-Attention-based Autoencoder, a Coordinate-Preserving Gated Fusion module, and a Spatially Adaptive Refinement Revisor to generate high-fidelity, semantically accurate photorealistic images from freehand sketches, significantly outperforming existing GAN and diffusion models across diverse facial and non-facial datasets.

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi | Wed, 11 Ma | cs

Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

PruneSID is a training-free, synergistic importance-diversity framework that significantly enhances Vision-Language Model efficiency by employing Principal Semantic Components Analysis and Intra-group Non-Maximum Suppression to achieve state-of-the-art accuracy with extreme token compression and faster prefilling speeds.

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei | Wed, 11 Ma | cs

OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

This paper introduces OmniEarth, a comprehensive benchmark comprising 9,275 images and 44,210 verified instructions that evaluates Vision-Language Models across 28 geospatial tasks with a focus on perception, reasoning, and robustness, revealing significant performance gaps in current models for remote sensing applications.

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo Yang | Wed, 11 Ma | cs

The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

This paper introduces the Patrologia Graeca Corpus, a large-scale open resource featuring OCR-processed, lemmatized, and part-of-speech tagged text from degraded nineteenth-century bilingual Greek-Latin editions, which achieves state-of-the-art recognition accuracy and establishes a new benchmark for noisy polytonic Greek processing.

Chahan Vidal-Gorène (CJM, LIPN), Bastien Kindt | Wed, 11 Ma | cs

TopoOR: A Unified Topological Scene Representation for the Operating Room

TopoOR introduces a novel topological scene representation for surgical operating rooms that leverages higher-order structures and attention mechanisms to preserve complex multimodal relationships and manifold geometry, thereby outperforming traditional graph and LLM-based methods in safety-critical tasks like sterility breach detection and robot phase prediction.

Tony Danjun Wang, Ka Young Kim, Tolga Birdal, Nassir Navab, Lennart Bastian | Wed, 11 Ma | cs

GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis

The paper proposes GIIM, a novel graph-based framework that enhances multi-view medical image diagnosis by simultaneously modeling intra-view relationships and inter-view dynamics while effectively handling missing data to improve predictive accuracy and robustness.

Tran Bao Sam, Hung Vu, Dao Trung Kien, Tran Dat Dang, Van Ha Tang, Steven Truong | Wed, 11 Ma | cs

MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating

The paper proposes MetaDAT, a trajectory prediction framework that combines meta-learning pre-training with a data-adaptive test-time updating mechanism to achieve robust, fast, and accurate online adaptation under distribution shifts by dynamically adjusting learning rates and focusing on informative hard samples.

Yuning Wang, Pu Zhang, Yuan He, Ke Wang, Jianru Xue | Wed, 11 Ma | cs

CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation

CIGPose introduces a Causal Intervention Graph Neural Network framework that enhances whole-body pose estimation robustness by using a Structural Causal Model to identify and replace context-confounded keypoint representations with invariant embeddings, thereby achieving state-of-the-art performance on COCO-WholeBody without relying on extra training data.

Bohao Li, Zhicheng Cao, Huixian Li, Yangming Guo | Wed, 11 Ma | cs

RiO-DETR: DETR for Real-time Oriented Object Detection

RiO-DETR is the first real-time oriented object detection transformer that addresses challenges in angle estimation, periodicity, and convergence through novel designs like Content-Driven Angle Estimation and Decoupled Periodic Refinement, achieving a new speed-accuracy trade-off on benchmark datasets.

Zhangchi Hu, Yifan Zhao, Yansong Peng, Wenzhang Sun, Xiangchen Yin, Jie Chen, Peixi Wu, Hebei Li, Xinghao Wang, Dongsheng Jiang, Xiaoyan Sun | Wed, 11 Ma | cs

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

This paper introduces YOLO-NAS-Bench, the first surrogate benchmark for YOLO-style object detectors, which employs a self-evolving mechanism to iteratively refine a LightGBM predictor, enabling efficient and accurate discovery of high-performing architectures that surpass official YOLO baselines.

Zhe Li, Xiaoyu Ding, Jiaxin Zheng, Yongtao Wang | Wed, 11 Ma | cs

Training-Free Coverless Multi-Image Steganography with Access Control

The paper proposes MIDAS, a training-free diffusion-based framework that enables coverless multi-image steganography with user-specific access control through latent-level fusion, demonstrating superior performance in image quality, robustness, and security compared to existing methods.

Minyeol Bae, Si-Hyeon Lee | Wed, 11 Ma | cs

EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation

EventVGGT is a novel framework that addresses the scarcity of depth annotations and temporal inconsistency in event-based monocular depth estimation by treating event streams as coherent video sequences and distilling spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) through a tri-level distillation strategy, achieving state-of-the-art performance and robust zero-shot generalization.

Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, Hui Xiong | Wed, 11 Ma | cs

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

SinGeo is a novel framework that achieves robust cross-view geo-localization using a single model by employing a dual discriminative learning architecture and a curriculum learning strategy, thereby overcoming the limitations of existing methods that struggle with unseen fields of view and orientations.

Yang Chen, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu | Wed, 11 Ma | cs

Evidential Perfusion Physics-Informed Neural Networks with Residual Uncertainty Quantification

This paper introduces Evidential Perfusion Physics-Informed Neural Networks (EPPINN), a novel framework that integrates evidential deep learning with physics-informed modeling to quantify both aleatoric and epistemic uncertainties in CT perfusion imaging, thereby achieving superior accuracy and reliability in acute ischemic stroke assessment compared to existing deterministic methods.

Junhyeok Lee, Minseo Choi, Han Jang, Young Hun Jeon, Heeseong Eum, Joon Jang, Chul-Ho Sohn, Kyu Sung Choi | Wed, 11 Ma | cs

Robust Provably Secure Image Steganography via Latent Iterative Optimization

This paper proposes a robust and provably secure image steganography framework that utilizes latent-space iterative optimization to significantly enhance message extraction accuracy under various compression and processing scenarios while maintaining security guarantees.

Yanan Li, Zixuan Wang, Qiyang Xiao, Yanzhen Ren | Wed, 11 Ma | cs

Predictive Spectral Calibration for Source-Free Test-Time Regression

This paper proposes Predictive Spectral Calibration (PSC), a simple and model-agnostic source-free framework that enhances test-time adaptation for image regression by extending subspace alignment to block spectral matching, thereby achieving consistent performance improvements over strong baselines, especially under severe distribution shifts.

Nguyen Viet Tuan Kiet, Huynh Thanh Trung, Pham Huy Hieu | Wed, 11 Ma | cs
