cs.CV papers | Gist.Science

Glass Segmentation with Fusion of Learned and General Visual Features

This paper introduces a novel dual-backbone architecture that fuses general visual features from a frozen DINOv3 model with task-specific features from a supervised Swin model to achieve state-of-the-art glass segmentation performance across multiple datasets while maintaining competitive inference speed.

Risto Ojala, Tristan Ellison, Mo Chen | 2026-03-05 | 💻 cs

QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment

To address the generalization challenges in No-Reference Point Cloud Quality Assessment caused by data scarcity, this paper proposes QD-PCQA, a novel unsupervised domain adaptation framework that transfers quality priors from images to point clouds through a Rank-weighted Conditional Alignment strategy and a Quality-guided Feature Augmentation module to enhance perceptual quality ranking and feature alignment.

Guohua Zhang, Jian Jin, Meiqin Liu + 2 more | 2026-03-05 | 💻 cs

PROSPECT: Unified Streaming Vision-Language Navigation via Semantic-Spatial Fusion and Latent Predictive Representation

The paper proposes PROSPECT, a unified streaming vision-language navigation agent that integrates CUT3R-based spatial encoding with SigLIP semantic features and employs latent predictive representation learning to achieve state-of-the-art performance and robustness in long-horizon navigation tasks.

Zehua Fan, Wenqi Lyu, Wenxuan Song + 12 more | 2026-03-05 | 🤖 cs.AI

DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

DAGE introduces a dual-stream transformer architecture that efficiently estimates accurate, view-consistent geometry and camera poses from uncalibrated multi-view inputs by disentangling global coherence in a low-resolution stream from fine details in a high-resolution stream, achieving state-of-the-art performance while supporting high resolutions and long sequences.

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh + 4 more | 2026-03-05 | 💻 cs

WSI-INR: Implicit Neural Representations for Lesion Segmentation in Whole-Slide Images

This paper proposes WSI-INR, a novel patch-free framework utilizing Implicit Neural Representations and multi-resolution hash grid encoding to model whole-slide images as continuous functions, thereby overcoming the spatial fragmentation and resolution sensitivity of existing methods to achieve robust and accurate lesion segmentation across varying scales.

Yunheng Wu, Wenqi Huang, Liangyi Wang + 4 more | 2026-03-05 | 💻 cs

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

The paper introduces KFRA, a knowledge-augmented agent that emulates expert analysis through a three-stage closed reasoning loop to achieve superior open-set fine-grained visual understanding and interpretable, evidence-driven reasoning, validated by the newly constructed FGExpertBench.

Junhan Chen, Zilu Zhou, Yujun Tong + 3 more | 2026-03-05 | 💻 cs

LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

DriveMVS is a novel multi-view stereo framework for autonomous driving that leverages sparse LiDAR observations as geometric prompts and employs a spatio-temporal decoder to achieve state-of-the-art metric accuracy, temporal consistency, and cross-domain generalization.

Qihao Sun, Jiarun Liu, Ziqian Ni + 5 more | 2026-03-05 | 💻 cs

Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling

This paper proposes a novel framework for small object detection in complex backgrounds that integrates Residual Haar Wavelet Downsampling, Global Relation Modeling, Cross-Scale Hybrid Attention, and a Center-Assisted Loss to preserve fine-grained details, suppress noise, and enhance localization accuracy, achieving state-of-the-art performance on the RGBT-Tiny benchmark.

Wenguang Tao, Xiaotian Wang, Tian Yan + 2 more | 2026-03-05 | 💻 cs

TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration

This paper introduces TAP, a training-free framework that accelerates diffusion model inference by adaptively selecting the most accurate predictor for each token at every step based on a low-cost probe, achieving significant speedups with minimal quality loss.

Haowei Zhu, Tingxuan Huang, Xing Wang + 7 more | 2026-03-05 | 🤖 cs.LG

When and Where to Reset Matters for Long-Term Test-Time Adaptation

To address model collapse and knowledge loss in long-term test-time adaptation, this paper proposes an Adaptive and Selective Reset (ASR) framework that dynamically determines optimal reset timing and scope while employing an importance-aware regularizer to recover essential knowledge and an on-the-fly adjustment scheme to enhance adaptability.

Taejun Lim, Joong-Won Hwang, Kibok Lee | 2026-03-05 | 🤖 cs.AI

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

The paper introduces STAR, a novel autoregressive pretraining method that utilizes image separators to quadruple the input sequence length of Vision Mamba, achieving a competitive 83.5% accuracy on ImageNet-1k by effectively leveraging long-range dependencies.

Hanpeng Liu, Zidan Wang, Shuoxi Zhang + 2 more | 2026-03-05 | 🤖 cs.AI

Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10

This paper proposes a lightweight underwater object detection framework based on YOLOv10 that integrates a Multi-Stage Adaptive Enhancement module, a Dual-Pooling Sequential Attention mechanism, and a Focal Generalized IoU loss to significantly improve accuracy and robustness on benchmark datasets while maintaining a compact model size suitable for resource-constrained environments.

Md. Mushibur Rahman, Umme Fawzia Rahim, Enam Ahmed Taufik | 2026-03-05 | 💻 cs

Vector-Quantized Soft Label Compression for Dataset Distillation

This paper addresses the significant storage overhead of soft labels in dataset distillation by introducing a vector-quantized autoencoder (VQAE) that achieves 30–40x additional compression on benchmarks like ImageNet-1K while preserving over 90% of the original model performance.

Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash + 1 more | 2026-03-05 | 💻 cs

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

This paper proposes Structure-aware Prompt Adaptation (SPA), a plug-and-play method that leverages the consistent local structures of semantically related concepts in the embedding space to effectively generalize from seen to unseen attributes and objects in Open-Vocabulary Compositional Zero-Shot Learning.

Yihang Duan, Jiong Wang, Pengpeng Zeng + 5 more | 2026-03-05 | 💻 cs

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang + 10 more | 2026-03-05 | 🤖 cs.AI

Universal Pansharpening Foundation Model

This paper introduces FoundPS, a universal pansharpening foundation model that overcomes the limitations of existing satellite-specific methods by employing a modality-interleaved transformer, latent diffusion bridge, and pixel-to-latent interaction mechanisms to achieve robust, generalizable fusion across diverse sensors and scenes, supported by a new comprehensive benchmark called PSBench.

Hebaixu Wang, Jing Zhang, Haonan Guo + 4 more | 2026-03-05 | 💻 cs

All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network

This paper proposes CWP-Net, a novel all-in-one image restoration framework that utilizes causal deconfounding and wavelet-disentangled prompts to eliminate spurious correlations and biased degradation estimation, thereby achieving superior performance over state-of-the-art methods.

Bingnan Wang, Bin Qin, Jiangmeng Li + 3 more | 2026-03-05 | 💻 cs

DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

DeepScan is a training-free framework that enhances visually grounded reasoning in Large Vision-Language Models by employing a bottom-up approach of Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to effectively mitigate distractive contexts and achieve state-of-the-art performance across diverse architectures and scales.

Yangfu Li, Hongjian Zhan, Jiawei Chen + 3 more | 2026-03-05 | 💻 cs

Bridging Human Evaluation to Infrared and Visible Image Fusion

This paper proposes a feedback reinforcement framework for infrared and visible image fusion that uses a newly introduced large-scale human feedback dataset and a trained reward model to optimize fusion networks via Group Relative Policy Optimization, bringing fusion outcomes closer to human visual preferences.

Jinyuan Liu, Xingyuan Li, Qingyun Mei + 5 more | 2026-03-05 | 💻 cs

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Yolo-Key-6D is a novel single-stage, end-to-end framework that achieves real-time monocular 6D pose estimation with competitive accuracy by integrating a keypoint-based auxiliary head for enhanced 3D geometry understanding and utilizing a continuous 9D rotation representation for stable training.

Kemal Alperen Çetiner, Hazım Kemal Ekenel | 2026-03-05 | 💻 cs
