cs.CV papers | Gist.Science

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

The paper introduces KFRA, a knowledge-augmented agent that emulates expert analysis through a three-stage closed reasoning loop to achieve superior open-set fine-grained visual understanding and interpretable, evidence-driven reasoning, validated by the newly constructed FGExpertBench.

Junhan Chen, Zilu Zhou, Yujun Tong + 3 more | 2026-03-05 | 💻 cs

LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

DriveMVS is a novel multi-view stereo framework for autonomous driving that leverages sparse LiDAR observations as geometric prompts and employs a spatio-temporal decoder to achieve state-of-the-art metric accuracy, temporal consistency, and cross-domain generalization.

Qihao Sun, Jiarun Liu, Ziqian Ni + 5 more | 2026-03-05 | 💻 cs

Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling

This paper proposes a novel framework for small object detection in complex backgrounds that integrates Residual Haar Wavelet Downsampling, Global Relation Modeling, Cross-Scale Hybrid Attention, and a Center-Assisted Loss to preserve fine-grained details, suppress noise, and enhance localization accuracy, achieving state-of-the-art performance on the RGBT-Tiny benchmark.

Wenguang Tao, Xiaotian Wang, Tian Yan + 2 more | 2026-03-05 | 💻 cs

TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration

This paper introduces TAP, a training-free framework that accelerates diffusion model inference by adaptively selecting the most accurate predictor for each token at every step based on a low-cost probe, achieving significant speedups with minimal quality loss.

Haowei Zhu, Tingxuan Huang, Xing Wang + 7 more | 2026-03-05 | 🤖 cs.LG

When and Where to Reset Matters for Long-Term Test-Time Adaptation

To address model collapse and knowledge loss in long-term test-time adaptation, this paper proposes an Adaptive and Selective Reset (ASR) framework that dynamically determines optimal reset timing and scope while employing an importance-aware regularizer to recover essential knowledge and an on-the-fly adjustment scheme to enhance adaptability.

Taejun Lim, Joong-Won Hwang, Kibok Lee | 2026-03-05 | 🤖 cs.AI

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

The paper introduces STAR, a novel autoregressive pretraining method that utilizes image separators to quadruple the input sequence length of Vision Mamba, achieving a competitive 83.5% accuracy on ImageNet-1k by effectively leveraging long-range dependencies.

Hanpeng Liu, Zidan Wang, Shuoxi Zhang + 2 more | 2026-03-05 | 🤖 cs.AI

Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10

This paper proposes a lightweight underwater object detection framework based on YOLOv10 that integrates a Multi-Stage Adaptive Enhancement module, a Dual-Pooling Sequential Attention mechanism, and a Focal Generalized IoU loss to significantly improve accuracy and robustness on benchmark datasets while maintaining a compact model size suitable for resource-constrained environments.

Md. Mushibur Rahman, Umme Fawzia Rahim, Enam Ahmed Taufik | 2026-03-05 | 💻 cs

Vector-Quantized Soft Label Compression for Dataset Distillation

This paper addresses the significant storage overhead of soft labels in dataset distillation by introducing a vector-quantized autoencoder (VQAE) that achieves 30–40x additional compression on benchmarks like ImageNet-1K while preserving over 90% of the original model performance.

Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash + 1 more | 2026-03-05 | 💻 cs

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

This paper proposes Structure-aware Prompt Adaptation (SPA), a plug-and-play method that leverages the consistent local structures of semantically related concepts in the embedding space to effectively generalize from seen to unseen attributes and objects in Open-Vocabulary Compositional Zero-Shot Learning.

Yihang Duan, Jiong Wang, Pengpeng Zeng + 5 more | 2026-03-05 | 💻 cs

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang + 10 more | 2026-03-05 | 🤖 cs.AI

Universal Pansharpening Foundation Model

This paper introduces FoundPS, a universal pansharpening foundation model that overcomes the limitations of existing satellite-specific methods by employing a modality-interleaved transformer, latent diffusion bridge, and pixel-to-latent interaction mechanisms to achieve robust, generalizable fusion across diverse sensors and scenes, supported by a new comprehensive benchmark called PSBench.

Hebaixu Wang, Jing Zhang, Haonan Guo + 4 more | 2026-03-05 | 💻 cs

All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network

This paper proposes CWP-Net, a novel all-in-one image restoration framework that utilizes causal deconfounding and wavelet-disentangled prompts to eliminate spurious correlations and biased degradation estimation, thereby achieving superior performance over state-of-the-art methods.

Bingnan Wang, Bin Qin, Jiangmeng Li + 3 more | 2026-03-05 | 💻 cs

DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

DeepScan is a training-free framework that enhances visually grounded reasoning in Large Vision-Language Models by employing a bottom-up approach of Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to effectively mitigate distractive contexts and achieve state-of-the-art performance across diverse architectures and scales.

Yangfu Li, Hongjian Zhan, Jiawei Chen + 3 more | 2026-03-05 | 💻 cs

Bridging Human Evaluation to Infrared and Visible Image Fusion

This paper proposes a feedback reinforcement framework for infrared and visible image fusion that uses a newly introduced large-scale human feedback dataset and a trained reward model to optimize fusion networks via Group Relative Policy Optimization, thereby aligning fusion outcomes significantly more closely with human visual preferences.

Jinyuan Liu, Xingyuan Li, Qingyun Mei + 5 more | 2026-03-05 | 💻 cs

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Yolo-Key-6D is a novel single-stage, end-to-end framework that achieves real-time monocular 6D pose estimation with competitive accuracy by integrating a keypoint-based auxiliary head for enhanced 3D geometry understanding and utilizing a continuous 9D rotation representation for stable training.

Kemal Alperen Çetiner, Hazım Kemal Ekenel | 2026-03-05 | 💻 cs

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

The paper introduces UniSync, a unified lip synchronization framework that combines mask-free pose-anchored training with mask-based blending inference to achieve high-fidelity, generalizable results across diverse real-world scenarios, including stylized avatars and challenging lighting conditions, while also proposing the RealWorld-LipSync benchmark for evaluation.

Ruidi Fan, Yang Zhou, Siyuan Wang + 3 more | 2026-03-05 | 💻 cs

A novel network for classification of cuneiform tablet metadata

This paper introduces a novel convolution-inspired network that effectively classifies cuneiform tablet metadata by integrating local and global information from high-resolution point clouds, outperforming the state-of-the-art Point-BERT model while addressing challenges posed by limited annotated datasets.

Frederik Hagelskjær | 2026-03-05 | 🤖 cs.AI

From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang + 4 more | 2026-03-05 | 🤖 cs.LG

Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

This paper proposes a Modular Asynchronous Tracking Architecture (MATA) that integrates a transformer-based tracker with an Extended Kalman Filter and ego-motion compensation to address UAV tracking challenges, while introducing a hardware-independent evaluation protocol and a new Normalized Time to Failure (NT2F) metric to better quantify robustness and real-time performance on embedded systems.

Augustin Borne, Pierre Notin, Christophe Hennequin + 4 more | 2026-03-05 | 💻 cs

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

This paper introduces FGAesthetics, a large-scale fine-grained image aesthetic assessment database with pairwise comparison annotations, and proposes FGAesQ, a novel framework that leverages relative ranks through specialized tokenization and alignment techniques to achieve superior discriminative performance in both fine-grained and coarse-grained aesthetic evaluation scenarios.

Zhichao Yang, Jianjie Wang, Zhixianhe Zhang + 4 more | 2026-03-05 | 💻 cs
