cs.CV papers | Gist.Science

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

The paper introduces STAR, a novel autoregressive pretraining method that utilizes image separators to quadruple the input sequence length of Vision Mamba, achieving a competitive 83.5% accuracy on ImageNet-1k by effectively leveraging long-range dependencies.

Hanpeng Liu, Zidan Wang, Shuoxi Zhang + 2 more2026-03-05🤖 cs.AI

Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10

This paper proposes a lightweight underwater object detection framework based on YOLOv10 that integrates a Multi-Stage Adaptive Enhancement module, a Dual-Pooling Sequential Attention mechanism, and a Focal Generalized IoU loss to significantly improve accuracy and robustness on benchmark datasets while maintaining a compact model size suitable for resource-constrained environments.

Md. Mushibur Rahman, Umme Fawzia Rahim, Enam Ahmed Taufik2026-03-05💻 cs

Vector-Quantized Soft Label Compression for Dataset Distillation

This paper addresses the significant storage overhead of soft labels in dataset distillation by introducing a vector-quantized autoencoder (VQAE) that achieves 30–40x additional compression on benchmarks like ImageNet-1K while preserving over 90% of the original model performance.

Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash + 1 more2026-03-05💻 cs

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

This paper proposes Structure-aware Prompt Adaptation (SPA), a plug-and-play method that leverages the consistent local structures of semantically related concepts in the embedding space to effectively generalize from seen to unseen attributes and objects in Open-Vocabulary Compositional Zero-Shot Learning.

Yihang Duan, Jiong Wang, Pengpeng Zeng + 5 more2026-03-05💻 cs

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang + 10 more2026-03-05🤖 cs.AI

Universal Pansharpening Foundation Model

This paper introduces FoundPS, a universal pansharpening foundation model that overcomes the limitations of existing satellite-specific methods by employing a modality-interleaved transformer, latent diffusion bridge, and pixel-to-latent interaction mechanisms to achieve robust, generalizable fusion across diverse sensors and scenes, supported by a new comprehensive benchmark called PSBench.

Hebaixu Wang, Jing Zhang, Haonan Guo + 4 more2026-03-05💻 cs

All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network

This paper proposes CWP-Net, a novel all-in-one image restoration framework that utilizes causal deconfounding and wavelet-disentangled prompts to eliminate spurious correlations and biased degradation estimation, thereby achieving superior performance over state-of-the-art methods.

Bingnan Wang, Bin Qin, Jiangmeng Li + 3 more2026-03-05💻 cs

DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

DeepScan is a training-free framework that enhances visually grounded reasoning in Large Vision-Language Models by employing a bottom-up approach of Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to effectively mitigate distractive contexts and achieve state-of-the-art performance across diverse architectures and scales.

Yangfu Li, Hongjian Zhan, Jiawei Chen + 3 more2026-03-05💻 cs

Bridging Human Evaluation to Infrared and Visible Image Fusion

This paper proposes a feedback reinforcement framework for infrared and visible image fusion that leverages a newly introduced large-scale human feedback dataset and a trained reward model to optimize fusion networks via Group Relative Policy Optimization, thereby significantly aligning fusion outcomes with human visual preferences.

Jinyuan Liu, Xingyuan Li, Qingyun Mei + 5 more2026-03-05💻 cs

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Yolo-Key-6D is a novel single-stage, end-to-end framework that achieves real-time monocular 6D pose estimation with competitive accuracy by integrating a keypoint-based auxiliary head for enhanced 3D geometry understanding and utilizing a continuous 9D rotation representation for stable training.

Kemal Alperen Çetiner, Hazım Kemal Ekenel2026-03-05💻 cs

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

The paper introduces UniSync, a unified lip synchronization framework that combines mask-free pose-anchored training with mask-based blending inference to achieve high-fidelity, generalizable results across diverse real-world scenarios, including stylized avatars and challenging lighting conditions, while also proposing the RealWorld-LipSync benchmark for evaluation.

Ruidi Fan, Yang Zhou, Siyuan Wang + 3 more2026-03-05💻 cs

A novel network for classification of cuneiform tablet metadata

This paper introduces a novel convolution-inspired network that effectively classifies cuneiform tablet metadata by integrating local and global information from high-resolution point clouds, outperforming the state-of-the-art Point-BERT model while addressing challenges posed by limited annotated datasets.

Frederik Hagelskjær2026-03-05🤖 cs.AI

From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang + 4 more2026-03-05🤖 cs.LG

Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

This paper proposes a Modular Asynchronous Tracking Architecture (MATA) that integrates a transformer-based tracker with an Extended Kalman Filter and ego-motion compensation to address UAV tracking challenges, while introducing a hardware-independent evaluation protocol and a new Normalized Time to Failure (NT2F) metric to better quantify robustness and real-time performance on embedded systems.

Augustin Borne, Pierre Notin, Christophe Hennequin + 4 more2026-03-05💻 cs

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

This paper introduces FGAesthetics, a large-scale fine-grained image aesthetic assessment database with pairwise comparison annotations, and proposes FGAesQ, a novel framework that leverages relative ranks through specialized tokenization and alignment techniques to achieve superior discriminative performance in both fine-grained and coarse-grained aesthetic evaluation scenarios.

Zhichao Yang, Jianjie Wang, Zhixianhe Zhang + 4 more2026-03-05💻 cs

N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition

This paper proposes an N-gram Injection (NGI) method that dynamically adapts Transformer-based handwritten text recognition models to target language distributions at inference time by injecting external n-gram language models, thereby significantly reducing performance gaps caused by language shifts without requiring additional training on target data.

Florent Meyer, Laurent Guichard, Denis Coquenet + 3 more2026-03-05💻 cs

DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

The paper introduces DISC, a fully GPU-accelerated framework that utilizes a novel single-pass, distance-weighted mechanism to extract dense semantic context from CLIP embeddings, enabling efficient, real-time open-set semantic mapping that significantly outperforms existing state-of-the-art methods in accuracy and scalability.

Felix Igelbrink, Lennart Niecksch, Martin Atzmueller + 1 more2026-03-05💻 cs

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

This paper presents CMDR-IAD, a lightweight unsupervised framework that achieves state-of-the-art industrial anomaly detection by combining bidirectional 2D-3D cross-modal mapping with dual-branch reconstruction to robustly handle noisy, weak-texture, or missing modalities without relying on memory banks.

Radia Daci, Vito Renò, Cosimo Patruno + 4 more2026-03-05🤖 cs.AI

Slice-wise quality assessment of high b-value breast DWI via deep learning-based artifact detection

This study demonstrates that a DenseNet121-based deep learning model effectively detects hyper- and hypointense intensity artifacts on high b-value (1500 s/mm²) breast diffusion-weighted MRI slices, achieving high AUROC scores and providing promising results for automated quality assessment.

Ameya Markale, Luise Brock, Ihor Horishnyi + 10 more2026-03-05💻 cs

Spatial Causal Prediction in Video

This paper introduces Spatial Causal Prediction (SCP), a new task paradigm and benchmark (SCP-Bench) designed to evaluate and improve video models' ability to infer unseen spatial states and causal outcomes beyond visible observations, revealing significant gaps between current AI and human intelligence in this domain.

Yanguang Zhao, Jie Yang, Shengqiong Wu + 9 more2026-03-05💻 cs

← Previous Next →