cs.CV papers | Gist.Science

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang + 10 more | 2026-03-05 | 🤖 cs.AI

Universal Pansharpening Foundation Model

This paper introduces FoundPS, a universal pansharpening foundation model that overcomes the limitations of existing satellite-specific methods by employing a modality-interleaved transformer, latent diffusion bridge, and pixel-to-latent interaction mechanisms to achieve robust, generalizable fusion across diverse sensors and scenes, supported by a new comprehensive benchmark called PSBench.

Hebaixu Wang, Jing Zhang, Haonan Guo + 4 more | 2026-03-05 | 💻 cs

All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network

This paper proposes CWP-Net, a novel all-in-one image restoration framework that utilizes causal deconfounding and wavelet-disentangled prompts to eliminate spurious correlations and biased degradation estimation, thereby achieving superior performance over state-of-the-art methods.

Bingnan Wang, Bin Qin, Jiangmeng Li + 3 more | 2026-03-05 | 💻 cs

DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

DeepScan is a training-free framework that enhances visually grounded reasoning in Large Vision-Language Models by employing a bottom-up approach of Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to effectively mitigate distractive contexts and achieve state-of-the-art performance across diverse architectures and scales.

Yangfu Li, Hongjian Zhan, Jiawei Chen + 3 more | 2026-03-05 | 💻 cs

Bridging Human Evaluation to Infrared and Visible Image Fusion

This paper proposes a feedback reinforcement framework for infrared and visible image fusion that uses a newly introduced large-scale human feedback dataset and a trained reward model to optimize fusion networks via Group Relative Policy Optimization, aligning fusion outcomes more closely with human visual preferences.

Jinyuan Liu, Xingyuan Li, Qingyun Mei + 5 more | 2026-03-05 | 💻 cs

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Yolo-Key-6D is a novel single-stage, end-to-end framework that achieves real-time monocular 6D pose estimation with competitive accuracy by integrating a keypoint-based auxiliary head for enhanced 3D geometry understanding and utilizing a continuous 9D rotation representation for stable training.

Kemal Alperen Çetiner, Hazım Kemal Ekenel | 2026-03-05 | 💻 cs

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

The paper introduces UniSync, a unified lip synchronization framework that combines mask-free pose-anchored training with mask-based blending inference to achieve high-fidelity, generalizable results across diverse real-world scenarios, including stylized avatars and challenging lighting conditions, while also proposing the RealWorld-LipSync benchmark for evaluation.

Ruidi Fan, Yang Zhou, Siyuan Wang + 3 more | 2026-03-05 | 💻 cs

A novel network for classification of cuneiform tablet metadata

This paper introduces a novel convolution-inspired network that effectively classifies cuneiform tablet metadata by integrating local and global information from high-resolution point clouds, outperforming the state-of-the-art Point-BERT model while addressing challenges posed by limited annotated datasets.

Frederik Hagelskjær | 2026-03-05 | 🤖 cs.AI

From Misclassifications to Outliers: Joint Reliability Assessment in Classification

This paper proposes a unified evaluation framework with new metrics (DS-F1 and DS-AURC) and an improved method (SURE+) to jointly assess and enhance classifier reliability by integrating out-of-distribution detection and in-distribution failure prediction, demonstrating that double scoring functions significantly outperform traditional single scoring approaches.

Yang Li, Youyang Sha, Yinzhi Wang + 4 more | 2026-03-05 | 🤖 cs.LG

Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

This paper proposes a Modular Asynchronous Tracking Architecture (MATA) that integrates a transformer-based tracker with an Extended Kalman Filter and ego-motion compensation to address UAV tracking challenges, while introducing a hardware-independent evaluation protocol and a new Normalized Time to Failure (NT2F) metric to better quantify robustness and real-time performance on embedded systems.

Augustin Borne, Pierre Notin, Christophe Hennequin + 4 more | 2026-03-05 | 💻 cs

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

This paper introduces FGAesthetics, a large-scale fine-grained image aesthetic assessment database with pairwise comparison annotations, and proposes FGAesQ, a novel framework that leverages relative ranks through specialized tokenization and alignment techniques to achieve superior discriminative performance in both fine-grained and coarse-grained aesthetic evaluation scenarios.

Zhichao Yang, Jianjie Wang, Zhixianhe Zhang + 4 more | 2026-03-05 | 💻 cs

N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition

This paper proposes an N-gram Injection (NGI) method that dynamically adapts Transformer-based handwritten text recognition models to target language distributions at inference time by injecting external n-gram language models, thereby significantly reducing performance gaps caused by language shifts without requiring additional training on target data.

Florent Meyer, Laurent Guichard, Denis Coquenet + 3 more | 2026-03-05 | 💻 cs

DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

The paper introduces DISC, a fully GPU-accelerated framework that utilizes a novel single-pass, distance-weighted mechanism to extract dense semantic context from CLIP embeddings, enabling efficient, real-time open-set semantic mapping that significantly outperforms existing state-of-the-art methods in accuracy and scalability.

Felix Igelbrink, Lennart Niecksch, Martin Atzmueller + 1 more | 2026-03-05 | 💻 cs

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

This paper presents CMDR-IAD, a lightweight unsupervised framework that achieves state-of-the-art industrial anomaly detection by combining bidirectional 2D-3D cross-modal mapping with dual-branch reconstruction to robustly handle noisy, weak-texture, or missing modalities without relying on memory banks.

Radia Daci, Vito Renò, Cosimo Patruno + 4 more | 2026-03-05 | 🤖 cs.AI

Slice-wise quality assessment of high b-value breast DWI via deep learning-based artifact detection

This study demonstrates that a DenseNet121-based deep learning model effectively detects hyper- and hypointense intensity artifacts on high b-value (1500 s/mm²) breast diffusion-weighted MRI slices, achieving high AUROC scores and providing promising results for automated quality assessment.

Ameya Markale, Luise Brock, Ihor Horishnyi + 10 more | 2026-03-05 | 💻 cs

Spatial Causal Prediction in Video

This paper introduces Spatial Causal Prediction (SCP), a new task paradigm and benchmark (SCP-Bench) designed to evaluate and improve video models' ability to infer unseen spatial states and causal outcomes beyond visible observations, revealing significant gaps between current AI and human intelligence in this domain.

Yanguang Zhao, Jie Yang, Shengqiong Wu + 9 more | 2026-03-05 | 💻 cs

RVN-Bench: A Benchmark for Reactive Visual Navigation

The paper introduces RVN-Bench, a new collision-aware benchmark built on Habitat 2.0 and HM3D scenes that enables the training and evaluation of safe, robust indoor visual navigation policies for mobile robots in unseen, cluttered environments.

Jaewon Lee, Jaeseok Heo, Gunmin Lee + 3 more | 2026-03-05 | 🤖 cs.AI

Towards Generalized Multimodal Homography Estimation

This paper proposes a training data synthesis method that generates diverse, unaligned image pairs from single inputs alongside a novel network architecture to enhance the robustness and generalization of multimodal homography estimation across unseen domains.

Jinkun You, Jiaxin Cheng, Jie Zhang + 1 more | 2026-03-05 | 🤖 cs.AI

Structural Action Transformer for 3D Dexterous Manipulation

This paper proposes the Structural Action Transformer (SAT), a novel 3D dexterous manipulation policy that reframes actions as variable-length, unordered joint trajectories and utilizes an Embodied Joint Codebook to achieve superior sample efficiency and cross-embodiment skill transfer from heterogeneous datasets.

Xiaohan Lei, Min Wang, Bohong Weng + 2 more | 2026-03-05 | 💻 cs

ProFound: A moderate-sized vision foundation model for multi-task prostate imaging

The paper introduces ProFound, a domain-specialized vision foundation model pre-trained on over 22,000 prostate MRI volumes via self-supervised learning, which demonstrates superior or competitive performance across 11 diverse clinical tasks compared to state-of-the-art specialized and foundation models.

Yipei Wang, Yinsong Xu, Weixi Yi + 11 more | 2026-03-05 | 💻 cs
