cs.CV papers | Gist.Science

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

This paper proposes a novel framework for radiology report generation that enhances reinforcement learning efficiency through a diagnostic diversity-based data sampling strategy and a Diagnostic Token-weighted Policy Optimization (DiTPO) method, achieving state-of-the-art clinical accuracy with significantly fewer training samples by prioritizing diagnostically critical content.

Zilin Lu, Ruifeng Yuan, Weiwei Cao + 6 more | 2026-03-05 | 💻 cs

Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

The paper proposes Volumetric Directional Diffusion (VDD), a novel framework that anchors generative trajectories to a deterministic consensus prior to predict 3D boundary residuals, thereby achieving state-of-the-art anatomically coherent uncertainty quantification for ambiguous medical image segmentation while avoiding the topological fractures common in standard diffusion models.

Chao Wu, Kangxian Xie, Mingchen Gao | 2026-03-05 | 🤖 cs.AI

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

The paper proposes DQE-CIR, a novel composed image retrieval method that enhances query discriminativeness and fine-grained retrieval accuracy by integrating learnable attribute weights for precise vision-language alignment and a target relative negative sampling strategy to mitigate relevance suppression and semantic confusion.

Geon Park, Ji-Hoon Park, Seong-Whan Lee | 2026-03-05 | 🤖 cs.AI

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

This paper addresses the lack of benchmarks for long-term visual localization in dynamic benthic environments by introducing a curated multi-year underwater dataset, a novel footprint-based ground-truthing method that outperforms traditional distance-threshold approaches, and a benchmark evaluation demonstrating that state-of-the-art visual place recognition methods struggle significantly in these challenging underwater settings.

Martin Kvisvik Larsen, Oscar Pizarro | 2026-03-05 | 💻 cs

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

This paper introduces MELT, a lightweight backdoor attack framework for multi-encoder diffusion models like Stable Diffusion 3, demonstrating that tuning fewer than 0.2% of parameters via low-rank adapters is sufficient to achieve effective attacks while identifying the minimal encoder subsets required for different objectives.

Ziyuan Chen, Yujin Jeong, Tobias Braun + 1 more | 2026-03-05 | 🤖 cs.LG

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

This study demonstrates that for cell-level histopathological image analysis under extreme spatial constraints, task-specific architectures trained on sufficient data outperform foundation models in both accuracy and efficiency, while offering comparable robustness to blur perturbations.

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi + 5 more | 2026-03-05 | 💻 cs

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

EgoPoseFormer v2 is a transformer-based framework that significantly advances egocentric human motion estimation for AR/VR by combining a novel architecture with an uncertainty-aware auto-labeling system to achieve state-of-the-art accuracy and temporal consistency on large-scale unlabeled datasets.

Zhenyu Li, Sai Kumar Dwivedi, Filip Maric + 11 more | 2026-03-05 | 💻 cs

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

This paper proposes a CLIP-guided multi-task regression framework that leverages level-aware vision-language embeddings to robustly predict plant age and leaf count from multi-view imagery, achieving significant accuracy improvements on the GroMo25 benchmark while simplifying the pipeline and handling incomplete inputs.

Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo + 2 more | 2026-03-05 | 💻 cs

Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

This paper introduces a training-free, capture-time frame curation method for always-on egocentric cameras that leverages gaze stability and pupil-derived novelty as complementary criteria to efficiently select high-quality, informative frames, achieving full-stream classification performance with only 10% of the data while respecting wearable device constraints.

Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish | 2026-03-05 | 💻 cs

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

This paper introduces HPENets, an efficient suite of MLP-based point cloud networks that leverage a two-stage abstraction-refinement paradigm, high-dimensional positional encoding, and non-local MLPs to achieve superior performance with significantly reduced computational costs compared to state-of-the-art models.

Yanmei Zou, Hongshan Yu, Yaonan Wang + 4 more | 2026-03-05 | 🤖 cs.AI

Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

This paper proposes a disentangled representation learning framework for brain MRI to demonstrate that demographic predictability primarily stems from anatomical variation rather than acquisition-dependent contrast, highlighting the need for targeted mitigation strategies that address these distinct sources to ensure robust bias reduction.

Mehmet Yigit Avci, Akshit Achara, Andrew King + 1 more | 2026-03-05 | 🤖 cs.AI

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

This paper introduces Any2Any, a unified latent diffusion framework that enables efficient and generalizable arbitrary modality translation in remote sensing by projecting heterogeneous inputs into a shared geometrically aligned latent space, supported by the newly proposed million-scale RST-1M dataset.

Haoyang Chen, Jing Zhang, Hebaixu Wang + 7 more | 2026-03-05 | 💻 cs

TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

TextBoost addresses the challenge of preserving small-font scene text in ultra-low bitrate image compression by transmitting negligible OCR-derived semantic guidance to the decoder, where it is fused with image features and enforced via a regularizing loss to significantly improve text recognition fidelity without compromising global image quality.

Bingxin Wang, Yuan Lan, Zhaoyi Sun + 2 more | 2026-03-05 | 💻 cs

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

This paper addresses the underexplored challenge of Few-Shot Open-Set Action Recognition in video data by proposing a Feature-Residual Discriminator (FR-Disc) that significantly improves unknown action rejection without sacrificing closed-set accuracy, establishing a new state-of-the-art benchmark across five datasets.

Stefano Berti, Giulia Pasquale, Lorenzo Natale | 2026-03-05 | 💻 cs

Crab+: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Crab+ is a scalable and unified audio-visual scene understanding model that overcomes the negative transfer issues of conventional multi-task methods by introducing the AV-UIE v2 dataset with explicit reasoning and an Interaction-aware LoRA mechanism to enable effective explicit cooperation across heterogeneous tasks.

Dongnuan Cai, Henghui Du, Chang Zhou + 5 more | 2026-03-05 | 🤖 cs.AI

Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

This paper proposes an inference-time attention regulation framework that utilizes anatomy-aware gating and pathology-guided latent corrections to achieve anatomically consistent and precisely localized counterfactual chest X-ray synthesis, effectively overcoming the structural drift and unstable pathology expression issues common in standard diffusion-based editing methods.

Zichun Zhang, Weizhi Nie, Honglin Guo + 1 more | 2026-03-05 | 💻 cs

HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans

This paper proposes HBRB-BoW, a refined hierarchical training algorithm that integrates global real-valued flows to preserve high-fidelity descriptor information before final binarization, thereby overcoming the precision loss of traditional binary clustering and significantly enhancing the discriminative power and performance of ORB-SLAM in loop closing and relocalization tasks.

Minjae Lee, Sang-Min Choi, Gun-Woo Kim + 1 more | 2026-03-05 | 💻 cs

LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

This paper proposes a LISTA-Transformer model that integrates Learnable Iterative Shrinkage Threshold Algorithm-based sparse coding with the Transformer architecture to overcome the limitations of CNNs and standard Transformers in local and global feature modeling, achieving a 98.5% fault recognition rate on the CWRU dataset through time-frequency signal analysis.

Shuang Liu, Lina Zhao, Tian Wang + 1 more | 2026-03-05 | 💻 cs

Degradation-based augmented training for robust individual animal re-identification

This paper introduces a degradation-based augmented training framework that artificially diversifies image degradations during training to significantly improve the robustness and accuracy of deep learning models for individual animal re-identification across various species and real-world conditions.

Thanos Polychronou, Lukáš Adam, Viktor Penchev + 1 more | 2026-03-05 | 💻 cs

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

The paper introduces PlaneCycle, a training-free and adapter-free method that lifts pretrained 2D foundation models to 3D by cyclically distributing spatial aggregation across orthogonal planes, enabling strong 3D performance without architectural modifications or additional parameters.

Yinghong Yu, Guangyuan Li, Jiancheng Yang | 2026-03-05 | 🤖 cs.AI

