cs.CV Papers | Gist.Science

Discriminative Perception via Anchored Description for Reasoning Segmentation

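This paper proposes DPAD, which forces the model to generate a descriptive caption and uses the semantic contrast between that caption and its context to introduce discriminative perception. This addresses the overly long, off-target reasoning chains of existing reasoning segmentation methods, markedly improving localization accuracy while shortening reasoning length.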

Tao Yang, Qing Zhou, Yanliang Li + 1 more | 2026-03-05 | cs.AI

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

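This paper proposes a new framework that combines a diagnostic-diversity-based data sampling strategy with Diagnostic Token-weighted Policy Optimization (DiTPO). By prioritizing the optimization of clinically critical information and improving data quality, it achieves state-of-the-art radiology report generation performance while substantially reducing the number of training samples required.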

Zilin Lu, Ruifeng Yuan, Weiwei Cao + 6 more | 2026-03-05 | cs

Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

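This paper proposes Volumetric Directional Diffusion (VDD), which anchors the generative trajectory to a deterministic anatomical consensus prior and predicts only a 3D boundary residual field. This resolves the trade-off between diversity and fidelity in medical image segmentation, substantially improving uncertainty quantification and producing anatomically consistent confidence maps while maintaining high segmentation accuracy.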

Chao Wu, Kangxian Xie, Mingchen Gao | 2026-03-05 | cs.AI

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

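This paper proposes DQE-CIR, which introduces learnable attribute weights to strengthen text-guided visual feature alignment and combines them with a target relative negative sampling strategy that selects highly informative negatives from the "middle zone". This addresses relevance suppression and semantic confusion in existing composed image retrieval methods, markedly improving query discriminability and retrieval accuracy in fine-grained attribute modification scenarios.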

Geon Park, Ji-Hoon Park, Seong-Whan Lee | 2026-03-05 | cs.AI

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

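To address the lack of benchmark datasets and accurate ground truth for long-term visual localization in dynamic benthic environments, this paper releases the first benthic dataset covering multiple sites and spanning up to six years, proposes an image-footprint-based method for constructing 3D ground truth, and uses it to benchmark eight state-of-the-art visual place recognition methods. The results reveal the performance limits of existing methods in this setting and the shortcomings of conventional distance-threshold evaluation.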

Martin Kvisvik Larsen, Oscar Pizarro | 2026-03-05 | cs

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

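This paper targets the under-studied problem of backdoor attacks on multi-encoder diffusion models (such as Stable Diffusion 3) and proposes MELT, which fine-tunes fewer than 0.2% of the parameters (low-rank adapters) while freezing the pretrained weights, achieving an efficient and effective lightweight backdoor attack.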

Ziyuan Chen, Yujin Jeong, Tobias Braun + 1 more | 2026-03-05 | cs.LG

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

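Through systematic evaluation, this study finds that in cell-level (40x40 pixel) small-patch pathology image analysis, task-specific architectures optimized for small patches (such as CustomViT) outperform foundation models in both accuracy and efficiency when training data is sufficient, and that foundation models do not show stronger robustness to blur.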

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi + 5 more | 2026-03-05 | cs

cs.CV

Discriminative Perception via Anchored Description for Reasoning Segmentation

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans

LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

Degradation-based augmented training for robust individual animal re-identification
