cs.CV Papers | Gist.Science

Phi-4-reasoning-vision-15B Technical Report

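This paper reports the development and design rationale of Phi-4-reasoning-vision-15B, a compact open-weight multimodal model. Through high-quality data curation, a high-resolution encoder, and a hybrid design that switches between reasoning and direct-answer modes, it performs strongly on scientific and mathematical reasoning and on UI understanding under a limited compute budget.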

Jyoti Aneja, Michael Harrison, Neel Joshi + 3 more | 2026-03-05 | 🤖 cs.AI

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

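To address reasoning-driven segmentation in remote sensing imagery, this paper proposes GeoSeg, a training-free framework that combines bias correction with dual-path prompting, together with GeoSeg-Bench, a new benchmark for evaluating its performance, and demonstrates results that exceed existing methods.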

Lifan Jiang, Yuhang Pei, oxi Wu + 5 more | 2026-03-05 | 🤖 cs.AI

RIVER: A Real-Time Interaction Benchmark for Video LLMs

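This paper proposes RIVER, a new benchmark for evaluating and advancing video understanding with real-time, two-way interaction, aiming to overcome the limitations of existing multimodal large language models that depend on the offline paradigm. Through this evaluation, it identifies challenges in long-term memory and future prediction and presents a general improvement method that enables real-time interaction.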

Yansong Shi, Qingsong Zhao, Tianxiang Jiang + 3 more | 2026-03-05 | 💻 cs

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

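This paper proposes a diagnostic framework based on face pareidolia and shows that, when interpreting ambiguous visual evidence, VLMs exhibit semantic over-activation toward the "human" concept, in contrast to the conservative suppression of detection models and the uncertainty-based abstention of ViTs. It further shows that this behavior depends on the choice of representation rather than on score thresholds, and that uncertainty and bias are decoupled.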

Qianpu Chen, Derya Soydaner, Rob Saunders | 2026-03-05 | 🤖 cs.AI

Weakly Supervised Patch Annotation for Improved Screening of Diabetic Retinopathy

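For diabetic retinopathy screening, where expert annotations are limited, this paper introduces SAFE, a two-stage framework based on feature-space ensembling that systematically extends annotation to unannotated lesion regions and substantially improves downstream task performance.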

Shramana Dey, Abhirup Banerjee, B. Uma Shankar + 2 more | 2026-03-05 | 💻 cs

Discriminative Perception via Anchored Description for Reasoning Segmentation

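To address the problem in reasoning segmentation that the reasoning chains of multimodal large language models drift away from the target region, this paper proposes DPAD, which generates a descriptive caption of the target and enforces "discriminative perception" by contrasting it with the surrounding context. The paper shows that this both improves performance and shortens the reasoning.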

Tao Yang, Qing Zhou, Yanliang Li + 1 more | 2026-03-05 | 🤖 cs.AI

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

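For radiology report generation, this paper presents a new framework that combines a data sampling strategy based on diagnostic diversity with DiTPO, a method that focuses optimization on clinically important tokens, and substantially improves clinical accuracy with less data than conventional reinforcement learning.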

Zilin Lu, Ruifeng Yuan, Weiwei Cao + 6 more | 2026-03-05 | 💻 cs

Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

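To quantify uncertainty in ambiguous medical image segmentation while preserving anatomical consistency, this paper proposes Volumetric Directional Diffusion (VDD), which anchors the generative trajectory to a deterministic consensus prior and predicts a 3D boundary residual field, and demonstrates state-of-the-art performance on multiple datasets.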

Chao Wu, Kangxian Xie, Mingchen Gao | 2026-03-05 | 🤖 cs.AI

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

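To address the relevance suppression and semantic confusion that affect existing contrastive learning in composed image retrieval, this paper proposes DQE-CIR, a method that learns highly discriminative query embeddings. It introduces learnable attribute weights conditioned on the modification text and target-relative negative sampling, which selects negative samples of moderate difficulty.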

Geon Park, Ji-Hoon Park, Seong-Whan Lee | 2026-03-05 | 🤖 cs.AI

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

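To advance long-term visual localization in benthic environments, this paper presents a curated dataset spanning multiple sites and several years, a footprint-based ground-truth estimation method for accurately assessing visual overlap, and benchmark results for state-of-the-art visual place recognition methods.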

Martin Kvisvik Larsen, Oscar Pizarro | 2026-03-05 | 💻 cs

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

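For diffusion models with a multi-encoder architecture, such as Stable Diffusion 3, this paper proposes MELT, a lightweight method that trains less than 0.2% of all parameters, and demonstrates that efficient and effective backdoor attacks are possible even when multiple large text encoders are combined.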

Ziyuan Chen, Yujin Jeong, Tobias Braun + 1 more | 2026-03-05 | 🤖 cs.LG

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

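This study shows that, for cell-level histopathological image analysis on very small patches (40x40 pixels), task-specific architectures are more accurate and more efficient than foundation models when sufficient training data is available, and concludes that the advantage of large-scale pretrained models is limited.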

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi + 5 more | 2026-03-05 | 💻 cs

cs.CV

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination