cs.CV papers | Gist.Science

Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

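This paper overcomes the limitations of existing evaluation methods by proposing an "emotion statement judgment" task built with an automated pipeline. The result is a new framework that evaluates the visual emotion perception of multimodal large language models (MLLMs) in an open-vocabulary, multifaceted way, and it shows that a large gap remains between current models and humans.

Daiqing Wu, Dongbao Yang, Sicheng Zhao + 2 more · 2026-03-03 · 💻 cs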

COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics

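This paper proposes COMPASS, a new conformal prediction framework that uses the intermediate feature space of a deep learning model to efficiently provide uncertainty guarantees for metrics derived from medical image segmentation (e.g., organ size). It shows that COMPASS yields narrower confidence intervals than prior methods while maintaining target coverage even under covariate shift.

Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan · 2026-03-03 · ⚡ eess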
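For readers unfamiliar with conformal prediction, the following is a minimal sketch of standard split conformal prediction applied directly to a scalar metric such as organ volume. It is not the COMPASS method, which according to the summary works in the model's intermediate feature space; the function names and the calibration numbers are hypothetical and chosen only for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(y_hat, q):
    """Symmetric interval around the point prediction."""
    return (y_hat - q, y_hat + q)


# Hypothetical calibration set: predicted vs. true metric values (e.g., volume in mL).
cal_pred = [100.0, 120.0, 95.0, 130.0, 110.0, 105.0, 98.0, 125.0, 115.0]
cal_true = [104.0, 117.0, 96.0, 138.0, 108.0, 111.0, 97.0, 120.0, 122.0]

# Nonconformity score: absolute error of the metric on each calibration case.
scores = [abs(t - p) for t, p in zip(cal_true, cal_pred)]

q = conformal_quantile(scores, alpha=0.2)  # target coverage of 80%
lo, hi = predict_interval(112.0, q)        # interval for a new prediction of 112 mL
print(q, lo, hi)
```

Under exchangeability of calibration and test cases, intervals built this way cover the true metric with probability at least 1 - alpha. The summary's claims about narrower intervals and robustness to covariate shift concern COMPASS itself and are not reproduced by this sketch.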

CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

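This paper proposes CircuitSense, a new benchmark that evaluates the hierarchical process of engineering design, from visual perception through symbolic reasoning. It reveals that existing multimodal large language models have serious limitations in turning visual information into mathematical expressions.

Arman Akbari, Jian Gao, Yifei Zou + 6 more · 2026-03-03 · 💻 cs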

Towards Interpretable Visual Decoding with Attention to Brain Representations

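This paper proposes NeuroAdapter, a framework that generates images directly from brain activity without going through an intermediate feature space, and IBBI, an interpretability framework that visualizes how brain regions contribute during the diffusion model's generation process. Together they improve the transparency and interpretability of visual reconstruction from brain signals.

Pinyuan Feng, Hossein Adeli, Wenxuan Guo + 3 more · 2026-03-03 · 💻 cs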

DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

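This paper addresses the shortcomings of existing methods that rely on character-level generation. It proposes DiffInk, which combines InkVAE, regularized with both an OCR loss and a style classification loss, with InkDiT, a latent diffusion transformer, to generate high-quality full-line online handwriting from text efficiently.

Wei Pan, Huiguo He, Hiuyi Cheng + 2 more · 2026-03-03 · 💻 cs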

Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

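This paper proposes SMART-R1, a new fine-tuning method that iteratively combines supervised learning and reinforcement learning. It reports a substantial improvement in real-time simulation performance on the Waymo Open Sim Agents Challenge, where the method took first place.

Muleilan Pei, Shaoshuai Shi, Shaojie Shen · 2026-03-03 · 💻 cs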

EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

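This paper proposes EditReward, a new reward model trained on a large-scale human evaluation dataset. It reports that the model aligns closely with human preferences on instruction-guided image editing tasks, which helps scale up high-quality synthetic training data and improve editing models.

Keming Wu, Sicong Jiang, Max Ku + 3 more · 2026-03-03 · 💬 cs.CL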

Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting

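Stylos is a single-forward 3D Gaussian Splatting framework for zero-shot 3D stylization from a reference style image. It handles inputs ranging from a single image to multi-view collections without pose information or per-scene optimization, while preserving both geometric fidelity and view consistency.

Hanzhou Liu, Jia Huang, Mi Lu + 2 more · 2026-03-03 · 💻 cs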

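Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

This paper proposes C$^3$B, a new benchmark that addresses the difficulty and multilinguality shortcomings of existing benchmarks. It uses multicultural, multilingual, multitask comic data to evaluate the cultural awareness of Multimodal Large Language Models, and it shows a large performance gap between current models and humans.

Yuchen Song, Andong Chen, Wenxin Zhu + 4 more · 2026-03-03 · 🤖 cs.AI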

LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

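This paper proposes LVTINO, the first zero-shot, plug-and-play inverse solver for high-definition video restoration. It leverages Video Consistency Models (VCMs), which explicitly capture the temporal consistency of video, and achieves state-of-the-art image quality and computational efficiency in a small number of inference steps without requiring automatic differentiation.

Alessio Spagnoletti, Andrés Almansa, Marcelo Pereyra · 2026-03-03 · 📊 stat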

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

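To bring the strong priors of DiT (Diffusion Transformer) to drag editing, this paper introduces a region-based rather than point-based editing paradigm. The proposed method, DragFlow, preserves background fidelity while improving subject consistency, and the paper reports new state-of-the-art performance.

Zihan Zhou, Shilin Lu, Shuli Leng + 4 more · 2026-03-03 · 🤖 cs.AI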

ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations

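To mitigate relation hallucinations in large vision-language models, this paper proposes ChainMPQ, a training-free method that accumulates image and text memory and sequentially poses multi-perspective questions focused on the subject, the object, and the relation. Its effectiveness is demonstrated on multiple benchmarks.

Yike Wu, Yiwei Wang, Yujun Cai · 2026-03-03 · 🤖 cs.AI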

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

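To better handle individual variability in echocardiography probe guidance, this paper proposes VA-Adapter, which gives an ultrasound foundation model the ability to understand patient-specific 3D structure online. Large-scale experiments with more than 1.31 million samples show that it outperforms existing models with few parameters.

Teng Wang, Haojun Jiang, Yuxuan Wang + 4 more · 2026-03-03 · 💻 cs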

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

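This paper proposes Test-Time Optimization and Memorization (TTOM), a training-free framework that keeps spatiotemporal layouts aligned with text and image at inference time. It shows that a parametric memory mechanism dramatically improves the compositional ability of video generation.

Leigang Qu, Ziyang Wang, Na Zheng + 3 more · 2026-03-03 · 💬 cs.CL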

Splat the Net: Radiance Fields with Splattable Neural Primitives

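This paper proposes a new volumetric representation, called "Splat the Net", that combines the expressiveness of NeRF with the fast rendering of 3D Gaussian Splatting. It achieves high-quality novel view synthesis with 10 times fewer primitives and 6 times fewer parameters than prior methods.

Xilong Zhou, Bao-Huy Nguyen, Loïc Magne + 3 more · 2026-03-03 · 💻 cs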

LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

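To combine the computational efficiency of linear attention with the image quality of generative models, this paper proposes LinearSR, a super-resolution framework built from three components: the ESGF strategy, which resolves training instability; an SNR-based MoE architecture, which overcomes the perception-distortion trade-off; and lightweight TAG guidance. The result is stable and efficient photorealistic image super-resolution.

Xiaohui Li, Shaobin Zhuang, Shuo Cao + 6 more · 2026-03-03 · 💻 cs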

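PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

This paper proposes PHyCLIP, a new vision-language model that represents hierarchy and compositionality at the same time. By introducing an $\ell_1$ distance on a product of hyperbolic spaces, it efficiently learns hierarchical relations between concepts and combinations of heterogeneous concepts, and it shows better performance and interpretability than existing methods.

Daiki Yoshikawa, Takashi Matsubara · 2026-03-03 · 🤖 cs.LG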
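As a rough illustration of what an $\ell_1$ distance on a product of hyperbolic spaces can look like, the sketch below sums per-factor hyperbolic distances. It assumes the Poincaré ball model for each factor, which the summary does not specify, and the function names are hypothetical; it is not taken from the PHyCLIP implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def l1_product_distance(xs, ys):
    """l1 product metric: sum of the hyperbolic distances over all factors."""
    return sum(poincare_distance(u, v) for u, v in zip(xs, ys))


# Two embeddings, each made of two 2-D hyperbolic factors (illustrative values).
x = [[0.0, 0.0], [0.0, 0.0]]
y = [[0.5, 0.0], [0.0, 0.5]]
print(l1_product_distance(x, y))
```

Because the factor distances are added rather than combined under a square root, each factor contributes independently to the total.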

Incomplete Multi-Label Image Recognition by Co-learning Semantic-Aware Features and Label Recovery

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training