cs.CV Papers | Gist.Science

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

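CRISP recovers simulatable human motion and scene geometry from monocular video. It fits planar primitives to build clean, convex, simulation-ready geometry, uses human-contact modeling to fill in occluded regions, and relies on a reinforcement-learning controller to enforce physical plausibility, which markedly lowers the motion-tracking failure rate and improves simulation efficiency.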

Zihan Wang, Jiashun Wang, Jeff Tan + 4 more | 2026-03-03 | cs

SoFlow: Solution Flow Models for One-Step Generative Modeling

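This paper proposes the Solution Flow Models (SoFlow) framework, which combines a flow matching loss with a solution consistency loss that requires no Jacobian-vector product (JVP) computation. The result is an efficient one-step generative model trained from scratch that outperforms MeanFlow models on ImageNet 256x256.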

Tianze Luo, Haotian Yuan, Zhuang Liu | 2026-03-03 | cs.LG

AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

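This paper proposes an interpretable multimodal AI framework that combines deep-learning image analysis with family-history data. By incorporating hereditary risk factors, it aims to improve the accuracy and personalization of skin disease diagnosis, and it outlines follow-up clinical validation to support deployment in medical workflows.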

Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer | 2026-03-03 | cs.AI

GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

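This paper proposes GeoTeacher, a semi-supervised 3D object detection framework. It introduces a supervision module based on geometric relations among keypoints and a voxel-level data augmentation strategy with a distance-decay mechanism. Together these address the model's low sensitivity to object geometry when labeled data is limited, achieving new state-of-the-art results on the ONCE and Waymo datasets.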

Jingyu Li, Xiaolong Zhao, Zhe Liu + 2 more | 2026-03-03 | cs

ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis

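This study proposes ForCM, a method that combines several deep learning models (such as AttentionUNet and ResUNet) with object-based image analysis (OBIA) on Sentinel-2 multispectral imagery. It substantially improves forest cover mapping accuracy for the Amazon rainforest (up to 95.64%) and demonstrates the potential of open-source tools for global environmental monitoring.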

Maisha Haque, Israt Jahan Ayshi, Sadaf M. Anis + 8 more | 2026-03-03 | cs.AI

cs.CV

Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Aligned explanations in neural networks

TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

Counterfactual Explanations on Robust Perceptual Geodesics

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection

Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Investigating Disability Representations in Text-to-Image Models

RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D