cs.CV papers | Gist.Science

Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

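This paper addresses data scarcity and distribution shift in medical image segmentation by proposing a cross-layer feature control method built on an exchangeability assumption and a causal framework. By mitigating the distribution discrepancies introduced when pooling data, it outperforms existing baselines in segmentation performance on five datasets.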

Ayush Roy, Samin Enam, Jun Xia + 2 more | 2026-02-27 | 🤖 cs.LG

LayerT2V: A Unified Multi-Layer Video Generation Framework

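This paper proposes the LayerT2V framework, which exploits the high compression of video generation backbones to serialize multi-layer representations and model them jointly. It is the first to generate, in a single inference pass, semantically consistent and editable layered video comprising background, foreground, and transparency (alpha) channels, and it is accompanied by VidLayer, the first large-scale layered video dataset.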

Guangzhao Li, Kangrui Cen, Baixuan Zhao + 5 more | 2026-02-27 | 🤖 cs.AI

RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer

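RAP proposes a unified framework based on a video diffusion transformer. By introducing a hybrid attention mechanism and a static-dynamic training and inference paradigm, it generates high-fidelity, audio-visually synchronized audio-driven portrait animation while meeting real-time latency and memory constraints.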

Fangyu Du, Taiqing Li, Qian Qiao + 7 more | 2026-02-27 | ⚡ eess

Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration

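This paper proposes MixCache, a training-free framework that introduces a context-aware cache triggering mechanism and an adaptive hybrid-granularity decision strategy. It addresses the problem that existing caching methods for video DiT models are limited to a single caching scheme and struggle to balance generation quality against inference speed, substantially increasing video generation speedup while maintaining high generation quality.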

Yuanxin Wei, Lansong Diao, Bujiao Chen + 6 more | 2026-02-27 | 🤖 cs.LG

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

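This paper proposes Dyslexify, a training-free defense that uses causal analysis to locate and selectively ablate the attention heads in CLIP responsible for extracting textual information, effectively resisting typographic attacks on multimodal systems without significantly degrading standard performance.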

Lorenz Hufe, Constantin Venhoff, Erblina Purelku + 3 more | 2026-02-27 | 🤖 cs.AI

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

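This paper proposes an image-oriented, self-adaptive method for constructing multimodal safety datasets. Starting from images, it automatically generates the RMS dataset, containing 35,000 image-text pairs with guidance responses, and introduces a standardized evaluation metric, addressing the difficulty existing risk-oriented methods have in covering complex real-world safety scenarios and the lack of a unified evaluation standard.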

Jingen Qu, Lijun Li, Bo Zhang + 2 more | 2026-02-27 | 💬 cs.CL

Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

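This paper proposes Loc$^2$, an interpretable cross-view localization method. It learns feature correspondences between ground and aerial images with weak supervision, uses monocular depth prediction to lift matched points into bird's-eye-view space, and performs scale-aware alignment, achieving high-accuracy 3-DoF pose estimation without pixel-level annotations.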

Zimin Xia, Chenghao Xu, Alexandre Alahi | 2026-02-27 | 💻 cs

ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

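This paper proposes ST-GS, a spatial-temporal Gaussian splatting framework. Through a guided spatial aggregation strategy and a geometry-aware temporal fusion scheme, it strengthens multi-view spatial interaction and multi-frame temporal consistency in Gaussian-based 3D semantic occupancy prediction, outperforming existing methods in both performance and temporal coherence on the nuScenes benchmark.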

Xiaoyang Yan, Muleilan Pei, Shaojie Shen | 2026-02-27 | 💻 cs

Visual Instruction Pretraining for Domain-Specific Foundation Models

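This paper proposes the Visual Instruction Pretraining (ViTP) framework, which combines a vision-language model with Visual Robustness Learning (VRL) and uses reasoning data from the target domain to enhance foundational perception models, achieving new state-of-the-art performance on multiple downstream tasks in domains such as remote sensing and medical imaging.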

Yuxuan Li, Yicheng Zhang, Wenhao Tang + 4 more | 2026-02-27 | 💻 cs

PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

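This paper proposes PartSAM, the first promptable part segmentation model trained on large-scale native 3D data. With a triplane-based dual-branch encoder architecture and a purpose-built model-in-the-loop annotation pipeline, it overcomes the limitations of existing methods based on 2D transfer and achieves high-accuracy open-world part segmentation of both the surfaces and interior structures of 3D objects.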

Zhe Zhu, Le Wan, Rui Xu + 6 more | 2026-02-27 | 💻 cs

Secure and reversible face anonymization with diffusion models

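This paper proposes the first reversible face anonymization framework based on diffusion models. Through secret-key conditioning, it provides secure anonymization in which only authorized parties can accurately recover the identity, while ensuring high-quality generated images.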

Pol Labarbarie, Vincent Itier, William Puech | 2026-02-27 | 🤖 cs.LG

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

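This paper proposes an asynchronous diffusion model framework that assigns independent denoising timesteps to different pixels, letting prompt-related regions draw on clearer contextual information and significantly improving text-image alignment in text-to-image generation.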

Zijing Hu, Yunze Tong, Fengda Zhang + 3 more | 2026-02-27 | 💻 cs

Detection and Measurement of Hailstones with Multimodal Large Language Models

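This study uses pretrained multimodal large language models to analyze 474 hailstone images posted on social media in Austria between 2022 and 2024. It shows that, without fine-tuning and with a reference-object prompting strategy, hailstone diameter can be estimated automatically with a mean absolute error of about 1.12 cm, providing a spatially dense data source that complements conventional hail sensors.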

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | 🤖 cs.AI

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

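This paper proposes FlowRVS, a new framework that recasts referring video segmentation as a language-guided continuous deformation from a holistic video representation to the target mask. It leverages the strengths of pretrained text-to-video models to overcome the limitations of conventional cascaded methods and achieves state-of-the-art performance on multiple benchmarks.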

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | 💻 cs

G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

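G4Splat proposes a new method for 3D scene reconstruction with generative priors. It uses planar structures to derive accurate metric-scale depth maps as geometric supervision and combines them with a video diffusion model to resolve multi-view inconsistency, achieving high-quality, geometrically accurate scene completion in challenging settings such as single-view input and unposed video.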

Junfeng Ni, Yixin Chen, Zhifei Yang + 4 more | 2026-02-27 | 💻 cs

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

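This paper proposes PoSh, a metric that uses scene graphs to guide LLMs-as-a-judge, and releases DOCENT, a dataset with annotations from art-domain experts. Together they address the difficulty existing evaluation methods have in measuring fine-grained attribute and relation errors in long image descriptions, enabling more accurate assessment of how well vision-language models describe complex scenes.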

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford + 7 more | 2026-02-27 | 💬 cs.CL

Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

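This study proposes a label-efficient method based on self-supervised deep learning. Using a model pretrained on a large volume of unlabeled 1-meter-resolution aerial imagery and only 1,000 labeled samples, it achieves high-accuracy land cover classification over large areas of Mississippi, USA, overcoming the scarcity of labeled data that bottlenecks high-resolution mapping.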

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira + 1 more | 2026-02-27 | 💻 cs

Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
