cs.CV Papers | Gist.Science

UniTS: Unified Spatio-Temporal Generative Model for Remote Sensing

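This paper presents UniTS, a unified spatio-temporal generative model built on the flow-matching paradigm. Using an adaptive condition injector and a spatio-temporal-aware modulator, it folds several core remote sensing tasks (time-series reconstruction, cloud removal, semantic change detection, and forecasting) into one general framework, and it significantly outperforms existing task-specific models under a variety of complex conditions.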

Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia | 2026-03-09 | cs

Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

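This paper proposes an event-driven human pose estimation method built on a point-cloud framework. By designing event temporal-slicing convolution and sequence modules together with an edge-enhanced representation, it exploits the spatiotemporal properties of the event stream and significantly improves pose estimation accuracy under sparse-event conditions while remaining computationally efficient.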

Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Pengfei Ye, Haodong Chen, Yuk Ying Chung, Qiang Qu | 2026-03-09 | cs.AI

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

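This paper presents DFIR-DETR, a Transformer-based detector that uses Dynamic Content Feature Aggregation (DCFA), a Dynamic Feature Pyramid Network (DFPN), and a Frequency-domain Iterative Refinement module (FIRC3) to address, respectively, uneven attention allocation, loss of detail during upsampling, and smoothing of high-frequency edges. With a lightweight architecture, it achieves significant small-object detection gains on the NEU-DET and VisDrone datasets.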

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li | 2026-03-09 | cs.LG

Fast-BEV++: Fast by Algorithm, Deployable by Design

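This paper presents the Fast-BEV++ framework. By adopting a hardware-oriented index-gather-reshape pipeline and a learnable depth module, it removes the dependence on custom operators while delivering more than 3x inference speedup, reaching a state-of-the-art 0.488 NDS on the nuScenes benchmark and supporting real-time deployment at over 134 FPS.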

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao | 2026-03-09 | cs

Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts

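This paper addresses the drop in reliability of existing subset-selection-based visual explanation methods in out-of-distribution (OOD) settings. It proposes a training-free framework that combines submodular optimization with uncertainty estimation, guiding subset selection through adaptive weight perturbations, and significantly improves robustness and explanation faithfulness under distribution shift.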

Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan | 2026-03-09 | cs.LG

Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement

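Photo3D proposes a framework that builds a high-quality dataset from GPT-4o-generated images through structure-aligned multi-view synthesis and a detail enhancement scheme. It aims to address the scarcity of real-world 3D assets and thereby significantly improve the realism of geometric structure and texture detail across native 3D generation models.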

Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang | 2026-03-09 | cs

Modular Neural Image Signal Processing

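This paper proposes a highly modular neural image signal processing (ISP) framework. Its fully learning-based approach gives flexible control over the intermediate stages of rendering, improving rendering accuracy, scalability, and style adaptability while also supporting an interactive photo-editing tool with unlimited re-rendering.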

Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown | 2026-03-09 | cs

A Novel Patch-Based TDA Approach for Computed Tomography Imaging

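This paper proposes a new patch-based topological data analysis (TDA) approach for CT imaging. By constructing persistent homology features, it significantly outperforms both the conventional 3D cubical complex algorithm and radiomic features in classification performance (metrics such as accuracy and AUC improve by 2.7% to 8.0% on average) and in computational efficiency. The authors also release a companion Python package, Patch-TDA.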

Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Mohammad Hamghalam, Natalie Gangai, Richard K. G. Do, Alice C. Wei, Amber L. Simpson | 2026-03-09 | cs.LG

Towards Scalable Pre-training of Visual Tokenizers for Generation

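This paper proposes VTP, a unified pre-training framework that jointly optimizes image-text contrastive, self-supervised, and reconstruction losses to address the scaling problem in visual tokenizer pre-training. It shows that strong semantic understanding is key to better generation quality and achieves generation performance that scales effectively with compute.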

Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang | 2026-03-09 | cs

CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

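Through an in-depth analysis and refinement of cross-attention, this paper shows that in vision-language models it can match the performance of directly inserting image tokens while significantly reducing memory and compute costs in long multi-image conversations and real-time video processing.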

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez | 2026-03-09 | cs.AI

Pretraining Frame Preservation for Lightweight Autoregressive Video History Embedding

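This paper proposes a lightweight video history encoder. A frame-query pre-training objective lets it compress long video histories efficiently, and fine-tuning adapts it to autoregressive generation, so that it achieves content consistency comparable to heavyweight models under a limited compute budget.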

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala | 2026-03-09 | cs

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

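This paper presents Spatial4D-Bench, a large-scale, diverse benchmark for 4D spatial intelligence. With roughly 40,000 question-answer pairs spanning 18 tasks and 6 cognitive categories, it aims to comprehensively evaluate the 4D spatial reasoning abilities of multimodal large language models and expose their current limitations.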

Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu | 2026-03-09 | cs

cs.CV

Bayesian Monocular Depth Refinement via Neural Radiance Fields

FlyPose: Towards Robust Human Pose Estimation From Aerial Views

Robust Sparse Signal Recovery with Outliers: A Hard Thresholding Pursuit Approach Based on LAD

SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

FARTrack: Fast Autoregressive Visual Tracking with High Performance

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning