SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

SpatialMem is a memory-centric system that constructs a metric-aligned 3D scaffold from casual egocentric RGB videos to enable efficient, interpretable long-horizon language grounding, retrieval, and QA by linking open-vocabulary object nodes to spatial coordinates without requiring specialized sensors.

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
2026-03-09 🤖 cs.AI

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

This paper introduces SRA 2, a lightweight intrinsic guidance framework that accelerates diffusion transformer training and improves generation quality by aligning intermediate latent features with pre-trained VAE features via a simple projection layer, eliminating the need for external encoders or dual-model setups while incurring minimal computational overhead.
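The core mechanism described above, aligning a diffusion transformer's intermediate features to frozen pre-trained VAE features through a single projection layer, can be sketched roughly as follows. The dimensions, the cosine-similarity alignment loss, and all variable names are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alignment_loss(student, target):
    """Negative mean cosine similarity between projected student
    features and frozen target features (one row per token)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return -np.mean(np.sum(s * t, axis=-1))

# Hypothetical sizes: 16 tokens, transformer width 32, VAE feature width 8.
n_tokens, d_model, d_vae = 16, 32, 8
h_intermediate = rng.normal(size=(n_tokens, d_model))  # diffusion transformer block output
z_vae = rng.normal(size=(n_tokens, d_vae))             # frozen pre-trained VAE features

# The "simple projection layer": one linear map, the only added parameters.
W_proj = rng.normal(size=(d_model, d_vae)) * 0.1

align_loss = cosine_alignment_loss(h_intermediate @ W_proj, z_vae)
# The full objective would combine this with the usual diffusion loss,
# e.g. total = diffusion_loss + lambda_align * align_loss (weights assumed).
```

Because the target features come from an encoder that already exists in the latent-diffusion pipeline, no external representation model or second network copy is needed, which is where the claimed low overhead comes from.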

Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
2026-03-09 💻 cs

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

The paper introduces SpatialReward, a reward model that leverages explicit spatial reasoning to overcome the "Attention Collapse" limitation in existing evaluators, thereby providing fine-grained, accurate signals that significantly enhance online reinforcement learning performance for image editing tasks.

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
2026-03-09 💻 cs

MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery

This paper introduces MiDAS, an open-source, platform-agnostic system that enables non-invasive, time-synchronized multimodal data acquisition for robot-assisted minimally invasive surgery, validates that its external sensing approach achieves gesture recognition performance comparable to proprietary telemetry, and releases the first annotated dataset for hernia repair suturing.

Keshara Weerasinghe (MD), Seyed Hamid Reza Roodabeh (MD), Andrew Hawkins (MD), Zhaomeng Zhang, Zachary Schrader, Homa Alemzadeh
2026-03-09 🤖 cs.LG

DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

DAV-GSWT is a data-efficient framework that combines diffusion priors with active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations, enabling the generation of expansive, photorealistic landscapes without relying on densely sampled exemplar reconstructions.

Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong
2026-03-09 💻 cs

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign proposes a novel post-hoc calibration framework that aligns frozen vision-language models with human preferences for urban scene assessment by mining interpretable dimensions, extracting robust concept scores via an Observer-Debater-Judge chain, and calibrating them through locally-weighted ridge regression, achieving state-of-the-art accuracy without any model retraining.
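The final calibration step named above, locally-weighted ridge regression, is a standard technique: fit a separate ridge regressor per query point, with training samples down-weighted by their distance to the query. A minimal sketch with hypothetical concept-score inputs and an assumed Gaussian kernel; nothing here reflects the paper's implementation:

```python
import numpy as np

def locally_weighted_ridge(X, y, x_query, tau=0.5, lam=1e-2):
    """Predict y at x_query via ridge regression whose samples are
    weighted by a Gaussian kernel on distance to the query point."""
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-d**2 / (2 * tau**2))            # local sample weights
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add bias column
    A = Xb.T @ (w[:, None] * Xb) + lam * np.eye(Xb.shape[1])
    beta = np.linalg.solve(A, Xb.T @ (w * y))   # weighted normal equations
    return np.append(x_query, 1.0) @ beta

rng = np.random.default_rng(1)
# Hypothetical concept scores (e.g. greenery/openness/order) and noisy
# human preference ratings with an underlying linear relation.
X = rng.uniform(0, 1, size=(50, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.normal(size=50)
pred = locally_weighted_ridge(X, y, X[0])
```

Because the calibration touches only the extracted concept scores, the underlying VLM stays frozen, which is the sense in which the method is post-hoc.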

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
2026-03-09 💻 cs

Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

This paper demonstrates that affordance reasoning in Vision Foundation Models can be achieved in a zero-shot, training-free manner by fusing DINO's inherent geometric part structures with Flux's verb-conditioned interaction priors, thereby establishing geometric and interaction perception as the fundamental, composable building blocks of affordance understanding.

Qing Zhang, Xuesong Li, Jing Zhang
2026-03-09 💻 cs

StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

StoryTailor is a zero-shot pipeline that generates temporally coherent, action-rich multi-subject visual narratives on a single RTX 4090 by synergizing Gaussian-Centered Attention, Action-Boost Singular Value Reweighting, and a Selective Forgetting Cache to simultaneously ensure action faithfulness, subject identity fidelity, and cross-frame background continuity.

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
2026-03-09 💻 cs

UniVBench: Towards Unified Evaluation for Video Foundation Models

The paper introduces UniVBench, a comprehensive benchmark featuring 200 high-quality, human-created multi-shot videos and a unified agentic evaluation system (UniV-Eval) to holistically assess video foundation models across understanding, generation, editing, and reconstruction tasks, addressing the limitations of existing fragmented and task-specific evaluations.

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
2026-03-09 💻 cs

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

The paper introduces DPCache, a training-free acceleration framework for diffusion models that formulates sampling as a global path planning problem and utilizes dynamic programming on a path-aware cost tensor to select optimal key timesteps, thereby achieving significant speedups with minimal quality loss and even surpassing full-step baselines in certain metrics.
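A toy illustration of the "path planning" framing: given a cost for jumping between any two timesteps, dynamic programming over a (steps-used, timestep) table finds the cheapest path that visits exactly k transitions. The quadratic cost matrix and all parameters below are invented for illustration and are not the paper's path-aware cost tensor:

```python
import numpy as np

def plan_key_timesteps(cost, k):
    """Pick a path 0 = t_0 < t_1 < ... < t_k = T-1 through the timestep
    grid that minimizes summed transition cost, using k transitions."""
    T = cost.shape[0]
    dp = np.full((k + 1, T), np.inf)
    parent = np.full((k + 1, T), -1, dtype=int)
    dp[0, 0] = 0.0
    for step in range(1, k + 1):
        for j in range(1, T):
            for i in range(j):
                c = dp[step - 1, i] + cost[i, j]
                if c < dp[step, j]:
                    dp[step, j] = c
                    parent[step, j] = i
    # Backtrack from the final timestep to recover the chosen path.
    path, j = [T - 1], T - 1
    for step in range(k, 0, -1):
        j = parent[step, j]
        path.append(j)
    return path[::-1], dp[k, T - 1]

# Hypothetical cost: quadratic penalty for longer cache-reuse jumps,
# standing in for the quality loss a real cost tensor would estimate.
cost = np.array([[(j - i) ** 2 for j in range(6)] for i in range(6)], dtype=float)
path, total = plan_key_timesteps(cost, k=3)
```

In this caricature the planner spreads the k key timesteps as evenly as the cost allows; the actual method would recompute the model only at the selected timesteps and reuse cached features elsewhere.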

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
2026-03-09 💻 cs

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper introduces Synthetic Visual Genome 2 (SVG2), a large-scale automated panoptic video scene graph dataset with over 636K videos, and presents TRaSER, a novel model that leverages trajectory-aligned token mechanisms to significantly outperform existing baselines in scene graph generation and downstream video question answering tasks.

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
2026-03-09 💻 cs

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

This paper introduces PanScale, a large-scale cross-scale pansharpening dataset and benchmark, alongside ScaleFormer, a novel transformer-based architecture that achieves superior generalization across varying image resolutions by reframing scale adaptation as sequence length generalization through tokenization and rotary positional encoding.
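Rotary positional encoding, which the summary credits for the sequence-length generalization, extends to unseen lengths because positions are encoded by a closed-form rotation rather than a learned per-position table. A generic NumPy sketch of that property (not ScaleFormer's actual tokenization or encoding scheme):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional encoding: rotate feature pairs by a
    position-dependent angle. No length-specific parameters, so the
    same function applies to a sequence of any length."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(n), freqs)       # shape (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.hstack([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Tokens from a low-resolution tile (short sequence) and a
# high-resolution tile (long sequence) use the identical encoding:
short = rope(np.ones((16, 8)))
long_ = rope(np.ones((64, 8)))
```

The rotation preserves token norms, and inner products between encoded tokens depend only on their relative offset, which is what lets attention trained at one sequence length transfer to another.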

Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang
2026-03-09 💻 cs