UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign proposes a novel post-hoc calibration framework that aligns frozen vision-language models with human preferences for urban scene assessment by mining interpretable dimensions, extracting robust concept scores via an Observer-Debater-Judge chain, and calibrating them through locally-weighted ridge regression, achieving state-of-the-art accuracy without any model retraining.

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi · 2026-03-09 · cs
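The UrbanAlign summary names locally-weighted ridge regression as its calibration step. As a hedged illustration of that standard technique only (not the paper's implementation), the sketch below fits a kernel-weighted ridge model mapping made-up concept scores `X` to preference labels `y`; the bandwidth and regularization values are arbitrary assumptions:

```python
import numpy as np

def locally_weighted_ridge(X, y, x_query, bandwidth=1.0, lam=1e-2):
    """Predict y at x_query via ridge regression with Gaussian sample weights."""
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))       # closer samples weigh more
    W = np.diag(w)
    A = X.T @ W @ X + lam * np.eye(X.shape[1])     # weighted, regularized normal equations
    beta = np.linalg.solve(A, X.T @ W @ y)
    return x_query @ beta

# synthetic "concept scores" and a known linear preference signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=200)
x_q = np.array([0.2, 0.1, -0.3])
pred = locally_weighted_ridge(X, y, x_q)
```

The local weighting lets the calibration adapt to the neighborhood of each query instead of fitting one global linear map, which is the usual motivation for this family of estimators.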

Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

This paper demonstrates that affordance reasoning in Vision Foundation Models can be achieved in a zero-shot, training-free manner by fusing DINO's inherent geometric part structures with Flux's verb-conditioned interaction priors, thereby establishing geometric and interaction perception as the fundamental, composable building blocks of affordance understanding.

Qing Zhang, Xuesong Li, Jing Zhang · 2026-03-09 · cs

StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

StoryTailor is a zero-shot pipeline that generates temporally coherent, action-rich multi-subject visual narratives on a single RTX 4090 by synergizing Gaussian-Centered Attention, Action-Boost Singular Value Reweighting, and a Selective Forgetting Cache to simultaneously ensure action faithfulness, subject identity fidelity, and cross-frame background continuity.

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang · 2026-03-09 · cs

UniVBench: Towards Unified Evaluation for Video Foundation Models

The paper introduces UniVBench, a comprehensive benchmark featuring 200 high-quality, human-created multi-shot videos and a unified agentic evaluation system (UniV-Eval) to holistically assess video foundation models across understanding, generation, editing, and reconstruction tasks, addressing the limitations of existing fragmented and task-specific evaluations.

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu · 2026-03-09 · cs

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

The paper introduces DPCache, a training-free acceleration framework for diffusion models that formulates sampling as a global path planning problem and utilizes dynamic programming on a path-aware cost tensor to select optimal key timesteps, thereby achieving significant speedups with minimal quality loss and even surpassing full-step baselines in certain metrics.

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang · 2026-03-09 · cs
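The DPCache summary describes dynamic programming over a path-aware cost tensor to pick optimal key timesteps. The sketch below shows only the generic DP idea behind such a selection, with a hypothetical quadratic jump cost standing in for the paper's (unspecified) cost tensor:

```python
import math

def plan_key_steps(cost, T, k):
    """Pick k key timesteps from 0..T-1 (always including 0 and T-1)
    that minimize the summed transition cost cost[i][j] along the path."""
    INF = math.inf
    # dp[m][j]: min cost of a path ending at timestep j using m key steps
    dp = [[INF] * T for _ in range(k + 1)]
    back = [[-1] * T for _ in range(k + 1)]
    dp[1][0] = 0.0
    for m in range(2, k + 1):
        for j in range(1, T):
            for i in range(j):
                c = dp[m - 1][i] + cost[i][j]
                if c < dp[m][j]:
                    dp[m][j] = c
                    back[m][j] = i
    # walk the backpointers from the final timestep to recover the path
    path, j, m = [T - 1], T - 1, k
    while m > 1:
        j = back[m][j]
        path.append(j)
        m -= 1
    return path[::-1], dp[k][T - 1]

# toy cost: caching across a longer span of timesteps hurts quadratically more
T = 10
cost = [[(j - i) ** 2 for j in range(T)] for i in range(T)]
path, total = plan_key_steps(cost, T, k=4)
```

With this toy cost the optimum spreads the jumps evenly (`[0, 3, 6, 9]`), which matches the intuition that a global plan beats greedy per-step caching heuristics.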

Unified Learning-to-Rank for Multi-Channel Retrieval in Large-Scale E-Commerce Search

This paper proposes a unified, query-dependent learning-to-rank model that effectively merges heterogeneous retrieval channels for large-scale e-commerce search by jointly optimizing business KPIs and capturing short-term user intent, resulting in a 2.85% conversion lift and deployment on Target.com while meeting strict latency constraints.

Aditya Gaydhani, Guangyue Xu, Dhanush Kamath, Ankit Singh, Alex Li · 2026-03-09 · cs

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper introduces Synthetic Visual Genome 2 (SVG2), a large-scale automated panoptic video scene graph dataset with over 636K videos, and presents TRaSER, a novel model that leverages trajectory-aligned token mechanisms to significantly outperform existing baselines in scene graph generation and downstream video question answering tasks.

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna · 2026-03-09 · cs

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

This paper introduces PanScale, a large-scale cross-scale pansharpening dataset and benchmark, alongside ScaleFormer, a novel transformer-based architecture that achieves superior generalization across varying image resolutions by reframing scale adaptation as sequence length generalization through tokenization and rotary positional encoding.

Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang · 2026-03-09 · cs
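ScaleFormer's claimed mechanism, recasting scale adaptation as sequence-length generalization via rotary positional encoding, rests on RoPE's relative-position property. A minimal NumPy sketch of standard RoPE (not the paper's code) shows why dot products between rotated tokens depend only on the position offset, which is what lets attention extrapolate to longer token sequences:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional encoding to x of shape (seq_len, dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)    # per-pair rotation rates
    angles = np.outer(np.arange(seq), freqs)     # (seq, half): position * rate
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1_t, x2_t) feature pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

x = np.ones((8, 4))   # identical token vectors at every position
r = rope(x)
```

Because each position applies an orthogonal rotation, `r[i] @ r[j]` is a function of `j - i` alone; here `r[2] @ r[5]` equals `r[0] @ r[3]` exactly.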

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

This paper introduces Think-as-You-See (TaYS), a unified framework that enables concurrent, streaming Chain-of-Thought reasoning for Large Vision-Language Models by decoupling visual encoding from textual reasoning, thereby outperforming traditional batch and interleaved approaches in both accuracy and latency for real-time video understanding.

Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen · 2026-03-09 · cs