S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

The paper introduces S2DiT, a novel Streaming Sandwich Diffusion Transformer that leverages efficient attention mechanisms, a budget-aware sandwich architecture, and a 2-in-1 distillation framework to achieve high-fidelity, real-time video generation on mobile devices with performance comparable to server-grade models.

Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, Sergey Tulyakov, Yanzhi Wang, Anil Kag, Yanyu Li · 2026-03-10 · cs
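
The abstract does not spell out the sandwich layout, but one plausible reading is full-capacity blocks bracketing cheaper ones under a compute budget. Below is a minimal, hypothetical PyTorch sketch of that idea; the block types, widths, and counts are illustrative, not S2DiT's actual design.

```python
# Hypothetical "sandwich" transformer stack: heavy blocks at both ends,
# lightweight blocks in the middle, sized to meet a compute budget.
# All ratios and depths here are made-up examples.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, mlp_ratio):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

def sandwich_stack(dim=512, n_heavy=2, n_light=8):
    # Heavy blocks (4x MLP) bookend cheap blocks (1x MLP): the "sandwich".
    layers = [Block(dim, 4.0) for _ in range(n_heavy)]
    layers += [Block(dim, 1.0) for _ in range(n_light)]
    layers += [Block(dim, 4.0) for _ in range(n_heavy)]
    return nn.Sequential(*layers)

x = torch.randn(1, 64, 512)          # (batch, tokens, dim)
print(sandwich_stack()(x).shape)     # torch.Size([1, 64, 512])
```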

ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance

This paper introduces ReViP, a novel Vision-Language-Action framework that mitigates "false completion" failures caused by proprioceptive bias through vision-proprioception rebalancing and a new benchmark suite, achieving significant performance gains over existing models.

Zhuohao Li, Yinghao Li, Jian-Jian Jiang, Lang Zhou, Tianyu Zhang, Jiadong Yin, Mu Lin, Yi-Kin Wei, Wei-Shi Zheng · 2026-03-10 · cs
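
As a rough illustration of vision-proprioception rebalancing, the hypothetical fusion module below drops out and gates proprioceptive features so the policy cannot rely on them alone; ReViP's actual mechanism and benchmark are not reproduced here.

```python
# Illustrative rebalancing of vision vs. proprioception in a VLA-style
# policy. The dropout-plus-gating scheme is an assumption, not ReViP's
# actual method; it shows one way to curb proprioceptive shortcuts.
import torch
import torch.nn as nn

class RebalancedFusion(nn.Module):
    def __init__(self, vis_dim, prop_dim, out_dim, prop_dropout=0.3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)
        self.prop_proj = nn.Linear(prop_dim, out_dim)
        # Randomly dropping proprioception forces the policy to keep
        # attending to vision instead of "counting steps" to completion.
        self.prop_dropout = nn.Dropout(prop_dropout)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, vis_feat, prop_feat):
        v = self.vis_proj(vis_feat)
        p = self.prop_dropout(self.prop_proj(prop_feat))
        g = self.gate(torch.cat([v, p], dim=-1))   # per-dim mixing weight
        return g * v + (1.0 - g) * p

fusion = RebalancedFusion(vis_dim=768, prop_dim=32, out_dim=256)
print(fusion(torch.randn(4, 768), torch.randn(4, 32)).shape)  # (4, 256)
```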

ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

This paper introduces ScenePilot-Bench, a large-scale benchmark built on the diverse ScenePilot-4K dataset to comprehensively evaluate and advance vision-language models in autonomous driving through multi-granularity annotations and a safety-aware, four-axis assessment framework.

Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Li Zhang, Bingzhao Gao, Daxin Tian, Jianqiang Wang, Hong Chen · 2026-03-10 · cs

Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering

This paper proposes QSTar, a novel query-guided spatial-temporal-frequency interaction method enhanced by a Query Context Reasoning block. By deeply integrating question-guided clues and audio frequency characteristics with visual perception, QSTar significantly improves Audio-Visual Question Answering performance and outperforms existing multimodal approaches on multiple benchmarks.

Kun Li, Michael Ying Yang, Sami Sebastian Brandt · 2026-03-10 · cs
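
A minimal sketch of what query-guided spatial-temporal-frequency interaction could look like: the question embedding cross-attends over three feature streams and the results are fused. All module names and dimensions below are assumptions for illustration, not QSTar's actual blocks.

```python
# Question embedding as the attention query over spatial, temporal,
# and frequency streams; outputs are concatenated and fused.
import torch
import torch.nn as nn

class QueryGuidedSTF(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, q, spat, temp, freq):
        # q: (B, 1, D) question embedding used as the attention query.
        s = self.spatial(q, spat, spat, need_weights=False)[0]
        t = self.temporal(q, temp, temp, need_weights=False)[0]
        f = self.freq(q, freq, freq, need_weights=False)[0]
        return self.fuse(torch.cat([s, t, f], dim=-1))  # (B, 1, D)

m = QueryGuidedSTF()
q = torch.randn(2, 1, 256)
out = m(q, torch.randn(2, 196, 256),   # spatial tokens
           torch.randn(2, 32, 256),    # temporal tokens
           torch.randn(2, 64, 256))    # frequency tokens
print(out.shape)  # torch.Size([2, 1, 256])
```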

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache is a training-free framework that accelerates Flow Matching inference by replacing instantaneous velocity caching with an average-velocity approach using cached Jacobian-vector products and a trajectory-stability scheduling strategy, achieving significant speedups (up to 4.56X) while maintaining high generation quality across models like FLUX.1 and HunyuanVideo.

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian · 2026-03-10 · cs.LG
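
The abstract names two concrete ingredients: average velocity and cached Jacobian-vector products. The sketch below shows how a single JVP yields the total derivative of the velocity along the trajectory, giving a first-order average-velocity step. The caching and trajectory-stability scheduling are omitted, and `velocity` is a toy stand-in for a real flow model.

```python
# Average-velocity stepping for flow matching via one JVP: expand
# v(x, t) to first order along the trajectory and integrate the
# midpoint. Mirrors the idea in the abstract, not MeanCache's code.
import torch
from torch.func import jvp

def velocity(x, t):                      # toy stand-in for a flow model
    return -x * t

def mean_velocity_step(x, t, dt):
    # Total derivative along the flow: d/dt v(x(t), t) = J_x v + J_t,
    # obtained with one JVP using tangents (v, 1). The average velocity
    # over [t, t+dt] is then v + 0.5*dt*dv_dt to first order.
    v, dv_dt = jvp(velocity, (x, t), (velocity(x, t), torch.ones_like(t)))
    v_bar = v + 0.5 * dt * dv_dt
    return x + dt * v_bar

x = torch.randn(4, 8)
t = torch.zeros(()) + 0.1
print(mean_velocity_step(x, t, 0.25).shape)  # torch.Size([4, 8])
```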

Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

To address the challenges of parameter-efficient domain adaptation in V2X collaborative perception, the paper proposes FlowAdapt, a framework leveraging optimal transport theory and a progressive knowledge transfer mechanism to filter redundant data and preserve fine-grained semantics, achieving state-of-the-art performance with only 1% trainable parameters.

Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping Wang · 2026-03-10 · cs
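
FlowAdapt builds on optimal transport theory; as background, here is a generic entropic OT (Sinkhorn) sketch that computes a transport plan between source- and target-domain feature sets. This is standard OT code, not the paper's implementation.

```python
# Entropic optimal transport between two feature sets via Sinkhorn
# iterations; the plan P couples source and target samples.
import torch

def sinkhorn(cost, eps=0.05, iters=100):
    # cost: (n, m) pairwise distances; returns an (n, m) transport plan.
    cost = cost / cost.mean()              # rescale for numerical stability
    K = torch.exp(-cost / eps)
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0])  # source marginal
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1])  # target marginal
    u, v = a.clone(), b.clone()
    for _ in range(iters):                 # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

src = torch.randn(16, 64)                  # source-domain features
tgt = torch.randn(24, 64)                  # target-domain features
P = sinkhorn(torch.cdist(src, tgt))
print(P.shape, float(P.sum()))             # torch.Size([16, 24]) ~1.0
```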

SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

This paper proposes SToRM, a novel framework that employs a lightweight importance predictor, supervised training with pseudo-labels, and an anchor-context merging module to significantly reduce visual token redundancy in multi-modal LLMs for autonomous driving, achieving up to 30x computational savings while maintaining end-to-end performance comparable to using all tokens.

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun · 2026-03-10 · cs
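
A hypothetical sketch of the keep-and-merge pattern the abstract describes: a lightweight scorer picks anchor tokens and the remaining tokens are average-merged into their nearest anchor. SToRM trains its predictor with pseudo-labels; the scorer here is untrained and purely for shape demonstration.

```python
# Score visual tokens, keep the top-k as anchors, and fold every token
# into its most similar anchor by averaging, so dropped-token context
# is retained in the reduced set.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reduce_tokens(tokens, scorer, keep=64):
    # tokens: (B, N, D) visual tokens from the vision encoder.
    scores = scorer(tokens).squeeze(-1)                    # (B, N)
    idx = scores.topk(keep, dim=1).indices                 # top-k anchors
    anchors = torch.gather(
        tokens, 1, idx[..., None].expand(-1, -1, tokens.shape[-1]))
    sim = torch.einsum("bnd,bkd->bnk", tokens, anchors)    # token-anchor sim
    onehot = F.one_hot(sim.argmax(-1), keep).float()       # (B, N, K)
    merged = torch.einsum("bnk,bnd->bkd", onehot, tokens)  # sum per anchor
    return merged / onehot.sum(1).clamp(min=1).unsqueeze(-1)

scorer = nn.Sequential(nn.Linear(1024, 128), nn.GELU(), nn.Linear(128, 1))
out = reduce_tokens(torch.randn(2, 576, 1024), scorer, keep=64)
print(out.shape)  # torch.Size([2, 64, 1024])
```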

3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

The paper introduces 3DMedAgent, a unified agent that leverages a flexible MLLM and long-term structured memory to coordinate heterogeneous tools for decomposing complex 3D CT analysis into tractable 2D-based subtasks, thereby enabling general-purpose 3D medical understanding without 3D-specific fine-tuning.

Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin · 2026-03-10 · cs
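
A toy sketch of the perception-to-understanding decomposition: slice a 3D CT volume into 2D views and let a 2D tool produce findings the agent can aggregate. `run_2d_tool` is a hypothetical stand-in for the heterogeneous tools the paper coordinates.

```python
# Reduce a 3D CT volume to sampled 2D slices so off-the-shelf 2D models
# can handle each subtask; findings are collected for later reasoning.
import numpy as np

def run_2d_tool(slice_2d):
    # Placeholder: a real agent would call a 2D segmenter/VLM here.
    return {"mean_hu": float(slice_2d.mean())}

def analyze_volume(volume, axis=0, stride=8):
    # volume: (D, H, W) CT array; sample slices along one axis.
    findings = []
    for i in range(0, volume.shape[axis], stride):
        s = np.take(volume, i, axis=axis)
        findings.append({"index": i, **run_2d_tool(s)})
    return findings

ct = np.random.randint(-1000, 400, size=(64, 128, 128)).astype(np.float32)
findings = analyze_volume(ct)
print(len(findings), findings[0])
```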

OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

OVerSeeC is a zero-shot modular framework that leverages large language models and open-vocabulary segmentation to generate executable global costmaps from satellite imagery and natural language instructions, enabling autonomous navigation to adapt to novel entities and dynamic mission constraints without requiring fixed ontologies.

Rwik Rana, Jesse Quattrociocchi, Dongmyeong Lee, Christian Ellis, Amanda Adkins, Adam Uccello, Garrett Warnell, Joydeep Biswas · 2026-03-10 · cs
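
One way to picture the final step, turning open-vocabulary segmentation plus language-derived preferences into a costmap: paint per-class costs onto the segmentation with a lookup table. The class names and cost values below are invented for illustration, and the LLM querying is not shown.

```python
# Map each open-vocabulary class to a traversal cost (costs would come
# from the LLM's reading of the instruction) and paint them per pixel.
import numpy as np

def build_costmap(class_map, class_names, costs, default=50.0):
    # class_map: (H, W) integer ids from an open-vocab segmenter.
    lut = np.full(len(class_names), default, dtype=np.float32)
    for i, name in enumerate(class_names):
        lut[i] = costs.get(name, default)
    return lut[class_map]                       # (H, W) costmap

names = ["road", "grass", "water", "building"]           # hypothetical
costs = {"road": 1.0, "grass": 10.0, "water": 255.0, "building": 255.0}
seg = np.random.randint(0, len(names), size=(256, 256))
costmap = build_costmap(seg, names, costs)
print(costmap.shape, costmap.min(), costmap.max())
```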

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

This paper introduces Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting and benchmark for autonomous driving that addresses both unseen domains and unseen categories. It also proposes S2-Corr, a state-space-driven mechanism that refines text-image correlations in Vision-Language Models for robust performance across diverse urban environments.

Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong · 2026-03-10 · cs

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

UniMatch is a novel coarse-to-fine framework that establishes dense semantic correspondences between strongly non-isometric, cross-category 3D shapes by leveraging class-agnostic segmentation, multimodal language models for part identification, and a rank-based contrastive learning scheme to overcome the limitations of prior isometry-dependent methods.

Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick · 2026-03-10 · cs

Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

This paper proposes an online data manipulation scheme that decomposes training images into independent object, scene, and camera components and recomposes them with perturbed poses to generate diverse training data, thereby improving the data efficiency and performance of monocular 3D object detection models across both fully and sparsely supervised settings.

Zhaonian Kuang, Rui Ding, Meng Yang + 2 more · 2026-03-10 · cs
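
A toy version of the recomposition idea: perturb an object's 3D position and reproject it through the camera intrinsics to obtain a new 2D placement. The KITTI-like intrinsics and noise scale below are arbitrary examples, not the paper's settings.

```python
# Jitter an object's 3D center, then reproject with the pinhole model
# to place it in the image; repeated draws yield diverse training views.
import numpy as np

def project(K, p3d):
    uvw = K @ p3d
    return uvw[:2] / uvw[2]

def perturb_and_project(K, center, t_std=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    new_center = center + rng.normal(0.0, t_std, size=3)   # pose jitter
    return new_center, project(K, new_center)

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])
center = np.array([2.0, 1.5, 20.0])                        # x, y, z (m)
new_c, uv = perturb_and_project(K, center)
print(new_c.round(2), uv.round(1))
```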

Cycle-Consistent Tuning for Layered Image Decomposition

This paper presents a cycle-consistent tuning framework that uses lightweight LoRA adaptation of pretrained diffusion models for robust, high-fidelity layered image decomposition, targeting the challenging case of logo-object separation. The framework enforces bidirectional reconstruction consistency and iteratively refines itself through a progressive self-improving process.

Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-Or, Hui Huang · 2026-03-10 · cs
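
The bidirectional reconstruction consistency can be written as a simple two-direction loss, sketched below with toy stand-ins for the LoRA-tuned decomposition and recomposition passes described in the abstract.

```python
# Cycle consistency in both directions: decompose->recompose should
# reproduce the image, and recompose->decompose should reproduce the
# layers. `decompose`/`recompose` here are toy callables, not the
# paper's diffusion models.
import torch
import torch.nn.functional as F

def cycle_loss(image, decompose, recompose):
    logo, background = decompose(image)            # forward decomposition
    recon = recompose(logo, background)            # re-composite layers
    forward = F.l1_loss(recon, image)
    # Backward direction: recomposed layers must decompose consistently.
    logo2, bg2 = decompose(recompose(logo, background))
    backward = F.l1_loss(logo2, logo) + F.l1_loss(bg2, background)
    return forward + backward

# Toy stand-ins so the sketch runs end to end.
decompose = lambda x: (x * 0.4, x * 0.6)
recompose = lambda l, b: l + b
print(cycle_loss(torch.rand(1, 3, 64, 64), decompose, recompose))
```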

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

This paper proposes "See It, Say It, Sorted," a lightweight, training-free, and plug-and-play framework that mitigates visual hallucination in large vision-language models by iteratively supervising each reasoning step with dynamically extracted visual evidence, thereby significantly improving reasoning accuracy without requiring additional model training.

Yongchang Zhang, Oliver Ma, Tianyi Liu, Guangquan Zhou, Yang Chen · 2026-03-10 · cs

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

WISER is a training-free framework for Zero-Shot Composed Image Retrieval that unifies Text-to-Image and Image-to-Image paradigms through a "retrieve-verify-refine" pipeline, leveraging wider search, adaptive fusion, and self-reflection to significantly outperform existing methods across diverse benchmarks.

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang · 2026-03-10 · cs
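
A guess at what adaptive fusion might look like: weight the text-to-image and image-to-image score vectors per query by how confident each branch is. The peakiness-based confidence heuristic below is an assumption, not WISER's rule, and the retrieve-verify-refine loop is omitted.

```python
# Per-query fusion of T2I and I2I retrieval scores, leaning on whichever
# branch produces a sharper (more confident) score distribution.
import numpy as np

def adaptive_fuse(s_t2i, s_i2i):
    # s_*: (num_gallery,) similarity scores for one query.
    def confidence(s):
        p = np.exp(s - s.max()); p /= p.sum()
        return p.max()                       # peaky scores => confident
    ct, ci = confidence(s_t2i), confidence(s_i2i)
    alpha = ct / (ct + ci)                   # lean on the surer branch
    return alpha * s_t2i + (1 - alpha) * s_i2i

rng = np.random.default_rng(0)
fused = adaptive_fuse(rng.normal(size=100), rng.normal(size=100))
print(int(fused.argmax()))                   # top-1 gallery index
```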