Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

This paper proposes a Self-Critical Inference (SCI) framework that enhances the robustness of Large Vision-Language Models against language bias and sensitivity through multi-round counterfactual reasoning with textual and visual perturbations, alongside a new Dynamic Robustness Benchmark (DRBench) for model-specific evaluation.

Kaihua Tang, Jiaxin Qi, Jinli Ou, Yuhua Zheng, Jianqiang Huang · 2026-03-10 · cs

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

This paper introduces Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset constructed from raw video streams without human intervention, which provides 4 million high-quality 3D semantic annotations and spatial QA pairs to significantly enhance the training and performance of Vision-Language Models on spatial reasoning tasks.

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong · 2026-03-10 · cs

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

UniUncer is a lightweight, unified framework for end-to-end autonomous driving that jointly estimates and leverages uncertainty for both static map elements and dynamic agents through probabilistic regression, uncertainty-aware query fusion, and adaptive gating, thereby significantly improving trajectory accuracy and planning robustness with minimal computational overhead.

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao · 2026-03-10 · cs

PARSE: Part-Aware Relational Spatial Modeling

The paper introduces PARSE, a framework that uses part-level geometric relations encoded in Part-centric Assembly Graphs to resolve spatial ambiguities. The authors build the PARSE-10K dataset to validate it and show that it significantly enhances both object layout reasoning in vision-language models and the physical realism of generated 3D scenes.

Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu · 2026-03-10 · cs

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

To address the "spatial intelligence gap" where Vision-Language Models struggle with elementary 3D tasks despite strong logical reasoning, the paper introduces 3ViewSense, a framework that leverages an engineering-inspired "Simulate-and-Reason" mechanism to ground spatial understanding in orthographic views, significantly improving performance on occlusion-heavy counting and view-consistent reasoning benchmarks.

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng · 2026-03-10 · cs.CL

AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

The paper proposes AR2-4FV, a novel framework for long-term language-guided referring in fixed-view videos that leverages a static background-derived Anchor Bank and a ReID-Gating mechanism to maintain identity continuity and accelerate re-capture during occlusions or scene exits, significantly outperforming existing baselines in re-capture rate and latency.

Teng Yan, Yihan Liu, Jiongxu Chen, Teng Wang, Jiaqi Li, Bingzhuo Zhong · 2026-03-10 · cs

DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising

The paper proposes DECADE, an unsupervised diffusion model that achieves temporally consistent denoising of Rb-82 dynamic cardiac PET images without paired training data, effectively reducing noise while preserving quantitative accuracy for myocardial blood flow and flow reserve metrics.

Yinchi Zhou, Liang Guo, Huidong Xie, Yuexi Du, Ashley Wang, Menghua Xia, Tian Yu, Ramesh Fazzone-Chettiar, Christopher Weyman, Bruce Spottiswoode, Vladimir Panin, Kuangyu Shi, Edward J. Miller, Attila Feher, Albert J. Sinusas, Nicha C. Dvornek, Chi Liu · 2026-03-10 · cs

MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

This paper introduces MedQ-Deg, a comprehensive benchmark featuring 24,894 expert-calibrated question-answer pairs across 18 degradation types and 7 imaging modalities, which reveals that mainstream medical multimodal large language models suffer systematic performance drops and exhibit the "AI Dunning-Kruger Effect" of overconfidence under image quality degradations.

Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu · 2026-03-10 · cs

Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery

This paper proposes the Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework, which addresses data heterogeneity in remote sensing satellite imagery by combining geometric knowledge aggregated from local covariance matrices with a dual distillation process, and which significantly outperforms state-of-the-art methods.

Luyao Zou, Fei Pan, Jueying Li, Yan Kyaw Tun, Apurba Adhikary, Zhu Han, Hayoung Oh · 2026-03-10 · cs

SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

The paper proposes Structured Gaussian Image (SGI), a framework that represents high-resolution images using multi-scale, seed-based structured 2D Gaussians generated by lightweight MLPs, achieving significant compression and faster convergence compared to existing unstructured 2D Gaussian methods while maintaining high image fidelity.

Zixuan Pan, Kaiyuan Tang, Jun Xia, Yifan Qin, Lin Gu, Chaoli Wang, Jianxu Chen, Yiyu Shi · 2026-03-10 · cs