cs.CV papers | Gist.Science

Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

This paper proposes a Self-Critical Inference (SCI) framework that enhances the robustness of Large Vision-Language Models against language bias and sensitivity through multi-round counterfactual reasoning with textual and visual perturbations, alongside a new Dynamic Robustness Benchmark (DRBench) for model-specific evaluation.

Kaihua Tang, Jiaxin Qi, Jinli Ou, Yuhua Zheng, Jianqiang Huang | 2026-03-10 | cs

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

This paper introduces Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset constructed from raw video streams without human intervention, which provides 4 million high-quality 3D semantic annotations and spatial QA pairs to significantly enhance the training and performance of Vision-Language Models on spatial reasoning tasks.

Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong | 2026-03-10 | cs

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS is an efficient, rasterization-based framework that achieves state-of-the-art novel view synthesis on reflective scenes by decoupling surface geometry from specular reflections using a dual Gaussian representation and a lightweight adaptive mixing shader, thereby avoiding the high computational cost of explicit ray tracing.

Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka | 2026-03-10 | cs

FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration

This paper introduces FusionRegister, a general and efficient cross-modality registration framework guided by visual priors that directly corrects misalignment within fused infrared and visible images, thereby enhancing detail alignment and robustness without requiring extensive pre-registration.

Congcong Bian, Haolong Ma, Hui Li, Zhongwei Shen, Xiaoqing Luo, Xiaoning Song, Xiao-Jun Wu | 2026-03-10 | cs

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

UniUncer is a lightweight, unified framework for end-to-end autonomous driving that jointly estimates and leverages uncertainty for both static map elements and dynamic agents through probabilistic regression, uncertainty-aware query fusion, and adaptive gating, thereby significantly improving trajectory accuracy and planning robustness with minimal computational overhead.

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao | 2026-03-10 | cs

FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT

FrameVGGT addresses the unbounded memory growth in streaming Visual Geometry Transformers by introducing a frame-driven rolling explicit-memory framework that aggregates frame-level evidence into compact prototypes, enabling stable long-sequence 3D perception under strict memory budgets.

Zhisong Xu, Takeshi Oishi | 2026-03-10 | cs

RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation

This paper introduces RoboPCA, a pose-centered affordance learning framework that jointly predicts task-appropriate contact regions and poses from human demonstrations via the Human2Afford data curation pipeline, enabling robots to effectively manipulate objects with improved consistency and generalization across tasks and categories.

Zhanqi Xiao, Ruiping Wang, Xilin Chen | 2026-03-10 | cs

Compressed-Domain-Aware Online Video Super-Resolution

This paper proposes CDA-VSR, a compressed-domain-aware online video super-resolution network that leverages motion vectors, residual maps, and frame types to achieve real-time, high-quality reconstruction with significantly reduced computational cost compared to state-of-the-art methods.

Yuhang Wang, Hai Li, Shujuan Hou, Zhetao Dong, Xiaoyao Yang | 2026-03-10 | cs

Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation

This paper introduces the Masked Motion Diffusion Model (MMDM), a diffusion-based framework equipped with a Kinematic Attention Aggregation mechanism that learns context-adaptive motion priors to effectively reconstruct, refine, and complete 3D human motion from incomplete or noisy data.

Junkun Jiang, Jie Chen, Ho Yin Au, Jingyu Xiang | 2026-03-10 | cs

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

TDM-R1 introduces a novel reinforcement learning paradigm that enables few-step diffusion models to effectively incorporate non-differentiable rewards by decoupling surrogate reward learning from generator training, achieving state-of-the-art performance across various metrics and scaling to powerful models like Z-Image with only 4 inference steps.

Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang | 2026-03-10 | cs

PARSE: Part-Aware Relational Spatial Modeling

The paper introduces PARSE, a framework that uses part-level geometric relations encoded in Part-centric Assembly Graphs to resolve spatial ambiguities. Validated with the newly created PARSE-10K dataset, it significantly improves both object layout reasoning in vision-language models and the physical realism of generated 3D scenes.

Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu | 2026-03-10 | cs

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

Vision-Language Models struggle with elementary 3D tasks despite strong logical reasoning, a shortfall the paper calls the "spatial intelligence gap". To address it, the paper introduces 3ViewSense, a framework whose engineering-inspired "Simulate-and-Reason" mechanism grounds spatial understanding in orthographic views, significantly improving performance on occlusion-heavy counting and view-consistent reasoning benchmarks.

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng | 2026-03-10 | cs.CL

AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

The paper proposes AR2-4FV, a novel framework for long-term language-guided referring in fixed-view videos that leverages a static background-derived Anchor Bank and a ReID-Gating mechanism to maintain identity continuity and accelerate re-capture during occlusions or scene exits, significantly outperforming existing baselines in re-capture rate and latency.

Teng Yan, Yihan Liu, Jiongxu Chen, Teng Wang, Jiaqi Li, Bingzhuo Zhong | 2026-03-10 | cs

DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising

The paper proposes DECADE, an unsupervised diffusion model that achieves temporally consistent denoising of Rb-82 dynamic cardiac PET images without paired training data, effectively reducing noise while preserving quantitative accuracy for myocardial blood flow and flow reserve metrics.

Yinchi Zhou, Liang Guo, Huidong Xie, Yuexi Du, Ashley Wang, Menghua Xia, Tian Yu, Ramesh Fazzone-Chettiar, Christopher Weyman, Bruce Spottiswoode, Vladimir Panin, Kuangyu Shi, Edward J. Miller, Attila Feher, Albert J. Sinusas, Nicha C. Dvornek, Chi Liu | 2026-03-10 | cs

MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

This paper introduces MedQ-Deg, a comprehensive benchmark featuring 24,894 expert-calibrated question-answer pairs across 18 degradation types and 7 imaging modalities, which reveals that mainstream medical multimodal large language models suffer systematic performance drops and exhibit the "AI Dunning-Kruger Effect" of overconfidence under image quality degradations.

Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu | 2026-03-10 | cs

Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery

This paper proposes the Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework, which addresses data heterogeneity in remote sensing satellite imagery by combining geometric knowledge aggregated from local covariance matrices with a dual distillation process, and significantly outperforms state-of-the-art methods.

Luyao Zou, Fei Pan, Jueying Li, Yan Kyaw Tun, Apurba Adhikary, Zhu Han, Hayoung Oh | 2026-03-10 | cs

Parameterized Brushstroke Style Transfer

This paper proposes a novel style transfer method that represents images in the brushstroke domain rather than the traditional pixel domain, offering a more natural and visually superior approach to mimicking artistic styles.

Uma Meleti, Siyu Huang | 2026-03-10 | cs

OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

The paper introduces OrdinalBench, a comprehensive benchmark dataset and evaluation framework designed to diagnose and expose the significant generalization limitations of Vision-Language Models in understanding ordinal numbers and performing sequential reasoning tasks involving large indices and complex paths.

Yusuke Tozaki, Hisashi Miyamori | 2026-03-10 | cs

SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

The paper proposes Structured Gaussian Image (SGI), a framework that represents high-resolution images using multi-scale, seed-based structured 2D Gaussians generated by lightweight MLPs, achieving significant compression and faster convergence compared to existing unstructured 2D Gaussian methods while maintaining high image fidelity.

Zixuan Pan, Kaiyuan Tang, Jun Xia, Yifan Qin, Lin Gu, Chaoli Wang, Jianxu Chen, Yiyu Shi | 2026-03-10 | cs

4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera

This paper introduces 4DRC-OCC, the first framework to fuse 4D radar and camera data for robust 3D semantic occupancy prediction, leveraging their complementary strengths to overcome adverse weather and lighting challenges while utilizing a newly created automatically labeled dataset to reduce annotation costs.

David Ninfa, Andras Palffy, Holger Caesar | 2026-03-10 | cs
