Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

This paper introduces CourtSI, a large-scale dataset and benchmark for evaluating spatial intelligence in vision-language models within sports scenarios, revealing significant performance gaps in existing models while demonstrating that fine-tuning on this data substantially improves accuracy and generalization.

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong · Wed, 11 Ma · cs

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction videos by utilizing sparse motion guidance (wrist coordinates and object bounding boxes), an object-stressed attention mechanism, and a multi-task auxiliary training strategy to overcome limitations in flexibility, generalization, and data scarcity.

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang
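
The "sparse motion guidance" signal described above (wrist coordinates plus object bounding boxes) is typically rasterized into a conditioning tensor for the video generator. A minimal sketch of that encoding step, with an assumed two-channel layout and grid size that are illustrative and not taken from the paper:

```python
import numpy as np

def sparse_guidance_map(wrists, boxes, hw=(64, 64)):
    """Rasterize sparse controls into a 2-channel conditioning map.

    Channel 0 marks wrist keypoints (x, y in [0, 1]); channel 1 fills
    object bounding boxes (x0, y0, x1, y1 in [0, 1]). The two-channel
    layout and resolution are illustrative assumptions.
    """
    h, w = hw
    cond = np.zeros((2, h, w))
    for x, y in wrists:
        # Mark a single cell per wrist keypoint.
        cond[0, int(y * (h - 1)), int(x * (w - 1))] = 1.0
    for x0, y0, x1, y1 in boxes:
        # Fill the cells covered by the object box.
        cond[1, int(y0 * h):int(y1 * h) + 1, int(x0 * w):int(x1 * w) + 1] = 1.0
    return cond
```

A map like this can be concatenated with the noisy latent along the channel axis, which is a common way to inject per-frame spatial guidance into a diffusion backbone.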

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

The paper introduces InternVL-U, a lightweight 4B-parameter unified multimodal model that democratizes advanced understanding, reasoning, generation, and editing capabilities by employing a modular architecture and a reasoning-centric data synthesis pipeline, achieving a performance-efficiency balance that lets it outperform significantly larger baselines such as BAGEL.

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang

MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities

This paper introduces MissBench, a benchmark and framework for multimodal affective computing that addresses the gap in evaluating models under realistic, imbalanced missing modality conditions by standardizing protocols and proposing new diagnostic metrics (MEI and MLI) to reveal hidden modality inequities and optimization imbalances.

Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

This paper introduces VLM-Loc, a framework that leverages large vision-language models to achieve precise text-to-point-cloud localization by transforming 3D maps into bird's-eye-view images and scene graphs for enhanced spatial reasoning, alongside the release of the CityLoc benchmark for systematic evaluation.

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu
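
The map transformation at the heart of this pipeline, projecting a 3D point cloud into a bird's-eye-view image that a VLM can read, can be sketched as a simple rasterization. The crop bounds, cell size, and max-height aggregation below are illustrative assumptions, not the paper's actual parameters:

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    """Rasterize an (N, 3) point cloud into a BEV height map.

    Each cell stores the maximum z (height) of the points falling into
    it; empty cells stay at 0. Ranges and cell size are illustrative.
    """
    points = np.asarray(points, dtype=float)
    w = int((x_range[1] - x_range[0]) / cell)
    h = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((h, w))
    # Keep only points inside the crop.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[m]
    cols = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    rows = ((pts[:, 1] - y_range[0]) / cell).astype(int)
    # Max-height aggregation per cell (handles repeated indices correctly).
    np.maximum.at(bev, (rows, cols), pts[:, 2])
    return bev
```

The resulting 2D array can be normalized and saved as an image, giving the VLM a top-down view to ground text queries against; the scene-graph side of the method is a separate step not shown here.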

BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling

BrainSTR is a spatio-temporal contrastive learning framework that enhances the interpretability of dynamic brain network modeling for neuropsychiatric diagnosis by adaptively partitioning brain states, identifying critical phases, and extracting sparse, disease-specific connectivity patterns to construct a discriminative semantic space validated across ASD, BD, and MDD datasets.

Guiliang Guo, Guangqi Wen, Lingwen Liu, Ruoxian Song, Peng Cao, Jinzhu Yang, Fei Wang, Xiaoli Liu, Osmar R. Zaiane

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

ConfCtrl is a confidence-aware video interpolation framework that enables precise camera control in video diffusion for novel view synthesis by combining confidence-weighted point cloud projections with a Kalman-inspired predict-update mechanism to balance pose guidance and geometric consistency while reconstructing unseen regions.

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada
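
The "Kalman-inspired predict-update mechanism" named above can be illustrated in its scalar textbook form: a prediction is corrected by a measurement in proportion to relative confidence. This is a generic sketch of the underlying idea, not the paper's actual formulation:

```python
def predict_update(prior, prior_var, measurement, meas_var):
    """One scalar Kalman-style step.

    The gain weights the measurement by how confident we are in it
    relative to the prior; low-confidence (high-variance) measurements
    barely move the estimate.
    """
    gain = prior_var / (prior_var + meas_var)   # in [0, 1]
    post = prior + gain * (measurement - prior)  # update toward measurement
    post_var = (1.0 - gain) * prior_var          # uncertainty shrinks
    return post, post_var
```

In ConfCtrl's setting the analogous roles would be played by the pose-guided prediction and the confidence-weighted point-cloud projection, so that unreliable projected pixels defer to the prediction rather than corrupting it.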

RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

This paper introduces a new fine-grained Audio-Visual Learning task called Region-Aware Sound Source Understanding (RA-SSU), supported by two novel datasets (f-Music and f-Lifescene) and a state-of-the-art model named SSUFormer, which utilizes specialized modules to achieve precise sound source segmentation and detailed frame-level textual descriptions.

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This paper introduces Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³), a novel task addressed by the Dual-Clue enhanced Prototype Growing Network (DCPGN) which utilizes a Multi-Label Prototype Growing Module and a Dual-Clue Consistency Module to effectively bridge the inter-view gap and adapt models online without target-view training data.

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li

LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that leverages a fine-tuned Vision Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner, achieving state-of-the-art performance on multiple benchmarks by effectively resolving visual ambiguities through language.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry

ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

This paper introduces ENIGMA-360, a publicly released, temporally synchronized ego-exo dataset containing 360 annotated procedural videos from real industrial scenarios to advance human behavior understanding and establish baselines for tasks like action segmentation and interaction detection.

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D'Ambra, Antonino Furnari, Giovanni Maria Farinella

Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

This paper introduces Step-Aware Contrastive Alignment (SACA), a novel framework that enhances Vision-Language Navigation in Continuous Environments by utilizing a perception-grounded auditor to extract dense, step-level supervision from imperfect trajectories, thereby overcoming the limitations of compounding errors in supervised fine-tuning and sparse rewards in reinforcement fine-tuning to achieve state-of-the-art performance.

Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang
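
Step-level contrastive alignment of the kind described above usually reduces to an InfoNCE-style objective computed per step rather than per trajectory. A minimal sketch with assumed shapes; SACA's actual loss and how positives/negatives are mined from the auditor may differ:

```python
import numpy as np

def step_infonce(steps, positives, negatives, tau=0.1):
    """Average InfoNCE over per-step embeddings.

    steps, positives: (T, D) matched per step; negatives: (T, K, D).
    All embeddings are L2-normalized before cosine scoring.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    s, p, n = norm(steps), norm(positives), norm(negatives)
    pos = np.einsum('td,td->t', s, p) / tau        # (T,) positive logits
    neg = np.einsum('td,tkd->tk', s, n) / tau      # (T, K) negative logits
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Numerically stable log-sum-exp denominator.
    m = logits.max(axis=1, keepdims=True)
    log_den = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((log_den - pos).mean())           # -log p(positive) per step
```

Because the loss is averaged over steps, every step in a trajectory contributes supervision, which is the dense, step-level signal the summary contrasts with sparse trajectory-level rewards.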

FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

FetalAgents is a novel multi-agent system that dynamically orchestrates specialized vision experts to deliver robust, end-to-end fetal ultrasound analysis and structured clinical reporting across multiple tasks, outperforming existing specialized models and multimodal large language models.

Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

The paper proposes FrameDiT, a novel video generation architecture that introduces Matrix Attention to efficiently model global spatio-temporal dynamics by processing frames as matrices, thereby achieving state-of-the-art video quality and temporal coherence while maintaining computational efficiency comparable to local factorized attention.

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran
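
The "frames as matrices" idea can be sketched by treating each whole frame as one attention token and scoring frame pairs with a Frobenius inner product, so attention cost scales with the number of frames rather than the number of pixels. This toy version is an assumption about the general mechanism, not FrameDiT's actual Matrix Attention:

```python
import numpy as np

def frame_matrix_attention(frames):
    """Attention over T frames where each token is a whole (H, W) frame.

    Similarity between frames i and j is their Frobenius inner product
    (a dot product of the flattened frames); the output is a
    similarity-weighted mix of frames. Illustrative only.
    """
    T, H, W = frames.shape
    flat = frames.reshape(T, H * W)
    scores = flat @ flat.T / np.sqrt(H * W)      # (T, T) frame-to-frame
    scores -= scores.max(axis=1, keepdims=True)  # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ flat).reshape(T, H, W)
```

The T-by-T attention map is what makes the global spatio-temporal mixing cheap relative to full pixel-level attention, which is the efficiency claim the summary attributes to the architecture.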