ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

This paper introduces ENIGMA-360, a publicly released, temporally synchronized ego-exo dataset containing 360 annotated procedural videos from real industrial scenarios to advance human behavior understanding and establish baselines for tasks like action segmentation and interaction detection.

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D'Ambra, Antonino Furnari, Giovanni Maria Farinella · Wed, 11 Ma · cs

LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

This paper introduces LAP, a novel procedure planning model that leverages a fine-tuned Vision Language Model to convert visual observations into distinctive text embeddings for a diffusion-based planner, achieving state-of-the-art performance on multiple benchmarks by effectively resolving visual ambiguities through language.

Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry · Wed, 11 Ma · cs
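
A rough illustration of the interface described above, not the authors' implementation: the module names, dimensions, and concatenation-based conditioning are all assumptions. A VLM-style encoder maps visual observations into text-space embeddings, which then condition a toy diffusion denoiser over an action plan.

```python
# Hypothetical sketch of LAP's high-level pipeline: vision -> text embedding
# -> conditioned diffusion planner. Shapes and modules are illustrative.
import torch
import torch.nn as nn

class VisualToTextEncoder(nn.Module):
    """Stand-in for a fine-tuned VLM that emits a text-space embedding."""
    def __init__(self, img_dim=512, txt_dim=256):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)

    def forward(self, visual_feat):
        return self.proj(visual_feat)  # (B, txt_dim)

class DiffusionPlanner(nn.Module):
    """Toy denoiser over an action-step sequence, conditioned on the
    start/goal text embeddings concatenated to every step."""
    def __init__(self, act_dim=32, txt_dim=256, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.denoise = nn.Sequential(
            nn.Linear(act_dim + 2 * txt_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, noisy_plan, start_emb, goal_emb):
        # noisy_plan: (B, horizon, act_dim); predict the noise to remove.
        cond = torch.cat([start_emb, goal_emb], dim=-1)        # (B, 2*txt_dim)
        cond = cond.unsqueeze(1).expand(-1, self.horizon, -1)  # (B, H, ...)
        return self.denoise(torch.cat([noisy_plan, cond], dim=-1))

encoder, planner = VisualToTextEncoder(), DiffusionPlanner()
start_obs, goal_obs = torch.randn(2, 512), torch.randn(2, 512)
noise_pred = planner(torch.randn(2, 4, 32), encoder(start_obs), encoder(goal_obs))
print(noise_pred.shape)  # torch.Size([2, 4, 32])
```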

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

This paper introduces PanoAffordanceNet, a novel framework and the first high-quality dataset (360-AGD) designed to enable holistic affordance grounding in 360-degree indoor environments by addressing challenges like geometric distortion and semantic dispersion through distortion-aware calibration and multi-level constraints.

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang · Wed, 11 Ma · eess

Ego: Embedding-Guided Personalization of Vision-Language Models

The paper proposes "Ego," an efficient personalization method for vision-language models that extracts visual tokens representing target concepts via internal attention mechanisms to serve as memory, enabling strong performance across single-concept, multi-concept, and video personalization tasks without requiring additional training stages or external modules.

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi · Wed, 11 Ma · cs.AI
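
A minimal sketch of the token-selection idea in the summary, under the assumption that per-token attention mass toward the concept's text query is already available; the function name and shapes are illustrative, not the paper's API.

```python
# Keep the k visual tokens that receive the most attention from the target
# concept's query; they become a reusable "memory" for later prompts.
import torch

def select_concept_tokens(visual_tokens, attn_to_concept, k=8):
    """visual_tokens: (N, D) patch tokens; attn_to_concept: (N,) attention
    mass each token receives from the concept query. Returns the top-k."""
    topk = torch.topk(attn_to_concept, k=k).indices
    return visual_tokens[topk]  # (k, D)

tokens = torch.randn(196, 768)   # e.g. a 14x14 ViT patch grid
attn = torch.rand(196)           # hypothetical attention scores
memory = select_concept_tokens(tokens, attn)
print(memory.shape)              # torch.Size([8, 768])
```

Because selection reuses the model's own attention, no extra training stage or external module is required, which matches the efficiency claim above.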

What is Missing? Explaining Neurons Activated by Absent Concepts

This paper identifies that deep neural networks frequently encode the absence of concepts to drive neuron activation—a phenomenon largely overlooked by standard explainable AI methods—and proposes simple extensions to attribution and feature visualization techniques to effectively reveal and leverage these "missing" concepts for better model interpretation and debiasing.

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth · Wed, 11 Ma · cs.LG
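
One simple probe for absence-driven neurons, offered as an assumed illustration rather than the paper's exact extensions: compare a neuron's mean activation on inputs that contain a concept against inputs that lack it; a positive gap flags a neuron that fires when the concept is missing.

```python
# Score how strongly one neuron responds to the ABSENCE of a concept.
import torch

def absence_score(activations, has_concept):
    """activations: (N,) one neuron's activations over N images.
    has_concept: (N,) bool mask marking images that contain the concept.
    Positive score => the neuron is more active when the concept is absent."""
    present = activations[has_concept].mean()
    absent = activations[~has_concept].mean()
    return (absent - present).item()

acts = torch.rand(100)
mask = torch.rand(100) > 0.5  # hypothetical concept annotations
print(absence_score(acts, mask))
```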

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This paper introduces Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³), a novel task addressed by the Dual-Clue enhanced Prototype Growing Network (DCPGN), which utilizes a Multi-Label Prototype Growing Module and a Dual-Clue Consistency Module to effectively bridge the inter-view gap and adapt models online without target-view training data.

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li · Wed, 11 Ma · cs
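
A minimal sketch of an online prototype-growing memory consistent with the summary; the cosine-similarity threshold, EMA momentum, and growth rule are hypothetical stand-ins for DCPGN's actual modules.

```python
# Online test-time memory: match each incoming feature to its nearest
# prototype; a good match updates that prototype by moving average,
# otherwise a new prototype is grown.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, dim=128, sim_thresh=0.8, momentum=0.9):
        self.protos = torch.empty(0, dim)
        self.sim_thresh, self.momentum = sim_thresh, momentum

    def update(self, feat):
        feat = F.normalize(feat, dim=0)
        if len(self.protos) > 0:
            sims = self.protos @ feat           # cosine sims (unit vectors)
            best = sims.argmax()
            if sims[best] >= self.sim_thresh:   # close enough: refine it
                blended = self.momentum * self.protos[best] + (1 - self.momentum) * feat
                self.protos[best] = F.normalize(blended, dim=0)
                return int(best)
        self.protos = torch.cat([self.protos, feat[None]])  # grow the bank
        return len(self.protos) - 1

bank = PrototypeBank()
for _ in range(10):
    bank.update(torch.randn(128))
print(len(bank.protos))  # number of prototypes grown so far
```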

RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

This paper introduces a new fine-grained Audio-Visual Learning task called Region-Aware Sound Source Understanding (RA-SSU), supported by two novel datasets (f-Music and f-Lifescene) and a state-of-the-art model named SSUFormer, which utilizes specialized modules to achieve precise sound source segmentation and detailed frame-level textual descriptions.

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun · Wed, 11 Ma · cs

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

ConfCtrl is a confidence-aware video interpolation framework that enables precise camera control in video diffusion for novel view synthesis by combining confidence-weighted point cloud projections with a Kalman-inspired predict-update mechanism to balance pose guidance and geometric consistency while reconstructing unseen regions.

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada · Wed, 11 Ma · cs
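
The predict-update mechanism can be pictured as a small Kalman-style blend; this is an interpretation of the summary with assumed variances, not the released code. Confident point-cloud projections pull the result toward the observed pixels, while low-confidence (unseen) regions fall back to the prior prediction for the diffusion model to fill.

```python
# Kalman-like correction: observation noise shrinks as projection
# confidence grows, so confident projected pixels dominate the blend.
import numpy as np

def predict_update(pred, obs, conf, pred_var=1.0):
    """pred, obs: same-shape per-pixel values (latents or colors).
    conf in [0, 1]: per-pixel projection confidence."""
    obs_var = (1.0 - conf) + 1e-6          # confident obs -> low noise
    gain = pred_var / (pred_var + obs_var)  # Kalman-style gain
    return pred + gain * (obs - pred)

pred = np.zeros((4, 4, 3))            # prior (pose-guided) prediction
obs = np.ones((4, 4, 3))              # confidence-weighted projection
conf = np.full((4, 4, 1), 0.7)
print(predict_update(pred, obs, conf).mean())  # ~0.77: leans on confident obs
```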

BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling

BrainSTR is a spatio-temporal contrastive learning framework that enhances the interpretability of dynamic brain network modeling for neuropsychiatric diagnosis by adaptively partitioning brain states, identifying critical phases, and extracting sparse, disease-specific connectivity patterns to construct a discriminative semantic space validated across ASD, BD, and MDD datasets.

Guiliang Guo, Guangqi Wen, Lingwen Liu, Ruoxian Song, Peng Cao, Jinzhu Yang, Fei Wang, Xiaoli Liu, Osmar R. Zaiane · Wed, 11 Ma · cs

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

This paper introduces VLM-Loc, a framework that leverages large vision-language models to achieve precise text-to-point-cloud localization by transforming 3D maps into bird's-eye-view images and scene graphs for enhanced spatial reasoning, alongside the release of the CityLoc benchmark for systematic evaluation.

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu · Wed, 11 Ma · cs
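
A minimal sketch, with assumed ranges and resolution, of the bird's-eye-view step the summary mentions: rasterizing a 3D point cloud into a top-down max-height image that a vision-language model can consume alongside a scene graph.

```python
# Rasterize a point cloud into a BEV max-height grid.
import numpy as np

def pointcloud_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), res=0.5):
    """points: (N, 3) array of x, y, z in meters. Returns a 2D height map."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.full((h, w), -np.inf)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    np.maximum.at(bev, (xi[ok], yi[ok]), points[ok, 2])  # keep max z per cell
    bev[np.isinf(bev)] = 0.0                              # empty cells -> 0
    return bev

pts = np.random.uniform(-50, 50, size=(10000, 3))
print(pointcloud_to_bev(pts).shape)  # (200, 200)
```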

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

This paper introduces MA-EgoQA, a novel benchmark and dataset featuring 1,700 questions across five categories designed to evaluate the ability of AI models to understand and reason over multiple long-horizon egocentric videos from embodied agents, alongside a proposed baseline model named EgoMAS that highlights current limitations in system-level multi-agent understanding.

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang · Wed, 11 Ma · cs.AI

CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy

CycleULM is a novel, label-free deep learning framework that leverages CycleGAN to bridge the simulation-to-reality gap in ultrasound localisation microscopy, significantly enhancing microbubble localisation accuracy, image resolution, and processing speed for real-time clinical application without requiring paired ground truth data.

Su Yan, Clara Rodrigo Gonzalez, Vincent C. H. Leung, Herman Verinaz-Jadan, Jiakang Chen, Matthieu Toulemonde, Kai Riemer, Jipeng Yan, Clotilde Vié, Qingyuan Tan, Peter D. Weinberg, Pier Luigi Dragotti, Kevin G. Murphy, Meng-Xing Tang · Wed, 11 Ma · eess
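
The CycleGAN ingredient reduces to the standard cycle-consistency objective, sketched below with toy convolutional generators that are placeholders for the paper's networks; the adversarial terms are omitted for brevity.

```python
# Cycle-consistency: translate simulated ultrasound frames to the real
# domain and back, and penalize the reconstruction error (and vice versa).
import torch
import torch.nn as nn

G_sim2real = nn.Conv2d(1, 1, 3, padding=1)  # toy generator: sim -> real
G_real2sim = nn.Conv2d(1, 1, 3, padding=1)  # toy generator: real -> sim

def cycle_loss(sim_batch, real_batch, lam=10.0):
    sim_cycle = G_real2sim(G_sim2real(sim_batch))    # sim -> real -> sim
    real_cycle = G_sim2real(G_real2sim(real_batch))  # real -> sim -> real
    return lam * (torch.mean(torch.abs(sim_cycle - sim_batch))
                  + torch.mean(torch.abs(real_cycle - real_batch)))

sim = torch.randn(2, 1, 64, 64)
real = torch.randn(2, 1, 64, 64)
print(cycle_loss(sim, real).item())
```

The point of the cycle constraint is exactly what makes the framework label-free: no paired sim/real ground truth is ever needed.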

MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities

This paper introduces MissBench, a benchmark and framework for multimodal affective computing that addresses the gap in evaluating models under realistic, imbalanced missing modality conditions by standardizing protocols and proposing new diagnostic metrics (MEI and MLI) to reveal hidden modality inequities and optimization imbalances.

Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen · Wed, 11 Ma · cs
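
A minimal sketch, with hypothetical per-modality rates, of what an imbalanced missing-modality protocol looks like: each modality is dropped with its own probability instead of a shared one, which is what exposes the hidden modality inequities the summary refers to.

```python
# Sample per-example presence masks under imbalanced missing rates.
import random

def sample_missing_mask(rates, rng=random.Random(0)):
    """rates: per-modality missing probability, e.g. text rarely missing
    but audio often missing. Guarantees at least one modality stays."""
    mask = {m: rng.random() >= p for m, p in rates.items()}
    if not any(mask.values()):
        mask[min(rates, key=rates.get)] = True  # keep most reliable one
    return mask

rates = {"text": 0.1, "audio": 0.6, "video": 0.4}  # imbalanced, illustrative
print([sample_missing_mask(rates) for _ in range(3)])
```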

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

The paper introduces InternVL-U, a lightweight 4B-parameter unified multimodal model that democratizes advanced understanding, reasoning, generation, and editing capabilities by employing a modular architecture and a reasoning-centric data synthesis pipeline, achieving a superior performance-efficiency trade-off and outperforming significantly larger baselines such as BAGEL.

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang · Wed, 11 Ma · cs

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction videos by utilizing sparse motion guidance (wrist coordinates and object bounding boxes), an object-stressed attention mechanism, and a multi-task auxiliary training strategy to overcome limitations in flexibility, generalization, and data scarcity.

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang · Wed, 11 Ma · cs

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

This paper introduces CourtSI, a large-scale dataset and benchmark for evaluating spatial intelligence in vision-language models within sports scenarios, revealing significant performance gaps in existing models while demonstrating that fine-tuning on this data substantially improves accuracy and generalization.

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong · Wed, 11 Ma · cs