TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

TiPToP is a modular, open-vocabulary robotic planning system that integrates pretrained vision foundation models with a Task and Motion Planner to solve multi-step manipulation tasks from RGB images and natural language instructions, without requiring any robot-specific training data. It achieves performance comparable to or better than fine-tuned vision-language-action models while enabling detailed failure mode analysis.

William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomás Lozano-Pérez | Wed, 11 Ma | cs

Kinodynamic Motion Retargeting for Humanoid Locomotion via Multi-Contact Whole-Body Trajectory Optimization

This paper introduces KDMR, a framework that formulates humanoid motion retargeting as a multi-contact whole-body trajectory optimization problem incorporating rigid-body dynamics and ground reaction forces. The resulting locomotion trajectories are physically consistent and dynamically feasible, and they significantly outperform purely kinematic methods in both motion quality and downstream control policy performance.

Xiaoyu Zhang, Steven Haener, Varun Madabushi, Maegan Tucker | Wed, 11 Ma | cs

Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading

This paper introduces the concept of Whole Slide Difficulty (WSD), derived from diagnostic disagreements between expert and non-expert pathologists, and demonstrates that leveraging this metric through multi-task learning or weighted loss functions significantly improves the accuracy of prostate cancer Gleason grading in Multiple Instance Learning models, particularly for higher-grade cases.

Marie Arrivat, Rémy Peyret, Elsa Angelini, Pietro Gori | Wed, 11 Ma | cs

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

This paper proposes an interpretable text-motion retrieval framework that represents 3D human motion as joint-angle pseudo-images processed by Vision Transformers and aligns them with text via a token-wise late interaction mechanism, thereby overcoming the limitations of global-embedding methods by capturing fine-grained correspondences and improving retrieval accuracy.

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao | Wed, 11 Ma | cs
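
The token-wise late interaction mentioned in the entry above can be illustrated with a minimal sketch. The listing does not give the paper's exact scoring rule, so this assumes a ColBERT-style "MaxSim" score: each text token is matched to its most similar motion patch by cosine similarity, and the per-token maxima are summed. The function name `late_interaction_score` and the toy embeddings are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_score(text_tokens, motion_patches):
    """Assumed MaxSim scoring (not confirmed as the paper's exact method).

    text_tokens:    list of token embedding vectors
    motion_patches: list of patch embedding vectors of the same dimension
    """
    tokens = [_normalize(t) for t in text_tokens]
    patches = [_normalize(p) for p in motion_patches]
    score = 0.0
    for t in tokens:
        # Best-matching patch for this token; the argmax is what makes
        # the correspondence inspectable per token.
        score += max(sum(a * b for a, b in zip(t, p)) for p in patches)
    return score


# Toy 2-D embeddings: each token has one perfectly aligned patch.
text = [[1.0, 0.0], [0.0, 1.0]]
motion = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
score = late_interaction_score(text, motion)
```

Unlike a single global embedding, this kind of score keeps one similarity per token-patch pair, which is what allows fine-grained correspondences to be read off.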

Role Classification of Hosts within Enterprise Networks Based on Connection Patterns

This paper addresses the problem of role classification in enterprise networks by introducing two practical algorithms that group hosts based on evolving connection patterns to simplify network management and enhance monitoring accuracy, demonstrating their effectiveness through commercial implementation and significant reduction in host grouping complexity.

Godfrey Tan, Massimiliano Poletto, John Guttag, Frans Kaashoek | Wed, 11 Ma | cs

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

This paper introduces CourtSI, a large-scale dataset and benchmark for evaluating spatial intelligence in vision-language models within sports scenarios, revealing significant performance gaps in existing models while demonstrating that fine-tuning on this data substantially improves accuracy and generalization.

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong | Wed, 11 Ma | cs

Robust Cooperative Localization in Featureless Environments: A Comparative Study of DCL, StCL, CCL, CI, and Standard-CL

This paper presents a comparative study of five cooperative localization algorithms in featureless, GPS-denied environments, revealing that while Sequential and Standard methods offer high accuracy at the cost of filter inconsistency, Covariance Intersection provides the most balanced trade-off between accuracy and robustness for safety-critical applications.

Nivand Khosravi, Meysam Basiri, Rodrigo Ventura | Wed, 11 Ma | cs
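
For readers unfamiliar with Covariance Intersection (CI), named in the entry above, the following is a minimal sketch of the standard CI fusion rule for two estimates whose cross-correlation is unknown. It is not taken from the paper: the function name, the fixed weight `omega`, and the toy inputs are assumptions for illustration, and in practice `omega` is usually chosen by optimization (for example, minimizing the trace of the fused covariance).

```python
import numpy as np


def covariance_intersection(x_a, P_a, x_b, P_b, omega):
    """Fuse two estimates (x_a, P_a) and (x_b, P_b) with weight omega in [0, 1].

    Standard CI rule:
        P^-1     = omega * P_a^-1 + (1 - omega) * P_b^-1
        P^-1 * x = omega * P_a^-1 x_a + (1 - omega) * P_b^-1 x_b
    """
    info_a = np.linalg.inv(P_a)
    info_b = np.linalg.inv(P_b)
    P = np.linalg.inv(omega * info_a + (1.0 - omega) * info_b)
    x = P @ (omega * info_a @ x_a + (1.0 - omega) * info_b @ x_b)
    return x, P


# Toy example (hypothetical values): two equally uncertain position estimates.
x_a, P_a = np.array([0.0, 0.0]), np.eye(2)
x_b, P_b = np.array([2.0, 2.0]), np.eye(2)
x_fused, P_fused = covariance_intersection(x_a, P_a, x_b, P_b, omega=0.5)
```

With equal weights and identical covariances, the fused covariance does not shrink below the inputs. That conservatism is why CI is generally regarded as a consistent choice when inter-robot correlations are not tracked.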

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction videos by utilizing sparse motion guidance (wrist coordinates and object bounding boxes), an object-stressed attention mechanism, and a multi-task auxiliary training strategy to overcome limitations in flexibility, generalization, and data scarcity.

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang | Wed, 11 Ma | cs

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

The paper introduces InternVL-U, a lightweight 4B-parameter unified multimodal model that democratizes advanced understanding, reasoning, generation, and editing capabilities by employing a modular architecture and a reasoning-centric data synthesis pipeline. It achieves a superior performance-efficiency balance, outperforming significantly larger baselines such as BAGEL.

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang | Wed, 11 Ma | cs

The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation

This paper proposes a Capability Coherence System (CCS) that maps memory consistency models to identity management. Through simulation, it demonstrates that a Release Consistency-directed revocation strategy (RCC) achieves a constant bound on unauthorized operations independent of agent velocity, outperforming traditional time-bounded approaches by orders of magnitude in high-speed agentic environments.

Vladyslav Parakhin | Wed, 11 Ma | cs

MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities

This paper introduces MissBench, a benchmark and framework for multimodal affective computing that addresses the gap in evaluating models under realistic, imbalanced missing modality conditions by standardizing protocols and proposing new diagnostic metrics (MEI and MLI) to reveal hidden modality inequities and optimization imbalances.

Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen | Wed, 11 Ma | cs