UniCoR: Modality Collaboration for Robust Cross-Language Hybrid Code Retrieval

UniCoR is a novel self-supervised framework that addresses the challenges of insufficient semantic understanding, inefficient modality fusion, and weak cross-language generalization in hybrid code retrieval by employing multi-perspective supervised contrastive learning and representation distribution consistency, thereby achieving state-of-the-art performance on both empirical and large-scale benchmarks.

Yang Yang, Li Kuang, Jiakun Liu, Zhongxin Liu, Yingjie Xia, David Lo · 2026-03-09 · cs

Towards Scalable Pre-training of Visual Tokenizers for Generation

This paper introduces VTP, a unified pre-training framework that optimizes visual tokenizers through joint image-text contrastive, self-supervised, and reconstruction losses to shift the latent space focus from low-level pixel accuracy to high-level semantics, thereby solving the "pre-training scaling problem" and enabling significantly improved, compute-efficient generative performance.

Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang · 2026-03-09 · cs

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

This paper introduces Spatial4D-Bench, a large-scale, multi-task benchmark comprising approximately 40,000 question-answer pairs across 18 tasks and six cognitive categories, designed to comprehensively evaluate and reveal the current limitations of Multimodal Large Language Models in achieving human-level 4D spatial intelligence.

Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu · 2026-03-09 · cs

VISO: Robust Underwater Visual-Inertial-Sonar SLAM with Photometric Rendering for Dense 3D Reconstruction

This paper presents VISO, a robust underwater SLAM system that fuses stereo cameras, an IMU, and 3D sonar with novel calibration and photometric rendering techniques to achieve accurate 6-DoF localization and real-time, high-fidelity dense 3D reconstruction in challenging aquatic environments.

Shu Pan, Simon Archieri, Ahmet Cinar, Jonatan Scharff Willners, Ignacio Carlucho, Yvan Petillot · 2026-03-09 · cs

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

This paper introduces SRA 2, a lightweight intrinsic guidance framework that accelerates diffusion transformer training and improves generation quality by aligning intermediate latent features with pre-trained VAE features via a simple projection layer, eliminating the need for external encoders or dual-model setups while incurring minimal computational overhead.

Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang · 2026-03-09 · cs

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

The paper introduces SpatialReward, a reward model that leverages explicit spatial reasoning to overcome the "Attention Collapse" limitation in existing evaluators, thereby providing fine-grained, accurate signals that significantly enhance online reinforcement learning performance for image editing tasks.

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang · 2026-03-09 · cs

APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

The paper presents APEX, a deep reinforcement learning framework that enables a 29-DoF Unitree G1 humanoid robot to autonomously traverse platforms up to 114% of its leg length by composing perceptive climbing, walking, and reconfiguration skills, trained with a novel ratchet progress reward and robust sim-to-real perception strategies.

Yikai Wang, Tingxuan Leng, Changyi Lin, Shiqi Liu, Shir Simon, Bingqing Chen, Jonathan Francis, Ding Zhao · 2026-03-09 · cs

Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

This paper proposes RL-Co, a reinforcement learning-based sim-real co-training framework that combines supervised fine-tuning on mixed real and simulated data with interactive simulation fine-tuning anchored by real-world data, achieving significant improvements in real-world success rates, generalization, and data efficiency for Vision-Language-Action models.

Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zang, Weinan Zhang, Chao Yu, Yu Wang · 2026-03-09 · cs