HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

This paper introduces HSSBench, a comprehensive multilingual benchmark of over 13,000 samples generated through a novel expert-agent collaboration pipeline, designed to evaluate Multimodal Large Language Models on the interdisciplinary and abstract reasoning tasks characteristic of the Humanities and Social Sciences, where current models remain limited.

Zhaolu Kang, Junhao Gong, Jiaxu Yan + 15 more · 2026-03-04 · cs.AI

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Perception-R1 addresses the limitation of existing RLVR methods in enhancing multimodal perception by introducing a novel visual perception reward derived from Chain-of-Thought annotations, which effectively boosts both perception and reasoning capabilities of Multimodal Large Language Models to achieve state-of-the-art performance with minimal training data.

Tong Xiao, Xin Xu, Zhenya Huang + 4 more · 2026-03-04 · cs.AI

StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

StreamSplat is a fully feed-forward framework that enables real-time, online reconstruction of dynamic 3D scenes from uncalibrated video streams into 3D Gaussian Splatting representations, achieving state-of-the-art quality with a 1200x speedup over traditional optimization-based methods through probabilistic sampling, bidirectional deformation, and adaptive Gaussian fusion.

Zike Wu, Qi Yan, Xuanyu Yi + 2 more · 2026-03-04 · cs.LG

SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction

SceneStreamer is a unified autoregressive transformer framework that generates continuous, long-horizon traffic scenarios by predicting token sequences representing dynamic elements such as agents and traffic signals, producing realistic, diverse, and adaptive environments that improve the robustness and generalization of autonomous driving policies.

Zhenghao Peng, Yuxin Liu, Bolei Zhou · 2026-03-04 · cs
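The "next token group prediction" pattern behind SceneStreamer can be illustrated with a toy autoregressive loop: at each step a model emits a group of tokens (one per scene element) conditioned on everything generated so far, and the group is appended to the context. This is a minimal generic sketch, not the paper's transformer; `toy_model`, the vocabulary size, and the group size are all hypothetical stand-ins.

```python
import numpy as np

# Toy autoregressive next-token-group decoding. Illustrative only:
# toy_model stands in for a trained transformer over scenario tokens.
rng = np.random.default_rng(1)
VOCAB, GROUP = 16, 3                 # token vocabulary, tokens per group


def toy_model(history):
    """Hypothetical stand-in: returns logits for each slot in the group,
    with a fake dependence on the length of the generated history."""
    bias = len(history) % VOCAB
    return rng.normal(0.0, 1.0, (GROUP, VOCAB)) + 2.0 * np.eye(VOCAB)[bias]


history = []
for step in range(5):                # generate 5 groups autoregressively
    logits = toy_model(history)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    group = [int(rng.choice(VOCAB, p=p)) for p in probs]
    history.extend(group)            # each group feeds back into the context
```

The key design point the summary describes is that one decoding step advances every dynamic element at once (a token *group*), rather than one element at a time, which keeps long-horizon generation a single streaming loop.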

MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

This paper proposes MC-INR, a novel framework that leverages meta-learning, dynamic error-based re-clustering, and a branched architecture to efficiently encode complex multivariate scientific simulation data on unstructured grids, overcoming the inflexibility and single-variable limitations of existing Implicit Neural Representation methods.

Hyunsoo Son, Jeonghyun Noh, Suemin Jeon + 2 more · 2026-03-04 · cs.LG

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

InstructVLA introduces a novel Vision-Language-Action Instruction Tuning paradigm that successfully bridges flexible multimodal reasoning and precise manipulation by jointly optimizing embodied reasoning and action generation, thereby achieving state-of-the-art performance in both simulated and real-world robotic tasks without sacrificing pre-trained capabilities.

Shuai Yang, Hao Li, Bin Wang + 7 more · 2026-03-04 · cs

Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

This paper systematically evaluates Vision-Language Models' capabilities in autonomous driving lane topology awareness through a new BEV-based diagnostic framework, revealing that while performance correlates with model size and reasoning depth, current models—including frontier closed-source systems—still struggle with fundamental spatial reasoning tasks essential for safe navigation.

Xin Chen, Jia He, Maozheng Li + 5 more · 2026-03-04 · cs