MOSIV: Multi-Object System Identification from Videos

The paper introduces MOSIV, a novel framework that leverages differentiable simulation and geometry-aligned objectives to identify continuous, per-object material parameters from videos of complex multi-object interactions, outperforming existing methods on a new synthetic benchmark.

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao · 2026-03-09 · cs
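MOSIV's own pipeline is not shown here, but the core idea of system identification through a differentiable simulation can be illustrated generically: roll out a simulator, compare to an observed trajectory, and update a material parameter by gradient descent. The sketch below is an assumption-laden toy (a damped spring with stiffness `k`, gradients via finite differences standing in for a simulator's analytic gradients), not the paper's method.

```python
import numpy as np

def simulate(k, steps=200, dt=0.01, x0=1.0, v0=0.0, c=0.1):
    """Explicit-Euler rollout of a damped spring: x'' = -k*x - c*v."""
    xs = np.empty(steps)
    x, v = x0, v0
    for t in range(steps):
        a = -k * x - c * v
        v += a * dt
        x += v * dt
        xs[t] = x
    return xs

def identify_stiffness(observed, k_init=3.0, lr=2.0, iters=500, eps=1e-4):
    """Fit k by gradient descent on the trajectory MSE; the gradient comes
    from central finite differences, mimicking a differentiable simulator."""
    k = k_init
    for _ in range(iters):
        loss_plus = np.mean((simulate(k + eps) - observed) ** 2)
        loss_minus = np.mean((simulate(k - eps) - observed) ** 2)
        grad = (loss_plus - loss_minus) / (2 * eps)
        k -= lr * grad
    return k

true_k = 4.0
observed = simulate(true_k)   # stands in for a trajectory extracted from video
k_est = identify_stiffness(observed)
```

In a video setting, `observed` would come from tracked object states rather than a ground-truth rollout, and the loss would be a geometry-aligned objective instead of raw MSE.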

StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

StruVis is a generator-agnostic framework that enhances reasoning-based text-to-image generation by letting MLLMs use text-based structured visual representations as intermediate reasoning states, overcoming the limitations of existing text-only and text-image-interleaved approaches and achieving significant gains on benchmarks.

Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu · 2026-03-09 · cs

Ensemble Learning with Sparse Hypercolumns

This paper addresses the computational challenges of using dense hypercolumns for image segmentation by introducing stratified subsampling and ensemble learning, demonstrating that these methods significantly outperform standard UNet baselines, particularly in low-shot scenarios where a simple Logistic Regression classifier achieves the best results.

Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O'Connor · 2026-03-09 · cs
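The stratified subsampling step can be sketched independently of the paper's code: instead of training a per-pixel classifier on every dense hypercolumn, sample a fixed budget of pixels from each class so rare foreground classes are not swamped. The function below is a minimal numpy sketch under that assumption; names are illustrative, not from the paper.

```python
import numpy as np

def stratified_pixel_sample(labels, n_per_class, rng=None):
    """Sample up to n_per_class pixel indices from each class in a label map,
    balancing the training set for a lightweight per-pixel classifier."""
    rng = np.random.default_rng(rng)
    flat = labels.ravel()
    picks = []
    for c in np.unique(flat):
        idx = np.flatnonzero(flat == c)
        take = min(n_per_class, idx.size)
        picks.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(picks)

labels = np.zeros((32, 32), dtype=int)
labels[12:20, 12:20] = 1                 # small foreground region
sel = stratified_pixel_sample(labels, n_per_class=50, rng=0)
```

`hypercolumns[:, sel]` would then feed a simple classifier such as logistic regression; the summary's low-shot result suggests such shallow models benefit most from this balanced sampling.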

FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

This paper introduces FontUse, a data-centric approach that leverages a large-scale, automatically annotated dataset of 70K images with style- and use-case-conditioned prompts to fine-tune text-to-image models, significantly improving their ability to generate typography that accurately reflects requested visual attributes without requiring architectural changes.

Xia Xin, Yuki Endo, Yoshihiro Kanamori · 2026-03-09 · cs

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

This paper proposes GvU, a self-supervised reinforcement learning framework that leverages a unified multimodal model's own understanding branch as an intrinsic reward signal to iteratively improve its text-to-image generation quality, thereby narrowing the capability gap between visual understanding and generation.

Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang · 2026-03-09 · cs

GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

GenHOI is a lightweight augmentation for pretrained video generation models that enhances object-consistent hand-object interaction in in-the-wild scenarios by employing Head-Sliding RoPE for temporally balanced reference injection and a two-level spatial attention gate for selective focus on interaction regions.

Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang · 2026-03-09 · cs

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

The paper introduces Curious-VLA, a two-stage framework that overcomes the exploration limitations of standard driving VLA models by employing Feasible Trajectory Expansion during imitation learning and Adaptive Diversity-Aware Sampling with a Spanning Driving Reward during reinforcement learning, achieving state-of-the-art performance on the Navsim benchmark.

Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, Bingchuan Sun, Yan Wang, Baochang Zhang · 2026-03-09 · cs

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

This paper investigates failure modes in lightweight Vision-Language Models for automated driving by analyzing intermediate activations to reveal that while some visual concepts like object presence are linearly encoded, others like orientation are not, leading to either perceptual or cognitive failures that are further exacerbated by object distance.

Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy · 2026-03-09 · cs.AI
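The probing methodology behind this analysis is standard: fit a linear classifier on intermediate activations and use held-out accuracy as evidence that a concept is (or is not) linearly encoded. The sketch below is a generic version with a ridge-regularised least-squares probe on synthetic activations; the data setup and names are illustrative assumptions, not the paper's.

```python
import numpy as np

def linear_probe_accuracy(acts, labels, rng=None):
    """Fit a ridge-regularised linear probe on half the data and report
    held-out accuracy; high accuracy suggests the concept is linearly encoded."""
    rng = np.random.default_rng(rng)
    n = len(labels)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2 :]
    X = np.hstack([acts, np.ones((n, 1))])      # append a bias column
    y = 2.0 * labels - 1.0                      # map {0,1} -> {-1,+1}
    A = X[tr].T @ X[tr] + 1e-3 * np.eye(X.shape[1])
    w = np.linalg.solve(A, X[tr].T @ y[tr])
    pred = (X[te] @ w > 0).astype(int)
    return (pred == labels[te]).mean()

# Synthetic activations where a binary concept (e.g. object presence) is
# linearly encoded along one direction:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 400)
direction = rng.normal(size=64)
acts = rng.normal(size=(400, 64)) + np.outer(2.0 * labels - 1.0, direction)
acc = linear_probe_accuracy(acts, labels, rng=1)
```

A concept like orientation failing such a probe (near-chance accuracy) is what the summary describes as not being linearly encoded, pointing to a perceptual rather than cognitive failure.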

Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting

This paper presents a novel pipeline that transforms archived omnidirectional RGB-LiDAR logs into robust 3D Gaussian Splatting initialization assets by addressing sensor distortion and data density challenges through ERP-to-cubemap conversion, color-stratified downsampling, and multi-modal registration, thereby enabling the creation of high-fidelity digital twins from standard, underutilized sensor data.

Semin Bae, Hansol Lim, Jongseong Brad Choi · 2026-03-09 · cs
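The ERP-to-cubemap step rests on a standard spherical mapping: every cubemap-face ray direction is converted to longitude/latitude and looked up in the equirectangular image. Below is a minimal numpy sketch of that lookup under an assumed axis convention (+z forward, +x right, +y up); the pipeline's actual convention may differ.

```python
import numpy as np

def dir_to_erp_pixel(d, width, height):
    """Map a 3D ray direction to fractional (col, row) coordinates in an
    equirectangular (ERP) image, with longitude 0 at the image centre."""
    x, y, z = d / np.linalg.norm(d)
    lon = np.arctan2(x, z)                   # [-pi, pi], 0 = straight ahead
    lat = np.arcsin(np.clip(y, -1.0, 1.0))   # [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * width
    row = (0.5 - lat / np.pi) * height
    return col, row

# The centre of the "front" cubemap face looks straight down +z,
# which should land at the centre of the ERP panorama:
col, row = dir_to_erp_pixel(np.array([0.0, 0.0, 1.0]), 2048, 1024)
```

Iterating this over a grid of per-face directions (with interpolation at the fractional coordinates) yields distortion-reduced cube faces suitable for downstream feature extraction and registration.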

FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

FedARKS is a novel federated learning framework for person re-identification that overcomes the limitations of global feature reliance and uniform averaging by introducing Robust Knowledge and Knowledge Selection mechanisms to capture subtle domain-invariant details and prioritize high-quality client contributions for improved domain generalization.

Xin Xu, Binchang Ma, Zhixi Yu, Wei Liu · 2026-03-09 · cs
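The "prioritize high-quality client contributions" idea contrasts with uniform FedAvg, and the aggregation itself can be sketched generically: weight each client's parameters by a softmax over per-client quality scores. This is a toy illustration of score-weighted aggregation, not FedARKS's actual knowledge-selection mechanism, and all names are assumptions.

```python
import numpy as np

def weighted_aggregate(client_params, scores):
    """Aggregate client parameter vectors with softmax weights over quality
    scores, so higher-quality clients contribute more than under uniform
    averaging."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp(scores - scores.max())        # shift for numerical stability
    w /= w.sum()
    stacked = np.stack(client_params)        # (num_clients, num_params)
    return (w[:, None] * stacked).sum(axis=0), w

params = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
agg, w = weighted_aggregate(params, scores=[2.0, 0.0])
```

With uniform scores this reduces exactly to FedAvg; the interesting design question, which the paper's Knowledge Selection mechanism addresses, is how the quality scores themselves are estimated without labels on the server.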