ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

This paper presents ImagiDrive, a unified end-to-end autonomous driving framework that synergistically integrates a Vision-Language Model-based driving agent with a Driving World Model-based scene imaginer to iteratively refine planning decisions through a closed-loop imagination-and-planning process, demonstrating superior robustness and performance on nuScenes and NAVSIM datasets.

Jingyu Li, Bozhou Zhang, Xin Jin + 3 more · 2026-03-03 · 💻 cs

CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

This paper introduces CineTrans, a novel framework that leverages a newly constructed Cine250K dataset and a training-free, mask-based control mechanism derived from attention map analysis to generate coherent, cinematic multi-shot videos with stable, film-style transitions, significantly outperforming existing baselines in transition control and temporal consistency.

Xiaoxue Wu, Bingjie Gao, Yu Qiao + 2 more · 2026-03-03 · 💻 cs

MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

This paper introduces MOON, the first generative Multimodal Large Language Model designed for e-commerce product understanding, which leverages guided Mixture-of-Experts, semantic region detection, and specialized negative sampling to overcome existing alignment and noise challenges while establishing a new large-scale benchmark for evaluation.

Daoze Zhang, Chenghan Fu, Zhanheng Nie + 7 more · 2026-03-03 · 🤖 cs.AI

Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

This paper proposes a disentangled multi-modal learning framework that addresses heterogeneity, multi-scale integration, and data dependency challenges in cancer characterization by decomposing histology and transcriptomics into tumor and microenvironment subspaces, aligning signals across magnifications, enabling transcriptome-agnostic inference, and aggregating informative tokens to outperform state-of-the-art methods in diagnosis, prognosis, and survival prediction.

Yupei Zhang, Xiaofei Wang, Anran Liu + 2 more · 2026-03-03 · ⚡ eess

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

This paper proposes TADSR, a time-aware one-step diffusion network that enhances real-world image super-resolution by introducing a time-aware VAE encoder and a time-aware VSD loss to fully leverage the generative priors of pre-trained stable diffusion models across different timesteps, achieving state-of-the-art performance with controllable fidelity-realism trade-offs in a single step.

Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang + 4 more · 2026-03-03 · ⚡ eess

RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

The paper introduces RTGMFF, a novel multimodal framework that enhances fMRI-based brain disorder diagnosis by integrating deterministic ROI-driven text generation with a hybrid frequency-spatial encoder and adaptive semantic alignment to overcome signal noise and inter-subject variability, achieving superior performance on ADHD-200 and ABIDE benchmarks.

Junhao Jia, Yifei Sun, Yunyou Liu + 5 more · 2026-03-03 · 💻 cs

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

This paper introduces T2I-CoReBench, a comprehensive benchmark featuring 1,080 complex prompts and a 12-dimensional taxonomy to rigorously evaluate text-to-image models' composition and reasoning capabilities, revealing that models struggle with high-density composition and that implicit reasoning remains an even more critical bottleneck.

Ouxiang Li, Yuan Wang, Xinting Hu + 7 more2026-03-03💻 cs

UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

UniView addresses the ill-posed nature of single-image novel view synthesis by leveraging a multimodal large language model to retrieve similar reference images and integrating their features through a plug-and-play adapter with a decoupled triple attention mechanism, thereby significantly reducing distortions and outperforming state-of-the-art methods.

Haowang Cui, Rui Chen, Jiaze Wang + 2 more · 2026-03-03 · 💻 cs

Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

This paper presents an improved 3D scene stylization framework that leverages text-guided generative image editing with a reference-based attention mechanism and multi-depth view generation to ensure high-quality, view-consistent results, while introducing a novel region-controlled loss function for applying distinct styles to specific semantic areas within a scene.

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada · 2026-03-03 · 💻 cs