Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

This paper proposes a disentangled multi-modal learning framework that addresses heterogeneity, multi-scale integration, and data dependency challenges in cancer characterization by decomposing histology and transcriptomics into tumor and microenvironment subspaces, aligning signals across magnifications, enabling transcriptome-agnostic inference, and aggregating informative tokens to outperform state-of-the-art methods in diagnosis, prognosis, and survival prediction.

Yupei Zhang, Xiaofei Wang, Anran Liu + 2 more · 2026-03-03 · eess

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

This paper proposes TADSR, a time-aware one-step diffusion network that enhances real-world image super-resolution by introducing a time-aware VAE encoder and a time-aware VSD loss to fully leverage the generative priors of pre-trained stable diffusion models across different timesteps, achieving state-of-the-art performance with controllable fidelity-realism trade-offs in a single step.

Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang + 4 more · 2026-03-03 · eess

RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

The paper introduces RTGMFF, a novel multimodal framework that enhances fMRI-based brain disorder diagnosis by integrating deterministic ROI-driven text generation with a hybrid frequency-spatial encoder and adaptive semantic alignment to overcome signal noise and inter-subject variability, achieving superior performance on ADHD-200 and ABIDE benchmarks.

Junhao Jia, Yifei Sun, Yunyou Liu + 5 more · 2026-03-03 · cs

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

This paper introduces T2I-CoReBench, a comprehensive benchmark featuring 1,080 complex prompts and a 12-dimensional taxonomy to rigorously evaluate text-to-image models' composition and reasoning capabilities, revealing that although models struggle with high-density composition, implicit reasoning remains the more critical bottleneck.

Ouxiang Li, Yuan Wang, Xinting Hu + 7 more · 2026-03-03 · cs

UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

UniView addresses the ill-posed nature of single-image novel view synthesis by leveraging a multimodal large language model to retrieve similar reference images and integrating their features through a plug-and-play adapter with a decoupled triple attention mechanism, thereby significantly reducing distortions and outperforming state-of-the-art methods.

Haowang Cui, Rui Chen, Jiaze Wang + 2 more · 2026-03-03 · cs

Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

This paper presents an improved 3D scene stylization framework that leverages text-guided generative image editing with a reference-based attention mechanism and multi-depth view generation to ensure high-quality, view-consistent results, while introducing a novel region-controlled loss function for applying distinct styles to specific semantic areas within a scene.

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada · 2026-03-03 · cs

Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework that reveals current models' strengths in context-based interpretation but highlights significant gaps in understanding subjective perception compared to humans.

Daiqing Wu, Dongbao Yang, Sicheng Zhao + 2 more · 2026-03-03 · cs