SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

The paper introduces SciTune, a framework that aligns large language models with human-curated scientific multimodal instructions, resulting in a model (LLaMA-SciTune) that significantly outperforms state-of-the-art systems on scientific visual and language benchmarks, even surpassing human performance in certain categories.

Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge, Karl Pazdernik · 2026-04-14 · 💬 cs.CL

HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

This paper proposes HFI, a training-free and efficient framework that detects latent diffusion model-generated images and performs implicit watermarking by measuring aliasing artifacts in reconstructed images, thereby overcoming the limitations of existing methods that rely on reconstruction distance overfitted to background information.

Sungik Choi, Hankook Lee, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee · 2026-04-14 · 🤖 cs.LG
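To illustrate the general idea of scoring an image by how much high-frequency content a reconstruction round trip destroys, the sketch below uses a down/up-sampling round trip as a stand-in for an LDM autoencoder. Both the stand-in reconstruction and the spectral measure are illustrative assumptions, not HFI's actual aliasing metric:

```python
import numpy as np

def highfreq_energy(img, cutoff=0.25):
    """Fraction of spectral energy above a normalized frequency cutoff."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    energy = np.abs(F) ** 2
    return energy[radius > cutoff].sum() / energy.sum()

def toy_reconstruct(img, factor=2):
    """Stand-in for an LDM autoencoder round trip: block-average down, then
    repeat back up, which attenuates high frequencies."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def aliasing_score(img):
    """Higher when reconstruction loses more high-frequency content."""
    return highfreq_energy(img) - highfreq_energy(toy_reconstruct(img))

rng = np.random.default_rng(0)
noisy = rng.standard_normal((64, 64))   # rich in high frequencies
smooth = toy_reconstruct(noisy)         # already band-limited
print(aliasing_score(noisy) > aliasing_score(smooth))
```

The intuition being tested: a "real" high-frequency-rich image loses spectral content under reconstruction, while an already band-limited one does not.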

RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

The paper introduces RobustSpring, a comprehensive benchmark and dataset that evaluates the robustness of optical flow, scene flow, and stereo vision models against 20 types of image corruptions, addressing the gap in existing benchmarks that primarily focus on accuracy rather than resilience to real-world perturbations.

Victor Oei, Jenny Schmalfuss, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andrés Bruhn · 2026-04-14 · 🤖 cs.LG
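The general recipe of corruption benchmarks is to apply a named corruption at a severity level before evaluating a model. A minimal sketch of that harness, where the two corruption functions and the severity scaling are placeholders rather than RobustSpring's actual 20 corruptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder corruptions at severities 1-5, in the style of robustness
# benchmarks; RobustSpring's corruption types and scaling differ.
def gaussian_noise(img, severity):
    return np.clip(img + rng.normal(0.0, 0.05 * severity, img.shape), 0.0, 1.0)

def brightness(img, severity):
    return np.clip(img + 0.1 * severity, 0.0, 1.0)

CORRUPTIONS = {"gaussian_noise": gaussian_noise, "brightness": brightness}

def corrupt(img, name, severity):
    """Apply a named corruption at an integer severity in [1, 5]."""
    assert 1 <= severity <= 5, "severity must be 1..5"
    return CORRUPTIONS[name](img, severity)

img = rng.random((32, 32))  # toy grayscale frame in [0, 1]
for name in CORRUPTIONS:
    out = corrupt(img, name, severity=3)
    print(name, round(float(np.abs(out - img).mean()), 3))
```

A robustness benchmark then reports each model's accuracy drop per corruption and severity relative to the clean input.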

IMPLICITSTAINER: Resolution Agnostic Data-Efficient Virtual Staining Using Neural Implicit Functions

The paper introduces IMPLICITSTAINER, a deterministic, resolution-agnostic deep learning framework that utilizes neural implicit functions to efficiently generate high-fidelity, reproducible virtual immunohistochemical stains from H&E images, overcoming the limitations of existing patch-based and stochastic methods in clinical applications.

Tushar Kataria, Beatrice Knudsen, Shireen Y. Elhabian · 2026-04-14 · ⚡ eess
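A neural implicit function represents an image as a network queried at continuous coordinates, which is what makes the approach resolution-agnostic: the same function can be sampled on any output grid. A minimal untrained sketch (the tiny architecture and the three-channel "stain" output are illustrative assumptions; a real model would be trained on paired H&E/IHC patches):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny coordinate MLP: (x, y) in [0, 1]^2 -> 3 stain channels.
# Random weights only, for illustration.
W1, b1 = rng.standard_normal((2, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 3)), np.zeros(3)

def implicit_stain(coords):
    """Evaluate the implicit function at arbitrary continuous coordinates."""
    h = np.tanh(coords @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def render(height, width):
    """Resolution-agnostic rendering: sample the same function on any grid."""
    ys, xs = np.linspace(0, 1, height), np.linspace(0, 1, width)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    return implicit_stain(grid).reshape(height, width, 3)

low, high = render(16, 16), render(128, 128)
print(low.shape, high.shape)  # one function, two output resolutions
```

Because the function is continuous, no patch stitching is needed and outputs at different resolutions are mutually consistent.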

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

This paper introduces DeceptionDecoded, a large-scale benchmark and intent-guided simulation framework designed to evaluate and improve vision-language models' ability to detect misleading creator intent in multimodal news, addressing their current reliance on superficial cues and enhancing their robustness in misinformation governance.

Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi · 2026-04-14 · 💬 cs.CL

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

GoT-R1 is a novel framework that leverages reinforcement learning with a dual-stage multi-dimensional reward system to enhance the semantic-spatial reasoning capabilities of multimodal large language models, significantly improving their ability to generate images from complex prompts involving precise object relationships and attributes.

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu · 2026-04-14 · 💬 cs.CL

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit introduces a unified discrete diffusion transformer that leverages pretrained visual priors and a lightweight text decoder to achieve fast, parallel, and high-quality generation across both text and image modalities, outperforming larger autoregressive models in efficiency and quality.

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan · 2026-04-14 · 🤖 cs.LG
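Masked discrete diffusion models of this kind typically generate by iteratively unmasking the most confident token positions in parallel, rather than decoding left to right, which is where the speed advantage over autoregressive models comes from. A toy sketch of that loop, with a random stand-in for the transformer's per-position predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1

def predict_tokens(tokens):
    """Hypothetical stand-in for the diffusion transformer: propose a token
    and a confidence score for every position."""
    proposals = rng.integers(0, 100, size=tokens.shape)
    confidence = rng.random(tokens.shape)
    return proposals, confidence

def parallel_unmask(length=16, steps=4):
    """Iterative parallel decoding: each round commits the most confident
    fraction of the still-masked positions, not one token at a time."""
    tokens = np.full(length, MASK)
    for step in range(steps, 0, -1):
        proposals, conf = predict_tokens(tokens)
        masked = tokens == MASK
        keep = max(1, masked.sum() // step)          # ~1/steps per round
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:keep]] = proposals[order[:keep]]
    return tokens

out = parallel_unmask()
print((out != MASK).all())
```

After `steps` rounds every position is filled; a real model would recompute predictions conditioned on the tokens committed so far.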

VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites

This paper introduces VisText-Mosquito, a unified multimodal dataset and framework that integrates visual detection, segmentation, and textual explanation to enable AI-driven proactive identification and analysis of mosquito breeding sites for disease prevention.

Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda · 2026-04-14 · 💬 cs.CL

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

PRIX is a lightweight, camera-only, end-to-end autonomous driving framework that uses a Context-aware Recalibration Transformer and a generative planning head to predict safe trajectories directly from raw pixels. It achieves state-of-the-art performance on the NavSim and nuScenes benchmarks while substantially reducing model size and inference cost compared to LiDAR-dependent or BEV-based approaches.

Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt · 2026-04-14 · 🤖 cs.LG

DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation

This paper introduces DoSReMC, a domain-shift-resilient framework that improves cross-domain generalization in mammography classification by fine-tuning only the batch normalization and fully connected layers alongside adversarial training. This addresses the performance degradation caused by variation in data distributions without requiring full model retraining.

Uğurcan Akyüz, Deniz Katircioglu-Öztürk, Emre K. Süslü, Burhan Keles, Mete C. Kaya, Gamze Durhan, Meltem G. Akpınar, Figen B. Demirkazık, Gözde B. Akar · 2026-04-14 · ⚡ eess
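The core selection step (updating only batch-normalization and fully connected parameters while freezing everything else) can be sketched generically. The layer names below are hypothetical, and the adversarial-training component is not shown:

```python
# Partition a model's named parameters so that only batch-normalization and
# fully connected (classifier) parameters receive gradient updates, as in
# BN/FC-only fine-tuning. Names are hypothetical examples.
def split_trainable(named_params, trainable_keys=("bn", "fc")):
    trainable, frozen = {}, {}
    for name, param in named_params.items():
        bucket = trainable if any(k in name for k in trainable_keys) else frozen
        bucket[name] = param
    return trainable, frozen

params = {
    "conv1.weight": "...", "bn1.weight": "...", "bn1.bias": "...",
    "layer1.conv.weight": "...", "layer1.bn.weight": "...",
    "fc.weight": "...", "fc.bias": "...",
}
trainable, frozen = split_trainable(params)
print(sorted(trainable))  # only the bn* and fc* entries
```

In a deep-learning framework the `trainable` group would be handed to the optimizer while the `frozen` group has gradients disabled, so only normalization statistics/affine terms and the classifier head adapt to the new domain.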

Delta Rectified Flow Sampling for Text-to-Image Editing

The paper proposes Delta Rectified Flow Sampling (DRFS), an inversion-free framework for text-to-image editing that mitigates over-smoothing artifacts by explicitly modeling velocity-field discrepancies and introducing a time-dependent shift term. The approach unifies optimization-based and ODE-based editing while achieving superior quality and controllability on the PIE Benchmark.

Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang · 2026-04-14 · 🤖 cs.LG
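Rectified-flow samplers integrate an ODE dx/dt = v(x, t) from noise toward an image, and a time-dependent shift term amounts to adding a correction s(t) to the velocity at each step. A toy Euler-integration sketch; the linear velocity field and the `shift` callable are illustrative assumptions, not the paper's actual editing formulation:

```python
import numpy as np

def sample_rectified_flow(x0, velocity, shift=None, steps=50):
    """Euler integration of dx/dt = v(x, t) [+ s(t)] from t=0 to t=1.
    `shift` is a hypothetical time-dependent term standing in for a
    correction between two velocity fields."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        if shift is not None:
            v = v + shift(t)
        x = x + dt * v
    return x

# Toy linear velocity field transporting the state toward a fixed target.
target = np.array([1.0, -2.0])
velocity = lambda x, t: target - x
x0 = np.zeros(2)
out = sample_rectified_flow(x0, velocity)
print(np.round(out, 3))
```

With `shift=None` this is plain Euler sampling; an editing method would choose `shift(t)` so the trajectory lands on the edited target without a separate inversion pass.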

CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

CARINOX is a unified framework that enhances the compositional alignment of text-to-image diffusion models by synergizing initial noise optimization and exploration with a principled, human-judgment-correlated reward selection strategy, achieving significant performance gains over state-of-the-art methods without requiring model fine-tuning.

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban · 2026-04-14 · 💬 cs.CL
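Initial-noise optimization and exploration can be sketched as best-of-N selection over candidate noises followed by local search under a reward. The reward below is a toy stand-in: in the paper it would be a category-aware, human-judgment-correlated score of the image generated from the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(noise):
    """Toy stand-in for an alignment score of the image generated from
    `noise`; here just a smooth function with a known optimum."""
    return -float(np.sum((noise - 0.5) ** 2))

def optimize_initial_noise(dim=8, candidates=32, steps=20, sigma=0.1):
    # Exploration: draw a pool of candidate initial noises, keep the best.
    pool = rng.standard_normal((candidates, dim))
    best = pool[np.argmax([reward(z) for z in pool])]
    start_score = reward(best)
    # Optimization: greedy local search around the selected noise.
    for _ in range(steps):
        cand = best + sigma * rng.standard_normal(dim)
        if reward(cand) > reward(best):
            best = cand
    return best, start_score

z_star, start_score = optimize_initial_noise()
print(reward(z_star) >= start_score)  # True by construction
```

This is inference-time scaling: the generator itself is never fine-tuned; only the starting noise is searched over.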

Inferring Dynamic Physical Properties from Video Foundation Models

This paper introduces new synthetic and real-world video datasets for predicting dynamic physical properties such as elasticity, viscosity, and friction, and evaluates several inference strategies: classical computer-vision baselines, prompt-based adaptation of video foundation models, and multimodal large language models. Pre-trained generative and self-supervised video models perform comparably to one another, approach an oracle baseline, and currently outperform the MLLMs.

Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman · 2026-04-14 · 🤖 cs.LG