SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

This paper introduces SJD-PV, a training-free acceleration framework for autoregressive image generation that leverages phrase-level speculative verification based on token co-occurrence statistics to jointly validate multiple correlated tokens, achieving up to 30% faster decoding without compromising visual fidelity.

Zhehao Yu, Baoquan Zhang, Bingqi Shan, Xinhao Liu, Dongliang Zhou, Guotao Liang, Guangming Ye, Yunming Ye · Tue, 10 Ma · cs
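As a rough illustration of the phrase-verification idea in SJD-PV, the sketch below jointly accepts a block of drafted tokens when their pairwise co-occurrence statistics clear a threshold, and falls back to single-token acceptance otherwise. All names, the scoring rule, and the threshold are our assumptions for illustration; the paper's actual verifier operates on the autoregressive model's token distributions, not raw counts.

```python
from collections import Counter

def build_cooccurrence(corpus, window=2):
    """Count how often token pairs co-occur within a short window (toy statistic)."""
    counts = Counter()
    for seq in corpus:
        for i, a in enumerate(seq):
            for b in seq[i + 1:i + 1 + window]:
                counts[(a, b)] += 1
    return counts

def phrase_score(phrase, cooc):
    """Average co-occurrence count over adjacent token pairs in the phrase."""
    if len(phrase) < 2:
        return 0.0
    pairs = list(zip(phrase, phrase[1:]))
    return sum(cooc[p] for p in pairs) / len(pairs)

def verify_draft(draft, cooc, threshold=1.0, phrase_len=3):
    """Jointly accept whole phrases whose score clears the threshold;
    otherwise accept a single token conservatively (hypothetical rule)."""
    accepted, i = [], 0
    while i < len(draft):
        phrase = draft[i:i + phrase_len]
        if len(phrase) == phrase_len and phrase_score(phrase, cooc) >= threshold:
            accepted.extend(phrase)    # accept the correlated phrase as one block
            i += phrase_len
        else:
            accepted.append(draft[i])  # fall back to token-by-token acceptance
            i += 1
    return accepted
```

Accepting correlated tokens as a block is what lets a speculative decoder commit several positions per verification step instead of one.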

Hybrid Orchestration of Edge AI and Microservices via Graph-based Self-Imitation Learning

This paper introduces SIL-GPO, a reinforcement learning framework that combines graph attention networks with self-imitation learning to optimize the joint deployment and routing of heterogeneous edge AI and microservices, significantly reducing end-to-end latency and improving resource utilization compared to existing methods.

Chen Yang, Jin Zheng, Yang Zhuolin, Lai Pan, Zhang Xiao, Hu Menglan, Yin Haiyan · Tue, 10 Ma · cs

AutoFigure-Edit: Generating Editable Scientific Illustration

AutoFigure-Edit is an end-to-end system that generates fully editable, high-quality scientific illustrations from long-form text with flexible style adaptation via reference images, leveraging long-context understanding and native SVG support to overcome limitations in editability and efficiency found in existing automated tools.

Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue Zhang · Tue, 10 Ma · cs

VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

This paper introduces VB, a novel benchmark designed to evaluate vision-language models' ability to determine image visibility and appropriately abstain from answering when evidence is insufficient, utilizing controlled minimal edits and specialized metrics to reveal that top-tier models like GPT-4o and Gemini 3.1 Pro significantly outperform open-source alternatives in confidence-aware accuracy and perspective reasoning.

Neil Tripathi · Tue, 10 Ma · cs
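A confidence-aware accuracy metric of the kind VB uses can be illustrated with a toy scoring rule: reward correct answers, leave abstentions neutral, and penalize wrong guesses, so a model that abstains under insufficient evidence outscores one that guesses. The function name, the `None`-as-abstention convention, and the penalty weight are our assumptions, not the benchmark's exact definition.

```python
def confidence_aware_accuracy(preds, labels, penalty=1.0):
    """+1 for a correct answer, 0 for an abstention (None), -penalty for a
    wrong answer, averaged over all questions (hypothetical metric form)."""
    score = 0.0
    for p, y in zip(preds, labels):
        if p is None:       # model abstained instead of guessing
            continue
        score += 1.0 if p == y else -penalty
    return score / len(preds)
```

Under this rule, one correct answer, one abstention, and one wrong guess net out to zero, whereas plain accuracy would still credit the guesser for lucky hits.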

RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review

The paper introduces RADAR, a multimodal benchmark comprising expert-annotated 3D abdominal CT scans and radiology report edits that enables the systematic evaluation of AI models on fine-grained clinical reasoning tasks, specifically image-text alignment and discrepancy assessment during the radiology report review process.

Zhaoyi Sun, Minal Jagtiani, Wen-wai Yim, Fei Xia, Martin Gunn, Meliha Yetisgen, Asma Ben Abacha · Tue, 10 Ma · cs

ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

The paper proposes ECHO, a multi-agent framework that utilizes iterative hypergraph operations and a "Link-then-Bind" strategy to mitigate cascading errors in Multimedia Event Extraction, achieving significant performance improvements over state-of-the-art methods on the M2E2 benchmark.

Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei Li · Tue, 10 Ma · cs

Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind

This study evaluates the adaptability of the TerraMind geospatial foundation model to hyperspectral imaging tasks without native pretraining, finding that while band selection strategies allow for moderate performance, deep learning models with native spectral support remain superior, thereby highlighting the critical need for future architectures to incorporate native spectral tokenization.

Julia Anna Leonardi, Johannes Jakubik, Paolo Fraccaro, Maria Antonia Brovelli · Tue, 10 Ma · cs

HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training

This paper introduces HARP, a deep learning framework that harmonizes multi-site in-vivo diffusion MRI data by training exclusively on easily transportable phantom scans, thereby eliminating the need for impractical multi-site human cohorts while significantly reducing inter-scanner variability.

Hwihun Jeong, Qiang Liu, Kathryn E. Keenan, Elisabeth A. Wilde, Walter Schneider, Sudhir Pathak, Anthony Zuccolotto, Lauren J. O'Donnell, Lipeng Ning, Yogesh Rathi · Tue, 10 Ma · cs

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

This paper introduces a method that enhances medical Vision-Language Models by using sequential eye-tracking data as supervision to train dedicated gaze tokens, enabling the models to mimic radiologists' visual search patterns and achieve state-of-the-art performance in both in-domain and out-of-domain medical reasoning tasks.

Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao · Tue, 10 Ma · cs

Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

This paper investigates the severe dimensional collapse and resulting robustness fragility that occur when distilling a large Vision Transformer into capacity-constrained CNNs, revealing that larger student models pack information densely but lose noise immunity, whereas extremely small models act as robust low-pass filters due to fundamental geometric limitations in asymmetric cross-modal transfer.

Kabir Thayani · Tue, 10 Ma · cs
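Dimensional collapse of the kind this paper studies is commonly quantified with an effective-rank statistic over student feature matrices: the entropy of the normalized singular-value spectrum, exponentiated. This particular metric choice is our assumption for illustration, not necessarily the paper's.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Effective rank = exp(entropy of the normalized singular-value
    spectrum); a collapsed representation yields a small value."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)                 # normalize the spectrum
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = rng.normal(size=(256, 64))                                  # well-spread features
collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 64))   # rank-2 bottleneck
```

The rank-2 bottleneck scores close to 2 regardless of the 64-dimensional ambient space, which is how such a statistic exposes collapse that raw dimensionality hides.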

SIQA: Toward Reliable Scientific Image Quality Assessment

This paper introduces the SIQA framework, which redefines scientific image quality assessment by distinguishing between perceptual alignment and scientific correctness, and demonstrates through a new benchmark that current multimodal models often achieve high scoring consistency with experts while lacking genuine scientific understanding.

Wenzhe Li, Liang Chen, Junying Wang, Yijing Guo, Ye Shen, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai · Tue, 10 Ma · cs