SGDFuse: SAM-Guided Diffusion Model for High-Fidelity Infrared and Visible Image Fusion

The paper proposes SGDFuse, a novel two-stage conditional diffusion model guided by Segment Anything Model (SAM) semantic masks, which achieves high-fidelity infrared and visible image fusion by leveraging explicit semantic priors to preserve key targets and minimize artifacts for superior downstream task performance.

Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot · 2026-03-09 · 🤖 cs.AI

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

This paper introduces "Answer-Then-Check," a novel safety alignment method that enhances LLM robustness against jailbreak attacks by training models to generate direct answers internally and then critically evaluate their safety before responding, achieving superior protection with reduced over-refusal while maintaining general reasoning capabilities through the newly constructed 80K-sample ReSA dataset.

Chentao Cao, Xiaojun Xu, Bo Han, Hang Li · 2026-03-09 · 🤖 cs.AI

Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

This paper addresses the inconsistency and structural biases in existing latency metrics for simultaneous speech-to-text translation by introducing a comprehensive meta-evaluation, proposing new metrics (YAAL and LongYAAL) and a resegmentation tool (SoftSegmenter), and implementing these solutions within the OmniSTEval toolkit to enable more reliable system assessments.

Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar · 2026-03-09 · 🤖 cs.AI

LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

The paper introduces LikePhys, a training-free evaluation method using likelihood preferences to assess intuitive physics understanding in video diffusion models, demonstrating that current models show improving capabilities in physical reasoning as they scale despite challenges with complex dynamics.

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini · 2026-03-09 · 🤖 cs.AI

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

This paper introduces "Just-In-Time Objectives," a framework that passively observes user behavior to infer and rapidly optimize for specific, real-time goals, enabling large language models to generate specialized tools and responses that significantly outperform standard generic interactions.

Michelle S. Lam, Omar Shaikh, Hallie Xu, Alice Guo, Diyi Yang, Jeffrey Heer, James A. Landay, Michael S. Bernstein · 2026-03-09 · 🤖 cs.AI

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

The paper introduces 3DThinker, a novel framework that enables vision-language models to perform 3D spatial reasoning from limited views by aligning their internal representations with a 3D foundation model and refining the reasoning process through outcome-based optimization, all without requiring explicit 3D prior inputs or labeled 3D training data.

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang · 2026-03-09 · 🤖 cs.AI

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

This paper introduces the Collaborative Battleship task to evaluate language models' information-seeking abilities and proposes Monte Carlo inference strategies inspired by Bayesian Experimental Design that significantly improve both question asking and answer accuracy, enabling weaker models to outperform humans and frontier models in strategic decision-making tasks.

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum · 2026-03-09 · 🤖 cs.AI

The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

This paper introduces the Cultural Reference Transformation (CRT) metric to evaluate how diffusion models navigate the tension between memorization and generalization in culturally iconic contexts, revealing that model behavior depends on distinct recognition and realization mechanisms influenced by factors like data frequency, textual uniqueness, and reference popularity.

Maria-Teresa De Rosa Palmini, Eva Cetinic · 2026-03-09 · 🤖 cs.AI

Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

This paper introduces Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized reinforcement learning method that employs a reparameterized policy gradient of a training-free soft Q-function, enhanced by discount factors, consistency models, and off-policy replay buffers, to effectively align diffusion models with downstream objectives while mitigating reward over-optimization and preserving sample diversity.

Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park · 2026-03-09 · 🤖 cs.AI

XR-DT: Extended Reality-Enhanced Digital Twin for Safe Motion Planning via Human-Aware Model Predictive Path Integral Control

This paper introduces XR-DT, an Extended Reality-enhanced Digital Twin framework that integrates a novel Human-Aware Model Predictive Path Integral (HA-MPPI) controller with an attention-based trajectory prediction model to enable safe, efficient, and interpretable motion planning for mobile robots operating alongside humans.

Tianyi Wang, Jiseop Byeon, Ahmad Yehia, Yiming Xu, Jihyung Park, Tianyi Zeng, Sikai Chen, Ziran Wang, Junfeng Jiao, Christian Claudel · 2026-03-09 · 🤖 cs.AI

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

This paper proposes a novel training framework that leverages the α-divergence family to explicitly filter incorrect answers and control the precision-diversity trade-off, thereby overcoming the diversity loss inherent in standard Reinforcement Learning and achieving state-of-the-art performance on the Lean theorem-proving benchmark.

Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman · 2026-03-09 · 🤖 cs.AI

Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

This paper proposes a point cloud-based framework for event-driven human pose estimation that leverages spatiotemporal properties through novel temporal slicing and sequencing modules alongside an edge-enhanced representation, achieving improved accuracy and efficiency on the DHP19 dataset without converting event streams into dense frames.

Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Pengfei Ye, Haodong Chen, Yuk Ying Chung, Qiang Qu · 2026-03-09 · 🤖 cs.AI