Planner Aware Path Learning in Diffusion Language Models Training

This paper addresses the training-inference mismatch in diffusion language models caused by planner-based sampling strategies by deriving a new Planned Evidence Lower Bound (P-ELBO) and introducing Planner Aware Path Learning (PAPL), a simple training modification that aligns training with planned inference to achieve significant performance gains across protein, text, and code generation tasks.

Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Alexander Tong, Avishek Joey Bose · 2026-03-09 · cs.LG

Diffusion Alignment as Variational Expectation-Maximization

The paper introduces Diffusion Alignment as Variational Expectation-Maximization (DAV), an iterative framework that alternates between test-time search for diverse, reward-aligned samples and model refinement to optimize diffusion models for downstream objectives while mitigating reward over-optimization and mode collapse.

Jaewoo Lee, Minsu Kim, Sanghyeok Choi, Inhyuck Song, Sujin Yun, Hyeongyu Kang, Woocheol Shin, Taeyoung Yun, Kiyoung Om, Jinkyoo Park · 2026-03-09 · cs.LG

Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits

This paper addresses the online minimization of polarization and disagreement in the Friedkin-Johnsen opinion dynamics model under incomplete information by proposing a two-stage low-rank matrix bandit algorithm that achieves a cumulative regret of $\widetilde{\mathcal{O}}\big(\max(\tfrac{1}{\kappa},\sqrt{|V|})\sqrt{|V|T}\big)$ through subspace estimation and linear bandit optimization.

Federico Cinus, Yuko Kuroki, Atsushi Miyauchi, Francesco Bonchi · 2026-03-09 · cs.LG
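For background, in the Friedkin-Johnsen model the objective being minimized has a standard closed form: with innate opinions $s$ (mean-centered) and graph Laplacian $L$, equilibrium opinions and the combined polarization-disagreement index are (this is the usual formulation from the opinion-dynamics literature, not a quotation from the paper):

$$
z^{*} = (I + L)^{-1} s, \qquad \mathcal{I}(G, s) = s^{\top} (I + L)^{-1} s .
$$

The bandit setting arises because $L$ (and hence the quadratic form above) is not fully observed, so the learner must estimate the relevant low-rank structure online while incurring regret against the best achievable index.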

Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

This paper demonstrates that while standard decoder-only models underperform compared to encoder-only architectures in cross-modal adaptation for partial differential equations, introducing novel bidirectionality-mimicking techniques like Parallel Flipping and Sequence Doubling effectively closes this performance gap.

Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam · 2026-03-09 · cs.LG

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

This paper demonstrates that injecting external verification into synthetic data retraining can prevent model collapse and yield near-term improvements, though theoretical analysis and experiments across linear regression, VAEs, and LLMs show that long-term performance ultimately converges to the verifier's knowledge center and may plateau or decline if the verifier is imperfect.

Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu · 2026-03-09 · cs.LG

Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning

This paper presents a real-time online framework that utilizes modified sliding-window Hankel Dynamic Mode Decomposition with singular-value hard thresholding and Cadzow projection to denoise partial measurements and construct predictive models for dynamic obstacle motion, enabling stable, variance-aware forecasting suitable for robotic motion planning.

Stella Kombo, Masih Haseli, Skylar X. Wei, Joel W. Burdick · 2026-03-09 · cs.LG
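To make the core mechanism concrete, here is a minimal sketch of sliding-window Hankel DMD with singular-value hard thresholding (the denoising step the summary mentions). All function names and thresholds below are illustrative choices, not the paper's implementation, and the sketch omits the Cadzow projection and variance-aware forecasting.

```python
import numpy as np

def hankel(x, depth):
    """Stack `depth` delayed copies of the 1-D series x into a Hankel matrix."""
    cols = len(x) - depth + 1
    return np.stack([x[i:i + cols] for i in range(depth)])

def hankel_dmd_predict(x, depth=10, svd_tol=1e-6, steps=5):
    """Fit a linear model on delay-embedded data and roll it forward.

    SVD hard thresholding keeps only singular values above
    svd_tol * (largest singular value), denoising the snapshot
    matrix before the least-squares fit of the DMD operator.
    """
    H = hankel(np.asarray(x, dtype=float), depth)
    X, Y = H[:, :-1], H[:, 1:]                      # time-shifted snapshot pairs
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = max(1, int(np.sum(s > svd_tol * s[0])))     # hard-threshold rank
    Ur, sr, Vr = U[:, :r], s[:r], Vt[:r]
    A = Y @ Vr.T @ np.diag(1.0 / sr) @ Ur.T         # reduced DMD operator
    col = H[:, -1]
    preds = []
    for _ in range(steps):
        col = A @ col
        preds.append(col[-1])                       # newest entry = one-step forecast
    return np.array(preds)
```

On clean periodic motion (e.g. a sinusoidal obstacle trajectory) the delay embedding is exactly low-rank, so the thresholded fit recovers the dynamics and the multi-step forecast tracks the true continuation.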

FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

The paper introduces FireScope, a novel VLM-based framework and accompanying FireScope-Bench dataset that leverage chain-of-thought reasoning to significantly improve the generalization, interpretability, and accuracy of cross-continental wildfire risk prediction by integrating visual, climatic, and geographic factors.

Mario Markov (INSAIT, Sofia University "St. Kliment Ohridski"), Stefan Maria Ailuro (INSAIT, Sofia University "St. Kliment Ohridski"), Luc Van Gool (INSAIT, Sofia University "St. Kliment Ohridski"), Konrad Schindler (ETH Zurich), Danda Pani Paudel (INSAIT, Sofia University "St. Kliment Ohridski") · 2026-03-09 · cs.LG

SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

The paper proposes SPINE, a token-selective test-time reinforcement learning framework that improves reasoning model performance by updating only high-entropy decision-critical tokens with entropy-band regularization, thereby preventing response collapse and enhancing stability without requiring external labels or reward models.

Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai · 2026-03-09 · cs.LG
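The token-selection idea is easy to illustrate: compute each token's predictive entropy and update only tokens whose entropy falls inside a band, excluding both near-deterministic and near-uniform positions. This is a minimal sketch under my own assumptions (band endpoints and function names are illustrative, not SPINE's actual values or API).

```python
import numpy as np

def entropy_band_mask(logits, lo=0.5, hi=2.5):
    """Select decision-critical tokens by per-token predictive entropy.

    logits: (T, V) array of per-token logits over a vocabulary of size V.
    Returns a boolean mask of shape (T,) that is True only where entropy
    lies inside [lo, hi], so updates skip near-deterministic tokens
    (entropy below lo) and near-uniform tokens (entropy above hi).
    """
    z = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)          # Shannon entropy (nats)
    return (ent >= lo) & (ent <= hi)
```

In a test-time RL loop, such a mask would gate which token positions contribute to the policy-gradient update, which is the mechanism the summary credits with preventing response collapse.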

Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

This paper introduces Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized reinforcement learning method that employs a reparameterized policy gradient of a training-free soft Q-function, enhanced by discount factors, consistency models, and off-policy replay buffers, to effectively align diffusion models with downstream objectives while mitigating reward over-optimization and preserving sample diversity.

Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park · 2026-03-09 · cs.AI

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

This paper proposes a novel training framework that leverages the $\alpha$-divergence family to explicitly filter incorrect answers and control the precision-diversity trade-off, thereby overcoming the diversity loss inherent in standard Reinforcement Learning and achieving state-of-the-art performance on the Lean theorem-proving benchmark.

Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman · 2026-03-09 · cs.AI
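As background on the divergence family involved (standard definitions, not drawn from the paper): for distributions $p$ and $q$, the $\alpha$-divergences are

$$
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha(\alpha-1)} \left( \sum_{x} p(x)^{\alpha}\, q(x)^{1-\alpha} - 1 \right),
$$

with $\alpha \to 1$ recovering the forward KL divergence $\mathrm{KL}(p \,\|\, q)$ (mode-covering, diversity-preserving) and $\alpha \to 0$ recovering the reverse KL $\mathrm{KL}(q \,\|\, p)$ (mode-seeking). Sweeping $\alpha$ thus interpolates between these behaviors, which is the lever the summary describes for trading precision against diversity.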