Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

This paper argues for a paradigm shift in uncertainty quantification research from single-turn question-answering to interactive LLM agents by proposing a foundational framework, identifying four key technical challenges, and outlining future directions for safety-critical applications.

Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li | 2026-03-09 | cs.AI

From Features to Actions: Explainability in Traditional and Agentic AI Systems

This paper argues that traditional attribution-based explainability methods, while effective for static predictions, fail to diagnose failures in agentic AI systems, necessitating a shift toward trace-based diagnostics that reveal state tracking inconsistencies as a primary cause of execution breakdowns.

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza | 2026-03-09 | cs.AI

Towards Autonomous Mathematics Research

This paper introduces Aletheia, an autonomous AI research agent powered by advanced reasoning models and tool use that generates, verifies, and revises mathematical proofs for problems ranging from Olympiad level to PhD-level research, achieving milestones such as fully AI-generated papers and the autonomous solution of open problems, and proposes new frameworks for quantifying AI autonomy and transparency.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong | 2026-03-09 | cs.AI

Why Human Guidance Matters in Collaborative Vibe Coding

Based on a controlled study of 737 participants, this paper demonstrates that while AI can optimize specific tasks, human guidance remains essential for effective collaborative "vibe coding," as human-led instruction significantly outperforms AI-led approaches and yields the best results when humans direct the process while AI handles evaluation.

Haoyu Hu, Raja Marjieh, Katherine M Collins, Chenyi Li, Thomas L. Griffiths, Ilia Sucholutsky, Nori Jacoby | 2026-03-09 | cs.AI

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox is a lightweight, container-free framework that leverages kernel-level isolation and environment pre-caching to significantly reduce storage and setup overhead while maintaining performance comparable to traditional container-based pipelines for scaling reinforcement learning in software engineering agents.

Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao | 2026-03-09 | cs.AI

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

This systematic literature review critiques the "ground truth" paradigm in machine learning as a positivistic fallacy that misinterprets human disagreement as noise, arguing instead for pluralistic annotation infrastructures that treat diverse subjective perspectives as high-fidelity signals essential for building culturally competent models.

Sheza Munir, Benjamin Mah, Krisha Kalsi, Shivani Kapania, Julian Posada, Edith Law, Ding Wang, Syed Ishtiaque Ahmed | 2026-03-09 | cs.AI

An Adaptive Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation to Support Business Strategy and Operations

This paper introduces AHSIV, an adaptive framework that addresses horizon-induced model ranking instability in demand forecasting by integrating horizon-aware error metrics, structural demand classification, and multi-objective optimization to provide robust, operationally coherent model selection for heterogeneous business environments.

Adolfo González, Víctor Parada | 2026-03-09 | cs.AI

IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

This paper introduces IntelliAsk, a question-generation model trained via RLVR with a novel reward model (IntelliReward) and DAPO optimization to produce high-quality, evidence-based research questions that outperform human reviewers and strong baselines in expert evaluations while also enhancing broader reasoning and writing capabilities.

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari | 2026-03-09 | cs.AI

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

The paper introduces EMPO², a hybrid reinforcement learning framework that integrates memory-augmented on- and off-policy optimization to overcome exploration bottlenecks in LLM agents, achieving significant performance gains on benchmark tasks and demonstrating superior adaptability to out-of-distribution scenarios without parameter updates.

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang | 2026-03-09 | cs.AI

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

The paper proposes CoME, a novel mobile agent architecture that employs four specialized experts with a progressive training strategy and an InfoGain-Driven DPO method to achieve balanced, decoupled enhancement of hybrid reasoning capabilities, outperforming existing dense and MoE approaches on AITZ and AMEX datasets.

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan | 2026-03-09 | cs.AI

Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery

This paper proposes a reparameterized Tensor Ring functional decomposition that leverages Implicit Neural Representations and a structured basis combination to overcome the high-frequency modeling limitations of traditional methods, achieving superior performance in multi-dimensional data recovery tasks such as image inpainting and point cloud reconstruction.

Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang | 2026-03-09 | cs.AI

How Well Does Agent Development Reflect Real-World Work?

This paper reveals a significant misalignment between current AI agent development, which is heavily programming-centric, and the broader distribution of real-world human labor and economic value, prompting the proposal of new benchmarking principles to better capture socially important and technically challenging work.

Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig | 2026-03-09 | cs.AI