cs.AI papers | Gist.Science

Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

This paper argues for a paradigm shift in uncertainty quantification research from single-turn question-answering to interactive LLM agents by proposing a foundational framework, identifying four key technical challenges, and outlining future directions for safety-critical applications.

Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li | 2026-03-09 | 🤖 cs.AI

From Features to Actions: Explainability in Traditional and Agentic AI Systems

This paper argues that traditional attribution-based explainability methods, while effective for static predictions, fail to diagnose failures in agentic AI systems, necessitating a shift toward trace-based diagnostics that reveal state tracking inconsistencies as a primary cause of execution breakdowns.

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza | 2026-03-09 | 🤖 cs.AI

Towards Autonomous Mathematics Research

This paper introduces Aletheia, an autonomous AI research agent powered by advanced reasoning models and tool use that successfully generates, verifies, and revises mathematical proofs from Olympiad problems to PhD-level research, achieving milestones such as fully AI-generated papers and the autonomous solution of open problems while proposing new frameworks for quantifying AI autonomy and transparency.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong | 2026-03-09 | 🤖 cs.AI

MERIT Feedback Elicits Better Bargaining in LLM Negotiators

This paper introduces the MERIT framework, which combines the new AgoraBench benchmark, utility-theory-based metrics, and a human-preference learning pipeline to significantly enhance Large Language Models' strategic depth and bargaining performance in complex negotiation scenarios.

Jihwan Oh, Murad Aghazada, Yooju Shin, Se-Young Yun, Taehyeon Kim | 2026-03-09 | 🤖 cs.AI

Why Human Guidance Matters in Collaborative Vibe Coding

Based on a controlled study of 737 participants, this paper demonstrates that while AI can optimize specific tasks, human guidance remains essential for effective collaborative "vibe coding," as human-led instruction significantly outperforms AI-led approaches and yields the best results when humans direct the process while AI handles evaluation.

Haoyu Hu, Raja Marjieh, Katherine M Collins, Chenyi Li, Thomas L. Griffiths, Ilia Sucholutsky, Nori Jacoby | 2026-03-09 | 🤖 cs.AI

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

The paper introduces DataChef-32B, a reinforcement learning-based system that automates the end-to-end generation of optimal data recipes for adapting Large Language Models to specific tasks, achieving performance comparable to or exceeding human-curated pipelines and official checkpoints.

Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen | 2026-03-09 | 🤖 cs.AI

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox is a lightweight, container-free framework that leverages kernel-level isolation and environment pre-caching to significantly reduce storage and setup overhead while maintaining performance comparable to traditional container-based pipelines for scaling reinforcement learning in software engineering agents.

Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao | 2026-03-09 | 🤖 cs.AI

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

This paper addresses the failure of weighted-average methods in detecting multi-turn prompt injection attacks by proposing a novel "Peak + Accumulation" proxy-level scoring formula that combines peak risk, persistence, and diversity, achieving 90.8% recall at a 1.20% false positive rate without requiring an LLM.

J Alex Corll | 2026-03-09 | 🤖 cs.AI

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

This systematic literature review critiques the "ground truth" paradigm in machine learning as a positivistic fallacy that misinterprets human disagreement as noise, arguing instead for pluralistic annotation infrastructures that treat diverse subjective perspectives as high-fidelity signals essential for building culturally competent models.

Sheza Munir, Benjamin Mah, Krisha Kalsi, Shivani Kapania, Julian Posada, Edith Law, Ding Wang, Syed Ishtiaque Ahmed | 2026-03-09 | 🤖 cs.AI

An Adaptive Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation to Support Business Strategy and Operations

This paper introduces AHSIV, an adaptive framework that addresses horizon-induced model ranking instability in demand forecasting by integrating horizon-aware error metrics, structural demand classification, and multi-objective optimization to provide robust, operationally coherent model selection for heterogeneous business environments.

Adolfo González, Víctor Parada | 2026-03-09 | 🤖 cs.AI

IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

This paper introduces IntelliAsk, a question-generation model trained via RLVR with a novel reward model (IntelliReward) and DAPO optimization to produce high-quality, evidence-based research questions that outperform human reviewers and strong baselines in expert evaluations while also enhancing broader reasoning and writing capabilities.

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari | 2026-03-09 | 🤖 cs.AI

The Compute ICE-AGE: Invariant Compute Envelope under Addressable Graph Evolution

This paper presents empirical results from a production-grade C++ implementation of the Compute ICE-AGE, a deterministic semantic state substrate that achieves invariant traversal latency and thermodynamic stability by evolving a persistent addressable memory graph under bounded local operators, thereby decoupling compute costs from token volume and context horizon.

Raymond Jay Martin II | 2026-03-09 | 🤖 cs.AI

FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment

The paper proposes FLoRG, a federated fine-tuning framework that utilizes single low-rank Gram matrix aggregation and Procrustes alignment to eliminate aggregation errors and decomposition drift, thereby achieving superior downstream accuracy and significantly reduced communication overhead compared to existing state-of-the-art methods.

Chuiyang Meng, Ming Tang, Vincent W. S. Wong | 2026-03-09 | 🤖 cs.AI

The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR $\rightarrow$ LLM Pipelines?

This paper challenges the assumption that Speech LLMs inherently outperform ASR $\rightarrow$ LLM pipelines by demonstrating through matched-backbone testing and mechanistic analysis that current Speech LLMs often function as expensive cascades relying on text representations, which can even underperform traditional pipelines under noisy conditions.

Jayadev Billa | 2026-03-09 | 🤖 cs.AI

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

The paper introduces EMPO$^2$, a hybrid reinforcement learning framework that integrates memory-augmented on- and off-policy optimization to overcome exploration bottlenecks in LLM agents, achieving significant performance gains on benchmark tasks and demonstrating superior adaptability to out-of-distribution scenarios without parameter updates.

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang | 2026-03-09 | 🤖 cs.AI

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

This paper reframes the modality collapse observed in multimodal LLMs as a mismatched decoding problem, demonstrating through information-theoretic analysis and empirical validation that the accessibility of non-text information is fundamentally limited by the decoder's training objective and scoring rule rather than the encoder's architecture or alignment.

Jayadev Billa | 2026-03-09 | 🤖 cs.AI

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

The paper proposes CoME, a novel mobile agent architecture that employs four specialized experts with a progressive training strategy and an InfoGain-Driven DPO method to achieve balanced, decoupled enhancement of hybrid reasoning capabilities, outperforming existing dense and MoE approaches on AITZ and AMEX datasets.

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan | 2026-03-09 | 🤖 cs.AI

Theory of Code Space: Do Code Agents Understand Software Architecture?

This paper introduces Theory of Code Space (ToCS), a benchmark demonstrating that AI code agents exhibit significant, model-dependent variability in their ability to maintain coherent architectural beliefs and utilize active exploration or self-scaffolding during multi-file software engineering tasks.

Grigory Sapunov | 2026-03-09 | 🤖 cs.AI

Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery

This paper proposes a reparameterized Tensor Ring functional decomposition that leverages Implicit Neural Representations and a structured basis combination to overcome the high-frequency modeling limitations of traditional methods, achieving superior performance in multi-dimensional data recovery tasks such as image inpainting and point cloud reconstruction.

Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang | 2026-03-09 | 🤖 cs.AI

How Well Does Agent Development Reflect Real-World Work?

This paper reveals a significant misalignment between current AI agent development, which is heavily programming-centric, and the broader distribution of real-world human labor and economic value, prompting the proposal of new benchmarking principles to better capture socially important and technically challenging work.

Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig | 2026-03-09 | 🤖 cs.AI

