Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

This paper argues that LLM internal states primarily reflect the retrieval of parametric knowledge rather than the truthfulness of outputs, demonstrating that hallucinations driven by spurious statistical associations are mechanistically indistinguishable from factual recall, thereby limiting the effectiveness of standard detection methods.
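The paper's claim can be illustrated with a hypothetical probing experiment: on synthetic data where hidden states encode only whether knowledge was *recalled* (not whether the output was *true*), a linear probe trained to predict truthfulness ends up tracking recall instead. All data, dimensions, and names below are invented for illustration; this is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": the signal encodes knowledge recall only;
# truthfulness mostly, but not always, follows recall (20% label noise).
n, d = 200, 16
recall = rng.integers(0, 2, n)             # was parametric knowledge retrieved?
truthful = recall ^ (rng.random(n) < 0.2)  # truth usually follows recall
hidden = rng.normal(size=(n, d))
hidden[:, 0] += 3.0 * (recall - 0.5)       # only feature 0 carries signal: recall

def fit_linear_probe(X, y, lr=0.1, steps=1000):
    """Logistic-regression probe on hidden states, fit by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w = fit_linear_probe(hidden, truthful)     # probe is *trained* on truthfulness
pred = (hidden @ w > 0).astype(int)
acc_truth = float((pred == truthful).mean())
acc_recall = float((pred == recall).mean())
print("truthfulness-probe accuracy:", acc_truth)
print("agreement with recall label:", acc_recall)
```

Because recall is the only linearly decodable signal, the probe agrees with the recall label more often than with the truthfulness label it was trained on, mirroring the paper's argument that internal-state detectors latch onto knowledge recall.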

Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng · 2026-03-09 · cs.CL

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

This paper introduces "Just-In-Time Objectives," a framework that passively observes user behavior to infer and rapidly optimize for specific, real-time goals, enabling large language models to generate specialized tools and responses that significantly outperform standard generic interactions.

Michelle S. Lam, Omar Shaikh, Hallie Xu, Alice Guo, Diyi Yang, Jeffrey Heer, James A. Landay, Michael S. Bernstein · 2026-03-09 · cs.AI

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

This paper introduces the Collaborative Battleship task to evaluate language models' information-seeking abilities, and proposes Monte Carlo inference strategies inspired by Bayesian Experimental Design that significantly improve both question-asking and answer accuracy, enabling weaker models to outperform humans and frontier models in strategic decision-making tasks.
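A Bayesian-Experimental-Design-style question selector of the kind summarized above can be sketched as estimating each candidate question's expected information gain (EIG) over Monte Carlo samples of the hidden state. This is a generic EIG sketch under that assumption, not the paper's implementation; the toy `q_half`/`q_cell` questions are invented.

```python
import math
from collections import Counter

def expected_information_gain(hypotheses, question):
    """Estimate the EIG of asking `question`, given equally weighted
    Monte Carlo samples of the hidden state. Because each hypothesis
    answers deterministically, the EIG equals the entropy of the
    induced answer distribution."""
    answers = Counter(question(h) for h in hypotheses)
    n = len(hypotheses)
    return -sum((c / n) * math.log2(c / n) for c in answers.values())

# Toy Battleship-style example: a hidden ship occupies one of 4 cells.
hypotheses = [0, 1, 2, 3]          # equally likely ship positions
q_half = lambda h: h < 2           # "Is the ship in the left half?"
q_cell = lambda h: h == 0          # "Is the ship in cell 0?"

print(expected_information_gain(hypotheses, q_half))  # 1.0 bit
print(expected_information_gain(hypotheses, q_cell))  # ≈ 0.81 bits
```

The agent would ask the highest-EIG question; here the halving question dominates the single-cell guess, which is the "ask questions first" behavior the benchmark rewards.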

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum · 2026-03-09 · cs.AI

DETECT: Determining Ease and Textual Clarity of German Text Simplifications

This paper introduces DETECT, the first German-specific metric for evaluating automatic text simplification that is trained on synthetic LLM-generated data and validated on a large human-annotated dataset, demonstrating superior correlation with human judgments across simplicity, meaning preservation, and fluency compared to existing general-purpose metrics.

Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao · 2026-03-09 · cs.CL

Co-Layout: LLM-driven Co-optimization for Interior Layout

This paper presents Co-Layout, a novel framework that integrates large language models with grid-based integer programming and a coarse-to-fine optimization strategy to jointly optimize room layouts and furniture placement, significantly outperforming existing two-stage pipelines in both solution quality and computational efficiency.

Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu, Yirui Guan, Ligang Liu · 2026-03-09 · cs.CL

SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

The paper proposes SPINE, a token-selective test-time reinforcement learning framework that improves reasoning model performance by updating only high-entropy decision-critical tokens with entropy-band regularization, thereby preventing response collapse and enhancing stability without requiring external labels or reward models.
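The token-selection step described above can be sketched as masking tokens whose predictive entropy falls inside a band, so that only those tokens receive test-time RL updates. This is an illustrative guess at the mechanism; the band thresholds `low`/`high` are arbitrary placeholders, not SPINE's actual values.

```python
import numpy as np

def entropy_band_mask(logits, low=0.5, high=3.0):
    """Select tokens whose predictive entropy lies inside [low, high].
    `logits` has shape (seq_len, vocab). Only masked (decision-critical)
    tokens would receive gradient updates; near-deterministic tokens
    are frozen, which is what guards against response collapse."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ent = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
    return (ent >= low) & (ent <= high)

# Toy example: 3 tokens over a 4-word vocabulary.
logits = np.array([
    [10.0, 0.0, 0.0, 0.0],   # near-deterministic -> entropy ~ 0, excluded
    [1.0, 0.8, 0.2, 0.1],    # moderate entropy   -> inside the band
    [0.0, 0.0, 0.0, 0.0],    # uniform -> entropy ln(4) ~ 1.39, inside
])
print(entropy_band_mask(logits))  # tokens 1 and 2 are selected
```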

Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai · 2026-03-09 · cs.LG

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

This paper identifies and systematically studies "Tools Orchestration Privacy Risk" (TOP-R), a novel vulnerability where autonomous agents inadvertently synthesize sensitive information from non-sensitive tool fragments, and addresses it by introducing the TOP-Bench benchmark, the H-Score metric, and effective mitigation strategies that significantly improve the safety-utility trade-off.

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · 2026-03-09 · cs.AI

Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation

This paper introduces the PyMUSAS framework and presents the largest evaluation to date of the USAS semantic tagging system across five languages, demonstrating that a hybrid approach combining rule-based methods with neural networks, trained on a newly created silver-standard dataset, significantly improves multilingual semantic annotation performance.

Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, Gearóid Ó Donnchadha, Mícheál Ó Meachair, Scott Piao, Elaine Uí Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang · 2026-03-09 · cs.CL

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

This paper introduces Latent Exploration Decoding (LED), a training-free decoding strategy that leverages high-entropy intermediate layer posteriors to counteract exploration collapse in post-trained Large Reasoning Models, thereby significantly improving accuracy across multiple benchmarks.
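One plausible reading of the decoding strategy summarized above, sketched with invented details: blend a flatter intermediate-layer posterior into the sharp final-layer distribution before sampling. The mixing rule and `alpha` coefficient are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return float(-(p * np.log(p)).sum())

def latent_exploration_mix(final_logits, intermediate_logits, alpha=0.3):
    """Blend the (typically collapsed) final-layer posterior with a
    higher-entropy intermediate-layer posterior, restoring exploration
    at decode time without any training. alpha=0 recovers ordinary decoding."""
    return (1 - alpha) * softmax(final_logits) + alpha * softmax(intermediate_logits)

# Toy next-token distributions over a 4-word vocabulary.
final_logits = np.array([5.0, 0.0, 0.0, 0.0])         # post-training: collapsed
intermediate_logits = np.array([1.0, 0.9, 0.8, 0.7])  # latent: still exploratory

mixed = latent_exploration_mix(final_logits, intermediate_logits)
print(entropy(softmax(final_logits)), entropy(mixed))  # mixing raises entropy
```

Sampling from `mixed` rather than the final posterior keeps low-probability continuations reachable, which is the exploration-collapse failure mode the paper targets.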

Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan · 2026-03-09 · cs.LG

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

This paper presents a collection of case studies demonstrating how researchers successfully collaborate with Google's Gemini models to solve open problems and generate new proofs in theoretical computer science and other fields, while extracting common techniques for effective human-AI partnership in scientific discovery.

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, Vahab Mirrokni · 2026-03-09 · cs.AI

Towards Autonomous Mathematics Research

This paper introduces Aletheia, an autonomous AI research agent powered by advanced reasoning models and tool use that generates, verifies, and revises mathematical proofs ranging from Olympiad problems to PhD-level research. Aletheia has achieved milestones such as fully AI-generated papers and the autonomous solution of open problems, and the paper proposes new frameworks for quantifying AI autonomy and transparency.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong · 2026-03-09 · cs.AI