Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

This paper argues that LLM internal states primarily reflect the retrieval of parametric knowledge rather than the truthfulness of outputs, demonstrating that hallucinations driven by spurious statistical associations are mechanistically indistinguishable from factual recall, thereby limiting the effectiveness of standard detection methods.
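The paper's claim can be illustrated with a hypothetical probing experiment: on synthetic data where hidden states encode only whether knowledge was *recalled* (not whether the output was *true*), a linear probe trained to predict truthfulness ends up tracking recall instead. All data, dimensions, and names below are invented for illustration; this is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": the signal encodes knowledge recall only;
# truthfulness mostly, but not always, follows recall (20% label noise).
n, d = 200, 16
recall = rng.integers(0, 2, n)             # was parametric knowledge retrieved?
truthful = recall ^ (rng.random(n) < 0.2)  # truth usually follows recall
hidden = rng.normal(size=(n, d))
hidden[:, 0] += 3.0 * (recall - 0.5)       # only feature 0 carries signal: recall

def fit_linear_probe(X, y, lr=0.1, steps=1000):
    """Logistic-regression probe on hidden states, fit by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w = fit_linear_probe(hidden, truthful)     # probe is *trained* on truthfulness
pred = (hidden @ w > 0).astype(int)
acc_truth = float((pred == truthful).mean())
acc_recall = float((pred == recall).mean())
print("truthfulness-probe accuracy:", acc_truth)
print("agreement with recall label:", acc_recall)
```

Because recall is the only linearly decodable signal, the probe agrees with the recall label more often than with the truthfulness label it was trained on, mirroring the paper's argument that internal-state detectors latch onto knowledge recall.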

Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng · 2026-03-09 · cs.CL

Just-In-Time Objectives: A General Approach for Specialized AI Interactions

This paper introduces "Just-In-Time Objectives," a framework that passively observes user behavior to infer and rapidly optimize for specific, real-time goals, enabling large language models to generate specialized tools and responses that significantly outperform standard generic interactions.

Michelle S. Lam, Omar Shaikh, Hallie Xu, Alice Guo, Diyi Yang, Jeffrey Heer, James A. Landay, Michael S. Bernstein · 2026-03-09 · cs.AI

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

This paper introduces the Collaborative Battleship task to evaluate language models' information-seeking abilities, and proposes Monte Carlo inference strategies inspired by Bayesian Experimental Design that significantly improve both question-asking and answer accuracy, enabling weaker models to outperform humans and frontier models in strategic decision-making tasks.
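A Bayesian-Experimental-Design-style question selector of the kind summarized above can be sketched as estimating each candidate question's expected information gain (EIG) over Monte Carlo samples of the hidden state. This is a generic EIG sketch under that assumption, not the paper's implementation; the toy `q_half`/`q_cell` questions are invented.

```python
import math
from collections import Counter

def expected_information_gain(hypotheses, question):
    """Estimate the EIG of asking `question`, given equally weighted
    Monte Carlo samples of the hidden state. Because each hypothesis
    answers deterministically, the EIG equals the entropy of the
    induced answer distribution."""
    answers = Counter(question(h) for h in hypotheses)
    n = len(hypotheses)
    return -sum((c / n) * math.log2(c / n) for c in answers.values())

# Toy Battleship-style example: a hidden ship occupies one of 4 cells.
hypotheses = [0, 1, 2, 3]          # equally likely ship positions
q_half = lambda h: h < 2           # "Is the ship in the left half?"
q_cell = lambda h: h == 0          # "Is the ship in cell 0?"

print(expected_information_gain(hypotheses, q_half))  # 1.0 bit
print(expected_information_gain(hypotheses, q_cell))  # ≈ 0.81 bits
```

The agent would ask the highest-EIG question; here the halving question dominates the single-cell guess, which is the "ask questions first" behavior the benchmark rewards.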

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum · 2026-03-09 · cs.AI

DETECT: Determining Ease and Textual Clarity of German Text Simplifications

This paper introduces DETECT, the first German-specific metric for evaluating automatic text simplification that is trained on synthetic LLM-generated data and validated on a large human-annotated dataset, demonstrating superior correlation with human judgments across simplicity, meaning preservation, and fluency compared to existing general-purpose metrics.

Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao · 2026-03-09 · cs.CL

Co-Layout: LLM-driven Co-optimization for Interior Layout

This paper presents Co-Layout, a novel framework that integrates large language models with grid-based integer programming and a coarse-to-fine optimization strategy to jointly optimize room layouts and furniture placement, significantly outperforming existing two-stage pipelines in both solution quality and computational efficiency.

Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu, Yirui Guan, Ligang Liu · 2026-03-09 · cs.CL

SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

The paper proposes SPINE, a token-selective test-time reinforcement learning framework that improves reasoning model performance by updating only high-entropy decision-critical tokens with entropy-band regularization, thereby preventing response collapse and enhancing stability without requiring external labels or reward models.
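The token-selection step described above can be sketched as masking tokens whose predictive entropy falls inside a band, so that only those tokens receive test-time RL updates. This is an illustrative guess at the mechanism; the band thresholds `low`/`high` are arbitrary placeholders, not SPINE's actual values.

```python
import numpy as np

def entropy_band_mask(logits, low=0.5, high=3.0):
    """Select tokens whose predictive entropy lies inside [low, high].
    `logits` has shape (seq_len, vocab). Only masked (decision-critical)
    tokens would receive gradient updates; near-deterministic tokens
    are frozen, which is what guards against response collapse."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ent = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
    return (ent >= low) & (ent <= high)

# Toy example: 3 tokens over a 4-word vocabulary.
logits = np.array([
    [10.0, 0.0, 0.0, 0.0],   # near-deterministic -> entropy ~ 0, excluded
    [1.0, 0.8, 0.2, 0.1],    # moderate entropy   -> inside the band
    [0.0, 0.0, 0.0, 0.0],    # uniform -> entropy ln(4) ~ 1.39, inside
])
print(entropy_band_mask(logits))  # tokens 1 and 2 are selected
```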

Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai · 2026-03-09 · cs.LG

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

This paper identifies and systematically studies "Tools Orchestration Privacy Risk" (TOP-R), a novel vulnerability where autonomous agents inadvertently synthesize sensitive information from non-sensitive tool fragments, and addresses it by introducing the TOP-Bench benchmark, the H-Score metric, and effective mitigation strategies that significantly improve the safety-utility trade-off.

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · 2026-03-09 · cs.AI

Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation

This paper introduces the PyMUSAS framework and presents the largest evaluation to date of the USAS semantic tagging system across five languages, demonstrating that a hybrid approach combining rule-based methods with neural networks, trained on a newly created silver-standard dataset, significantly improves multilingual semantic annotation performance.

Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, Gearóid Ó Donnchadha, Mícheál Ó Meachair, Scott Piao, Elaine Uí Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang · 2026-03-09 · cs.CL

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

This paper introduces Latent Exploration Decoding (LED), a training-free decoding strategy that leverages high-entropy intermediate layer posteriors to counteract exploration collapse in post-trained Large Reasoning Models, thereby significantly improving accuracy across multiple benchmarks.
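One plausible reading of the decoding strategy summarized above, sketched with invented details: blend a flatter intermediate-layer posterior into the sharp final-layer distribution before sampling. The mixing rule and `alpha` coefficient are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return float(-(p * np.log(p)).sum())

def latent_exploration_mix(final_logits, intermediate_logits, alpha=0.3):
    """Blend the (typically collapsed) final-layer posterior with a
    higher-entropy intermediate-layer posterior, restoring exploration
    at decode time without any training. alpha=0 recovers ordinary decoding."""
    return (1 - alpha) * softmax(final_logits) + alpha * softmax(intermediate_logits)

# Toy next-token distributions over a 4-word vocabulary.
final_logits = np.array([5.0, 0.0, 0.0, 0.0])         # post-training: collapsed
intermediate_logits = np.array([1.0, 0.9, 0.8, 0.7])  # latent: still exploratory

mixed = latent_exploration_mix(final_logits, intermediate_logits)
print(entropy(softmax(final_logits)), entropy(mixed))  # mixing raises entropy
```

Sampling from `mixed` rather than the final posterior keeps low-probability continuations reachable, which is the exploration-collapse failure mode the paper targets.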

Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan · 2026-03-09 · cs.LG

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

This paper presents a collection of case studies demonstrating how researchers successfully collaborate with Google's Gemini models to solve open problems and generate new proofs in theoretical computer science and other fields, while extracting common techniques for effective human-AI partnership in scientific discovery.

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, Vahab Mirrokni · 2026-03-09 · cs.AI

Towards Autonomous Mathematics Research

This paper introduces Aletheia, an autonomous AI research agent powered by advanced reasoning models and tool use that generates, verifies, and revises mathematical proofs ranging from Olympiad problems to PhD-level research. Aletheia has achieved milestones such as fully AI-generated papers and the autonomous solution of open problems, and the paper proposes new frameworks for quantifying AI autonomy and transparency.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong · 2026-03-09 · cs.AI