cs.CL papers | Gist.Science

Does Scientific Writing Converge to U.S. English? Evidence from Generative AI-Assisted Publications

Using a large-scale analysis of 5.65 million scientific articles, this study finds that generative AI tools are driving non-English-speaking authors to increasingly converge toward U.S. English stylistic norms, particularly in contexts where language barriers have historically been most significant, thereby reducing publication obstacles while raising questions about linguistic diversity.

Dragan Filimonovic, Christian Rutzer, Jeffrey Macher, Rolf WederWed, 11 Ma💬 cs.CL

Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter BellWed, 11 Ma⚡ eess

ThinkQE: Query Expansion via an Evolving Thinking Process

The paper introduces ThinkQE, a test-time query expansion framework that enhances web search retrieval by combining a thinking-based process for deep semantic exploration with an iterative corpus-interaction strategy to refine expansions, thereby outperforming existing LLM-based and training-intensive methods on diverse benchmarks.

Yibin Lei, Tao Shen, Andrew YatesWed, 11 Ma💬 cs.CL

Image Captioning via Compact Bidirectional Architecture

This paper introduces a Compact Bidirectional Transformer model for image captioning that tightly couples left-to-right and right-to-left flows to leverage bidirectional context in parallel, achieving state-of-the-art results on the MSCOCO benchmark through sentence-level ensembling and an extended two-flow self-critical training strategy.

Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng WangWed, 11 Ma💬 cs.CL

PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking

PonderLM-3 introduces a pretraining framework that enables token-wise adaptive computation by using differentiable attention masking during training and hard pruning at inference, allowing models to selectively allocate additional compute only where beneficial to achieve superior efficiency and performance compared to uniform or fixed-step approaches.

He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan LinWed, 11 Ma💬 cs.CL

SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

The paper introduces SkillCraft, a benchmark and evaluation protocol designed to test and enhance LLM agents' ability to abstract, compose, and reuse higher-level tool combinations as "skills," demonstrating that such compositional learning significantly improves task success rates and reduces token usage by up to 80%.

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye TehWed, 11 Ma💬 cs.CL

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

This paper introduces AuditBench, a benchmark comprising 56 language models with implanted hidden behaviors, to evaluate alignment auditing techniques and reveal that while scaffolded black-box prompting tools are most effective, their performance often degrades when integrated into autonomous agents, and auditing difficulty varies significantly based on the models' training methods.

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan WangWed, 11 Ma💬 cs.CL

Query-focused and Memory-aware Reranker for Long Context Processing

This paper proposes a lightweight, memory-aware reranking framework that leverages attention scores from selected heads in small-scale models to achieve state-of-the-art performance in long-context processing and dialogue understanding by providing continuous, listwise relevance scores without requiring Likert-scale supervision.

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie ZhouWed, 11 Ma💬 cs.CL

Pretraining with Token-Level Adaptive Latent Chain-of-Thought

This paper introduces a pretraining method that internalizes token-level adaptive latent Chain-of-Thought trajectories, enabling models to dynamically allocate computation to difficult tokens during general text training to improve performance and efficiency without increasing model parameters.

Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan LinWed, 11 Ma💬 cs.CL

EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

This paper introduces EVM-QuestBench, an execution-grounded benchmark designed to rigorously evaluate the safety and accuracy of natural-language transaction code generation on EVM-compatible chains through dynamic, fork-based testing of 107 atomic and composite tasks.

Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu ShiWed, 11 Ma💬 cs.CL

DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

The paper introduces DEER, a comprehensive benchmark designed to evaluate deep research agents on expert report generation by systematizing evaluation criteria through a detailed taxonomy, providing expert guidance for LLM judges, and implementing a claim verification architecture to diagnose current systems' limitations in logical completeness and expert-level fulfillment.

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak LeeWed, 11 Ma💬 cs.CL

From Veracity to Diffusion: Adressing Operational Challenges in Moving From Fake-News Detection to Information Disorders

This paper compares fake-news detection and virality prediction across two datasets, revealing that while veracity prediction remains stable with strong embeddings, diffusion-based forecasting is highly sensitive to operational choices, thereby proposing practical, lightweight pipelines to address these challenges in misinformation research.

Francesco Paolo Savatteri (ENC), Chahan Vidal-Gorène (CJM, LIPN), Florian Cafiero (ENC)Wed, 11 Ma💬 cs.CL

PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

This paper introduces PRISM, a persona-reasoned multimodal framework and the accompanying U-MStance dataset, to address the limitations of pseudo-multimodality and user homogeneity in conversational stance detection by leveraging longitudinal user personas and chain-of-thought reasoning for more realistic, user-centric attitude interpretation.

Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng XuWed, 11 Ma💬 cs.CL

Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates

The paper proposes ReViewGraph, a novel framework that simulates multi-round reviewer-author debates using LLM-based agents, encodes their argumentative interactions into a heterogeneous graph, and applies graph neural networks to achieve more accurate and reasoned paper reviews, outperforming existing baselines by 15.73%.

Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min YangWed, 11 Ma💬 cs.CL

SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

The paper introduces SynthWorlds, a scalable framework that constructs parallel real and synthetic corpora with identical structures to disentangle and evaluate the distinct contributions of reasoning and parametric knowledge in language models, revealing a persistent performance gap even when knowledge is augmented.

Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim AlthoffWed, 11 Ma💬 cs.CL

DRBench: A Realistic Benchmark for Enterprise Deep Research

This paper introduces DRBench, a realistic benchmark comprising 100 human-verified, multi-step deep research tasks across 10 enterprise domains that evaluate AI agents on their ability to synthesize information from both public web and private company data to generate accurate, structured reports.

Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Peter West, Giuseppe Carenini, Christopher Pal, Alexandre Drouin, Issam H. LaradjiWed, 11 Ma💬 cs.CL

Quantifying Genuine Awareness in Hallucination Prediction Beyond Question-Side Shortcuts

This paper introduces the Approximate Question-side Effect (AQE) methodology to demonstrate that current hallucination detection models largely rely on question-side shortcuts rather than genuine awareness of their internal information, limiting their generalizability to real-world scenarios.

Yeongbin Seo, Dongha Lee, Jinyoung YeoWed, 11 Ma💬 cs.CL

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

The paper introduces SimpleQA Verified, a rigorously filtered 1,000-prompt benchmark designed to overcome the limitations of OpenAI's original SimpleQA by providing a more reliable evaluation of LLM factuality, on which Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6.

Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan DasWed, 11 Ma💬 cs.CL

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

This paper identifies and provides a mechanistic explanation for "Reasoning-Induced Misalignment" (RIM), a phenomenon where enhanced reasoning capabilities lead to safety failures through specific attention mechanisms that reduce refusal and increased neural entanglement between reasoning and safety that causes catastrophic forgetting.

Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan HeWed, 11 Ma💬 cs.CL

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

This paper introduces AgentCoMa, a new benchmark demonstrating that while large language models can handle isolated commonsense and mathematical reasoning steps, their performance significantly degrades when combining both types of reasoning in real-world scenarios, revealing a substantial brittleness that is not observed in human annotators or prior benchmarks.

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek ReiWed, 11 Ma💬 cs.CL