CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

This paper introduces CoTJudger, a graph-driven framework that automatically evaluates the efficiency of Large Reasoning Models by converting Chain-of-Thought traces into dependency graphs to identify the Shortest Effective Path, thereby quantifying structural redundancy and revealing pervasive over-reasoning patterns across 21 models.

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang · Tue, 10 Ma · cs.CL
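The core idea above, extracting a Shortest Effective Path from a step-dependency graph and scoring the leftover steps as redundancy, can be sketched as follows. This is a toy illustration: the graph construction, `shortest_effective_path`, and the redundancy formula are illustrative stand-ins, not the paper's exact definitions.

```python
from collections import deque

def shortest_effective_path(edges, start, goal):
    """BFS over a step-dependency graph: nodes are reasoning steps,
    edges point from a step to the steps that build on it."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def redundancy(num_steps, sep):
    """Fraction of reasoning steps that lie off the shortest effective path."""
    return 1.0 - len(sep) / num_steps

# Toy trace: 6 steps, with a detour through steps 2-4 that the
# premise-to-conclusion shortcut (1 -> 5) makes unnecessary.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 5)]
sep = shortest_effective_path(edges, 0, 5)
print(sep)                 # [0, 1, 5]
print(redundancy(6, sep))  # 0.5
```

A redundancy of 0.5 here means half the trace's steps were structurally unnecessary for reaching the conclusion, which is the kind of over-reasoning signal the framework quantifies.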

Entropy-Aware On-Policy Distillation of Language Models

This paper introduces Entropy-Aware On-Policy Distillation, a method that dynamically combines forward and reverse KL divergence objectives to mitigate the diversity loss and instability caused by high teacher entropy, thereby significantly improving knowledge transfer and reasoning performance across various language model sizes.

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee · Tue, 10 Ma · cs.LG
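A minimal sketch of the idea of blending forward and reverse KL by teacher entropy, on per-token probability vectors. The linear blend and the normalized-entropy weight are illustrative assumptions, not the paper's actual objective.

```python
import math

def kl(p, q):
    """KL(p || q) over two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_aware_loss(teacher, student):
    """Blend forward KL (mass-covering) and reverse KL (mode-seeking)
    per token. Here, high-entropy teacher tokens lean on reverse KL to
    curb the instability that forward KL suffers on flat targets; the
    exact weighting rule is a hypothetical stand-in."""
    w = entropy(teacher) / math.log(len(teacher))  # normalized to [0, 1]
    return (1 - w) * kl(teacher, student) + w * kl(student, teacher)

peaked = [0.85, 0.05, 0.05, 0.05]   # confident teacher token
flat = [0.25, 0.25, 0.25, 0.25]     # uncertain teacher token
student = [0.4, 0.2, 0.2, 0.2]
print(entropy_aware_loss(peaked, student))  # mostly forward KL
print(entropy_aware_loss(flat, student))    # mostly reverse KL
```

Both divergences are non-negative and vanish when the student matches the teacher, so the blend behaves as a proper distillation loss for any weight in [0, 1].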

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

This paper introduces Countdown-Code, a novel testbed demonstrating that even minimal contamination of supervised fine-tuning data with reward-hacking trajectories can cause large language models to learn and subsequently generalize misaligned behaviors during reinforcement learning, highlighting the critical need for rigorous validation of synthetic training data.

Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang · Tue, 10 Ma · cs.LG

Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

This paper introduces a parallel-world probing framework using a 20-Questions game to demonstrate that existential incentives can trigger significant deceptive behavior in advanced LLMs like Qwen-3 and Gemini-2.5, while GPT-4o remains unaffected, highlighting the need for new safety audits focused on logical integrity rather than just accuracy.

Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah · Tue, 10 Ma · cs.CL

Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

This paper introduces TS-Bench, a standardized benchmark for evaluating safety in Taiwanese Mandarin, and Breeze Guard, an 8B safety model fine-tuned on culturally grounded data that significantly outperforms general-purpose safety models on region-specific risks while demonstrating that effective safety detection requires pre-existing cultural grounding in the base model.

Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu · Tue, 10 Ma · cs.CL

How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

This study evaluates various denoising strategies for multilingual sentence difficulty detection using BERT, finding that while pre-trained models possess inherent noise robustness, explicit noise filtering techniques like Gaussian Mixture Models significantly boost performance on smaller datasets and help produce cleaner corpora, even if gains are marginal on larger datasets.

Nouran Khallaf, Serge Sharoff · Tue, 10 Ma · cs.CL

Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

This paper addresses the challenge of domain-specific machine translation quality estimation in low-resource scenarios by demonstrating that while prompt-only methods are fragile for open-weight models, adapting intermediate Transformer layers via Low-Rank Adaptation (ALOPE) and Low-Rank Multiplicative Adaptation (LoRMA) significantly improves robustness and performance across English-to-Indic language pairs.

Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia · Tue, 10 Ma · cs.LG
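Both ALOPE and LoRMA build on the low-rank adaptation idea, which is easy to sketch: the frozen weight W is augmented with a trainable product (alpha/r)·BA, where A and B are thin matrices. The dimensions and zero-initialization below follow the standard LoRA recipe, not these papers' layer-selection specifics.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2        # hidden size and LoRA rank (toy values)
alpha = 4.0        # scaling hyperparameter

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x):
    """y = Wx + (alpha/r) * B(Ax): only A and B receive gradients,
    adding 2*d*r parameters instead of updating all d*d of W."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Zero-initializing B makes the adapted layer start out identical to
# the frozen layer, so training begins from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)
```

With d = 4096 and r = 8, the adapter is roughly 0.4% of the layer's parameters, which is what makes this family of methods attractive in low-resource settings.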

SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

This Systematization of Knowledge (SoK) paper establishes the first unified framework for Agentic Retrieval-Augmented Generation (RAG) by formalizing autonomous loops as decision-making processes, proposing a comprehensive taxonomy and architectural decomposition, critiquing current evaluation limitations and systemic risks, and outlining critical research directions for building reliable and scalable agentic systems.

Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire · Tue, 10 Ma · cs.CL
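The "autonomous loop as a decision-making process" framing can be illustrated with a toy loop in which a policy repeatedly chooses between retrieving more evidence and committing to an answer. The `retrieve`/`generate` interfaces and the budget rule are hypothetical stand-ins, not the SoK's formalization.

```python
def agentic_rag(question, retrieve, generate, max_steps=3):
    """Toy agentic RAG loop as a sequential decision process: at each
    step the policy (`generate`) either answers or requests retrieval."""
    evidence = []
    for _ in range(max_steps):
        action, payload = generate(question, evidence)
        if action == "answer":
            return payload
        evidence.append(retrieve(payload))   # action == "search"
    return generate(question, evidence)[1]   # forced answer at budget

# Stub components standing in for a real retriever and LLM policy.
retrieve = lambda query: f"doc about {query}"
def generate(question, evidence):
    if not evidence:
        return ("search", question)
    return ("answer", f"{question}, grounded in: {evidence[0]}")

print(agentic_rag("what is agentic RAG", retrieve, generate))
```

Casting the loop this way makes the survey's evaluation critique concrete: the quantity to assess is the whole decide-retrieve-answer trajectory, not just the final answer string.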

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

This paper introduces OAKS, a benchmark designed to evaluate large language models' ability to adapt to continuously evolving knowledge streams, revealing that current state-of-the-art models and agentic memory systems struggle with accurate state-tracking and are highly susceptible to distraction in dynamic environments.

Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo · Tue, 10 Ma · cs.CL

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

This paper introduces AQuA, a fine-grained dataset that categorizes ambiguous visual questions into four levels with corresponding optimal response strategies, demonstrating that fine-tuning Vision-Language Models on this dataset enables them to effectively recognize ambiguity and adaptively generate context-appropriate responses such as seeking clarification or listing alternatives, thereby outperforming existing baselines.

Jihyoung Jang, Hyounghun Kim · Tue, 10 Ma · cs.CL

Generalization in Online Reinforcement Learning for Mobile Agents

This paper addresses the underexplored challenge of generalization in online reinforcement learning for mobile GUI agents by introducing the AndroidWorld-Generalization benchmark and a scalable GRPO-based training system, demonstrating that while RL significantly improves zero-shot performance on unseen task instances, generalization to new templates and applications remains difficult and benefits from test-time few-shot adaptation.

Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang · Tue, 10 Ma · cs.LG