Benchmarking Motivational Interviewing Competence of Large Language Models

This study benchmarks the Motivational Interviewing competence of ten large language models against human therapists using the MITI framework, finding that top-performing models achieve proficiency on real-world clinical transcripts and are difficult for psychiatrists to distinguish from human responses, suggesting their viability for expanding counseling in low-resource settings.

Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla + 3 more · 2026-03-05 · 💬 cs.CL

Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling

This paper addresses the limitations of hierarchical models in Rhetorical Role Labeling by proposing prototype-based methods that integrate local context with global semantic representations, introducing the new SCOTUS-Law dataset, and demonstrating consistent performance improvements across legal, medical, and scientific domains.

Anas Belfathi, Nicolas Hernandez, Laura Monceaux + 4 more · 2026-03-05 · 💬 cs.CL

From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures

This paper proposes a neuro-symbolic multi-agent system that leverages hypernym-hyponym semantic relations to extract information from Cyber Threat Intelligence reports and automatically generate CLIPS-based firewall rules, demonstrating superior performance in threat mitigation compared to baseline approaches.

Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero + 1 more · 2026-03-05 · 🤖 cs.AI

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

This paper evaluates the efficacy of using large language models as judges for French medical open-ended question answering, demonstrating that while performance varies by generator, lightweight adaptation of compact models via supervised fine-tuning and GRPO significantly improves alignment with expert annotations and reduces generator bias.

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils + 2 more · 2026-03-05 · 💬 cs.CL

Monitoring Emergent Reward Hacking During Generation via Internal Activations

This paper proposes an activation-based monitoring method using sparse autoencoders and linear classifiers to detect emergent reward-hacking behavior in fine-tuned large language models during generation, demonstrating that internal activation patterns provide reliable, early, and generalizable signals of misalignment that often precede or persist beyond final output analysis.

Patrick Wilhelm, Thorsten Wittkopp, Odej Kao · 2026-03-05 · 🤖 cs.AI

Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

This paper evaluates how the integration of Large Language Models into machine translation workflows affects the reliability of established source-side difficulty and candidate-side quality estimation paradigms, using a unique multi-candidate post-editing dataset to show that while LLMs alter the effectiveness of traditional prediction methods, they also mitigate prior challenges in document-level translation.

Malik Marmonier, Benoît Sagot, Rachel Bawden · 2026-03-05 · 💬 cs.CL

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

This paper demonstrates that while parameter-efficient reinforcement learning with verifiable rewards significantly improves a compact language model's performance on beam statics, it primarily induces anisotropic procedural template matching rather than robust, transferable physical reasoning, highlighting the need for structured scaffolding to achieve genuine scientific understanding.

Tarjei Paule Hage, Markus J. Buehler · 2026-03-05 · 🔬 cond-mat.mtrl-sci

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

The paper introduces Memex, an indexed experience memory system optimized via a reinforcement learning framework (MemexRL), which enables long-horizon LLM agents to maintain high decision quality by storing full-fidelity interactions externally and retrieving them on demand, thereby overcoming context-window limitations without the information loss inherent in traditional summarization methods.

Zhenting Wang, Huancheng Chen, Jiayun Wang + 1 more · 2026-03-05 · 🤖 cs.LG