Benchmarking Motivational Interviewing Competence of Large Language Models

This study benchmarks the Motivational Interviewing competence of ten large language models against human therapists using the MITI framework, finding that top-performing models achieve proficiency on real-world clinical transcripts and are difficult for psychiatrists to distinguish from human responses, suggesting their viability for expanding counseling in low-resource settings.

Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla + 3 more · 2026-03-05 · 💬 cs.CL

Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling

This paper addresses the limitations of hierarchical models in Rhetorical Role Labeling by proposing prototype-based methods that integrate local context with global semantic representations, introducing the new SCOTUS-Law dataset, and demonstrating consistent performance improvements across legal, medical, and scientific domains.

Anas Belfathi, Nicolas Hernandez, Laura Monceaux + 4 more · 2026-03-05 · 💬 cs.CL

From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures

This paper proposes a neuro-symbolic multi-agent system that leverages hypernym-hyponym semantic relations to extract information from Cyber Threat Intelligence reports and automatically generate CLIPS-based firewall rules, demonstrating superior performance in threat mitigation compared to baseline approaches.

Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero + 1 more · 2026-03-05 · 🤖 cs.AI

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

This paper evaluates the efficacy of using large language models as judges for French medical open-ended question answering, demonstrating that while performance varies by generator, lightweight adaptation of compact models via supervised fine-tuning and GRPO significantly improves alignment with expert annotations and reduces generator bias.

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils + 2 more · 2026-03-05 · 💬 cs.CL

Monitoring Emergent Reward Hacking During Generation via Internal Activations

This paper proposes an activation-based monitoring method using sparse autoencoders and linear classifiers to detect emergent reward-hacking behavior in fine-tuned large language models during generation, demonstrating that internal activation patterns provide reliable, early, and generalizable signals of misalignment that often precede or persist beyond final output analysis.

Patrick Wilhelm, Thorsten Wittkopp, Odej Kao · 2026-03-05 · 🤖 cs.AI

Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

This paper evaluates how the integration of Large Language Models into machine translation workflows affects the reliability of established source-side difficulty and candidate-side quality estimation paradigms, using a unique multi-candidate post-editing dataset to show that while LLMs alter the effectiveness of traditional prediction methods, they also mitigate prior challenges in document-level translation.

Malik Marmonier, Benoît Sagot, Rachel Bawden · 2026-03-05 · 💬 cs.CL

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

This paper demonstrates that while parameter-efficient reinforcement learning with verifiable rewards significantly improves a compact language model's performance on beam statics, it primarily induces anisotropic procedural template matching rather than robust, transferable physical reasoning, highlighting the need for structured scaffolding to achieve genuine scientific understanding.

Tarjei Paule Hage, Markus J. Buehler · 2026-03-05 · 🔬 cond-mat.mtrl-sci

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

The paper introduces Memex, an indexed experience memory system optimized via a reinforcement learning framework (MemexRL), which enables long-horizon LLM agents to maintain high decision quality by storing full-fidelity interactions externally and retrieving them on demand, thereby overcoming context-window limitations without the information loss inherent in traditional summarization methods.

Zhenting Wang, Huancheng Chen, Jiayun Wang + 1 more · 2026-03-05 · 🤖 cs.LG