Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

This paper demonstrates through controlled experiments that a human-in-the-loop approach significantly outperforms iterative chain-of-thought prompting in improving behavioral interview answer quality, offering superior gains in confidence and authenticity with fewer iterations by prioritizing context availability over computational resources.

Kewen Zhu, Zixi Liu, Yanjing Li · 2026-03-12 · cs.CL

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

This study evaluates the robustness and pedagogical safety of offline large language models for Turkish heritage language education using a custom anomaly suite, finding that reasoning-oriented models in the 8B–14B parameter range offer the optimal balance between cost and safety while demonstrating that anomaly resistance is not strictly dependent on model scale.

Edibe Yilmaz, Kahraman Kostas · 2026-03-12 · cs.CL

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

This study refutes the claim that newer AI models have lost empathy, demonstrating through clinical assessment that while empathetic responses remain statistically consistent across generations, users' perception of "lost empathy" actually stems from a significant shift toward heightened crisis detection and altered safety postures that make the models appear more intrusive during vulnerable moments.

Michael Keeman, Anastasia Keeman · 2026-03-12 · cs.CL

Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

This paper presents an automated evaluation framework using semantic and sentiment analysis to assess Mandarin-to-English machine translation across news and literary texts, revealing that while LLMs like GPT-4o and DeepSeek excel in news translation and semantic conservation, they still struggle with preserving cultural subtleties, classical references, and figurative expressions in literary works.

Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra · 2026-03-12 · cs.CL
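The semantic-conservation side of an evaluation framework like the one above can be sketched as a similarity score between sentence embeddings. This is a minimal illustration, not the authors' metric: the function names are hypothetical, and the embeddings are assumed to come from some sentence encoder applied to the source, the hypothesis translation, and a reference.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_conservation(src_embedding, hyp_embedding, ref_embedding):
    """Score a translation hypothesis by how closely its embedding
    tracks both the source text and a reference translation.
    (Illustrative averaging; the actual weighting is an assumption.)"""
    return 0.5 * (cosine_similarity(hyp_embedding, src_embedding)
                  + cosine_similarity(hyp_embedding, ref_embedding))
```

A score near 1.0 would indicate the hypothesis preserves the meaning of both source and reference; literary features such as classical allusions, which the abstract flags as a weakness, tend to depress exactly this kind of score.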

Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

This paper introduces LatamQA, a geographically informed sociocultural bias dataset of over 26,000 multilingual multiple-choice questions derived from Wikidata and Wikipedia, which reveals that current large language models exhibit significant performance disparities across Latin American countries, favoring Iberian Spanish culture and their original training languages.

Yannis Karmim (ALMAnaCH), Renato Pino (UCHILE), Hernan Contreras (UCHILE), Hernan Lira (CENIA), Sebastian Cifuentes (CENIA), Simon Escoffier (PUC), Luis Martí (UP4, ALPAGE), Djamé Seddah (UP4, ALPAGE), Valentin Barrière (UCHILE, CENIA) · 2026-03-12 · cs.CL
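Deriving multiple-choice questions from Wikidata facts, as LatamQA does, can be sketched as pairing a correct answer with geographically plausible distractors. This is a hypothetical construction, not the paper's pipeline; the function and field names are assumptions.

```python
import random

def build_mcq(entity, correct_country, distractor_countries, seed=0):
    """Turn a Wikidata-style (entity, country) fact into a
    multiple-choice question: the correct country plus up to three
    distractor countries, in shuffled order."""
    rng = random.Random(seed)  # fixed seed for reproducible shuffling
    options = [correct_country] + list(distractor_countries[:3])
    rng.shuffle(options)
    return {
        "question": f"Which country is {entity} associated with?",
        "options": options,
        "answer": options.index(correct_country),
    }
```

Scaled over thousands of entities and several languages, per-country accuracy on such items is what exposes the disparities the abstract reports.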

SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

This paper introduces SpreadsheetArena, a platform for evaluating large language models' end-to-end spreadsheet generation capabilities through blind pairwise comparisons, revealing that while models can produce functional workbooks, they often fail to align with domain-specific best practices and that user preferences vary significantly across different use cases.

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling · 2026-03-12 · cs.CL

SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

The paper introduces SENS-ASR, a streaming automatic speech recognition approach that improves transcription quality under low-latency constraints by injecting semantic information extracted from past frame-embeddings via a context module trained through knowledge distillation from a fine-tuned language model.

Youness Dkhissi (LIUM), Valentin Vielzeuf (LIUM), Elys Allesiardo (LIUM), Anthony Larcher (LIUM) · 2026-03-12 · cs.CL
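The knowledge-distillation objective for a context module like the one described above can be sketched as a regression loss pulling the module's semantic embeddings toward those of a teacher language model. This is a generic sketch under stated assumptions (a simple MSE target over frame-aligned embedding pairs), not the SENS-ASR training recipe.

```python
def distillation_loss(student_embs, teacher_embs):
    """Mean-squared error between the context module's semantic
    embeddings (student) and the fine-tuned language model's
    embeddings (teacher), averaged over all dimensions."""
    assert len(student_embs) == len(teacher_embs)
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_embs, teacher_embs):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count
```

Minimizing such a loss lets the streaming model inject teacher-like semantic context from past frames without running the full language model at inference time, which is how low latency is preserved.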

Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment

This paper introduces Personalized Group Relative Policy Optimization (P-GRPO), a novel framework that improves alignment with diverse individual preferences by decoupling advantage estimation from batch statistics and normalizing rewards against preference-group-specific histories, thereby overcoming the limitations of standard GRPO in handling heterogeneous user signals.

Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani · 2026-03-12 · cs.LG
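The core idea of normalizing rewards against preference-group-specific histories, rather than batch statistics, can be sketched as a running normalizer keyed by group. This is a minimal illustration of the decoupling the abstract describes, not the P-GRPO implementation; the class name, window size, and epsilon are assumptions.

```python
from collections import defaultdict, deque

class GroupRewardNormalizer:
    """Compute an advantage for each reward by standardizing it
    against a running history kept separately per preference group,
    so heterogeneous groups do not contaminate each other's baseline."""

    def __init__(self, window=1000):
        # One bounded history of recent rewards per group id.
        self.histories = defaultdict(lambda: deque(maxlen=window))

    def advantage(self, group_id, reward):
        hist = self.histories[group_id]
        hist.append(reward)
        mean = sum(hist) / len(hist)
        var = sum((r - mean) ** 2 for r in hist) / len(hist)
        std = var ** 0.5
        # Epsilon guards the first sample, where std is zero.
        return (reward - mean) / (std + 1e-8)
```

With per-batch normalization, a batch mixing two groups with different reward scales would give one group systematically negative advantages; keying the baseline by group removes that bias.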

Measuring and Eliminating Refusals in Military Large Language Models

This paper introduces a novel gold-standard dataset developed by US military veterans to quantify excessive safety refusals in military Large Language Models, demonstrating that while specialized fine-tuning can significantly reduce these refusals, achieving zero refusals and maximum accuracy requires deeper, end-to-end specialization.

Jack FitzGerald, Dylan Bates, Aristotelis Lazaridis, Aman Sharma, Vincent Lu, Brian King, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Joseph Madigan, Jeremy McLaurin, Luke Kerbs, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman · 2026-03-12 · cs.CL

Assessing Cognitive Biases in LLMs for Judicial Decision Support: Virtuous Victim and Halo Effects

This study evaluates five large language models for judicial sentencing support and finds that while they exhibit a stronger virtuous victim effect and lack a significant penalty for adjacent consent compared to humans, they generally demonstrate reduced prestige-based halo effects, particularly regarding credentials, though current variability still limits their immediate deployment in legal settings.

Sierra S. Liu · 2026-03-12 · cs

Defining AI Models and AI Systems: A Framework to Resolve the Boundary Problem

This paper addresses the regulatory ambiguity surrounding "AI models" and "AI systems" by proposing clear conceptual and operational definitions that distinguish trained parameters from broader system components, thereby facilitating the precise allocation of obligations across the AI value chain.

Yuanyuan Sun, Timothy Parker, Lara Gierschmann, Sana Shams, Teo Canmetin, Mathieu Duteil, Rokas Gipiškis, Ze Shen Chin · 2026-03-12 · cs.AI

A Governance and Evaluation Framework for Deterministic, Rule-Based Clinical Decision Support in Empiric Antibiotic Prescribing

This paper proposes a governance and evaluation framework for deterministic, rule-based clinical decision support systems in empiric antibiotic prescribing that prioritizes transparency, auditability, and conservative behavior by formally separating decision logic from scope constraints and utilizing synthetic case validation to ensure behavioral alignment with predefined rules.

Francisco José Gárate, Paloma Chausa, Diego Moreno, Judit López Luque, Vicens Díaz-Brito, Enrique Javier Gómez · 2026-03-12 · cs.AI