Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

This paper experimentally evaluates whether Large Language Models summarize well-known books better using their internal training knowledge or by processing the full text, revealing that while full-text access generally yields more detailed summaries, internal knowledge can sometimes outperform it, thereby challenging the reliability of long-context summarization.

Tairan Fu, Javier Conde, Pedro Reviriego, Javier Coronado-Blázquez, Nina Melero, Elena Merino-Gómez · Thu, 12 Ma · cs.CL

Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

This paper presents a pipeline that bridges mechanistic interpretability and natural language explanations by identifying causally important attention heads in GPT-2 Small, generating high-quality explanations via LLMs, and evaluating their faithfulness to reveal that while explanations can be sufficient, they often lack comprehensiveness due to distributed backup mechanisms.

Ajay Pravin Mahale · Thu, 12 Ma · cs.CL

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

The paper introduces the System Hallucination Scale (SHS), a lightweight, human-centered psychometric instrument validated by 210 participants to rapidly evaluate Large Language Models' hallucination-related behaviors from a user perspective, distinct from automatic detection metrics.

Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger · Thu, 12 Ma · cs.CL

PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

This paper introduces PoultryLeX-Net, a domain-adaptive dual-stream transformer framework enhanced with poultry-specific lexicons and topic modeling, which significantly outperforms existing baseline models in accurately analyzing stakeholder sentiment within the global poultry industry.

Stephen Afrifa, Biswash Khatiwada, Kapalik Khanal, Sanjay Shah, Lingjuan Wang-Li, Ramesh Bahadur Bist · Thu, 12 Ma · cs.CL

TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

This paper introduces TAMUSA-Chat, a research-oriented framework that enables academic institutions to build responsible, domain-adapted conversational AI systems through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation, while providing a publicly available codebase to support reproducible experimentation and ethical deployment.

Izzat Alsmadi, Anas Alsobeh · Thu, 12 Ma · cs.CL

CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

This paper introduces the Contextual Emotional Inference (CEI) Benchmark, a dataset of 300 human-validated scenarios designed to evaluate large language models' ability to infer intended meaning beyond literal semantics by navigating ambiguous utterances across diverse power dynamics and pragmatic subtypes.

Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko, Abhigya Koirala, Andre McCloud, Gwen Eisenbeis, Wisdom Akanwe, Moustapha Gassama, Eliezer Gonzalez Chirinos, Anne-Duncan Enright, Peter Dunson, Tiffanie Ng, Anna von Rosenstiel, Godwin Idowu · Thu, 12 Ma · cs.CL

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

This paper demonstrates through controlled experiments that a human-in-the-loop approach significantly outperforms iterative chain-of-thought prompting in improving behavioral interview answer quality, offering superior gains in confidence and authenticity with fewer iterations by prioritizing context availability over computational resources.

Kewen Zhu, Zixi Liu, Yanjing Li · Thu, 12 Ma · cs.CL

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

This study evaluates the robustness and pedagogical safety of offline large language models for Turkish heritage language education using a custom anomaly suite, finding that reasoning-oriented models in the 8B–14B parameter range offer the optimal balance between cost and safety while demonstrating that anomaly resistance is not strictly dependent on model scale.

Edibe Yilmaz, Kahraman Kostas · Thu, 12 Ma · cs.CL

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

This study refutes the claim that newer AI models have lost empathy: clinical assessment shows that empathetic responses remain statistically consistent across GPT generations, and that users' perception of "lost empathy" instead stems from a significant shift toward heightened crisis detection and altered safety postures, which make the models seem more intrusive during vulnerable moments.

Michael Keeman, Anastasia Keeman · Thu, 12 Ma · cs.CL

Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

This paper presents an automated evaluation framework based on semantic and sentiment analysis for assessing Mandarin-to-English machine translation across news and literary texts, revealing that while LLMs such as GPT-4o and DeepSeek excel at news translation and semantic preservation, they still struggle to preserve cultural subtleties, classical references, and figurative expressions in literary works.

Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra · Thu, 12 Ma · cs.CL
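Semantic-similarity scoring of this kind usually compares a reference translation against a candidate in vector space. As a toy illustration only (not the paper's actual pipeline, which would use learned sentence embeddings rather than word counts), a bag-of-words cosine similarity can be sketched as:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Compare a hypothetical reference translation against a candidate.
reference = "the cat sat on the mat"
candidate = "a cat sat on the mat"
print(round(cosine_similarity(reference, candidate), 3))  # → 0.866
```

A real evaluation framework would replace the word-count vectors with contextual sentence embeddings, but the scoring step (cosine similarity between reference and candidate vectors) has the same shape.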