Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

This paper experimentally evaluates whether Large Language Models summarize well-known books better using their internal training knowledge or by processing the full text, revealing that while full-text access generally yields more detailed summaries, internal knowledge can sometimes outperform it, thereby challenging the reliability of long-context summarization.

Tairan Fu, Javier Conde, Pedro Reviriego, Javier Coronado-Blázquez, Nina Melero, Elena Merino-Gómez · Thu, 12 Ma · cs.CL

Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

This paper presents a pipeline that bridges mechanistic interpretability and natural language explanations by identifying causally important attention heads in GPT-2 Small, generating high-quality explanations via LLMs, and evaluating their faithfulness to reveal that while explanations can be sufficient, they often lack comprehensiveness due to distributed backup mechanisms.

Ajay Pravin Mahale · Thu, 12 Ma · cs.CL

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

The paper introduces the System Hallucination Scale (SHS), a lightweight, human-centered psychometric instrument validated by 210 participants to rapidly evaluate Large Language Models' hallucination-related behaviors from a user perspective, distinct from automatic detection metrics.

Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger · Thu, 12 Ma · cs.CL

PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

This paper introduces PoultryLeX-Net, a domain-adaptive dual-stream transformer framework enhanced with poultry-specific lexicons and topic modeling, which significantly outperforms existing baseline models in accurately analyzing stakeholder sentiment within the global poultry industry.

Stephen Afrifa, Biswash Khatiwada, Kapalik Khanal, Sanjay Shah, Lingjuan Wang-Li, Ramesh Bahadur Bist · Thu, 12 Ma · cs.CL

TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

This paper introduces TAMUSA-Chat, a research-oriented framework that enables academic institutions to build responsible, domain-adapted conversational AI systems through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation, while providing a publicly available codebase to support reproducible experimentation and ethical deployment.

Izzat Alsmadi, Anas Alsobeh · Thu, 12 Ma · cs.CL

CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

This paper introduces the Contextual Emotional Inference (CEI) Benchmark, a dataset of 300 human-validated scenarios designed to evaluate large language models' ability to infer intended meaning beyond literal semantics by navigating ambiguous utterances across diverse power dynamics and pragmatic subtypes.

Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko, Abhigya Koirala, Andre McCloud, Gwen Eisenbeis, Wisdom Akanwe, Moustapha Gassama, Eliezer Gonzalez Chirinos, Anne-Duncan Enright, Peter Dunson, Tiffanie Ng, Anna von Rosenstiel, Godwin Idowu · Thu, 12 Ma · cs.CL

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

This paper demonstrates through controlled experiments that a human-in-the-loop approach significantly outperforms iterative chain-of-thought prompting in improving behavioral interview answer quality, offering superior gains in confidence and authenticity with fewer iterations by prioritizing context availability over computational resources.

Kewen Zhu, Zixi Liu, Yanjing Li · Thu, 12 Ma · cs.CL

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

This study evaluates the robustness and pedagogical safety of offline large language models for Turkish heritage language education using a custom anomaly suite, finding that reasoning-oriented models in the 8B–14B parameter range offer the optimal balance between cost and safety while demonstrating that anomaly resistance is not strictly dependent on model scale.

Edibe Yilmaz, Kahraman Kostas · Thu, 12 Ma · cs.CL

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

This study refutes the claim that newer AI models have lost empathy: clinical assessment shows that empathetic responses remain statistically consistent across GPT generations, and that users' perception of "lost empathy" instead stems from a significant shift toward heightened crisis detection and altered safety postures, which make the models seem more intrusive during vulnerable moments.

Michael Keeman, Anastasia Keeman · Thu, 12 Ma · cs.CL

Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

This paper presents an automated evaluation framework based on semantic and sentiment analysis for assessing Mandarin-to-English machine translation across news and literary texts, revealing that while LLMs such as GPT-4o and DeepSeek excel at news translation and semantic preservation, they still struggle to preserve cultural subtleties, classical references, and figurative expressions in literary works.

Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra · Thu, 12 Ma · cs.CL
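Semantic-similarity scoring of this kind usually compares a reference translation against a candidate in vector space. As a toy illustration only (not the paper's actual pipeline, which would use learned sentence embeddings rather than word counts), a bag-of-words cosine similarity can be sketched as:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Compare a hypothetical reference translation against a candidate.
reference = "the cat sat on the mat"
candidate = "a cat sat on the mat"
print(round(cosine_similarity(reference, candidate), 3))  # → 0.866
```

A real evaluation framework would replace the word-count vectors with contextual sentence embeddings, but the scoring step (cosine similarity between reference and candidate vectors) has the same shape.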