Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

This paper provides a theoretical framework explaining how Large Language Models achieve semantic prompt comprehension, In-Context Learning, and Chain-of-Thought reasoning: respectively, by inferring transition probabilities, reducing prompt ambiguity, and decomposing complex tasks into simpler sub-problems. The framework offers novel insights into the statistical superiority of advanced prompt engineering techniques.

Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi, Defeng Sun · Thu, 12 Ma · cs.CL

Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

This paper introduces LatamQA, a geographically informed sociocultural bias dataset of over 26,000 multilingual multiple-choice questions derived from Wikidata and Wikipedia, which reveals that current large language models exhibit significant performance disparities across Latin American countries, favoring Iberian Spanish culture and their original training languages.

Yannis Karmim (ALMAnaCH), Renato Pino (UCHILE), Hernan Contreras (UCHILE), Hernan Lira (CENIA), Sebastian Cifuentes (CENIA), Simon Escoffier (PUC), Luis Martí (UP4, ALPAGE), Djamé Seddah (UP4, ALPAGE), Valentin Barrière (UCHILE, CENIA) · Thu, 12 Ma · cs.CL

SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

This paper introduces SpreadsheetArena, a platform for evaluating large language models' end-to-end spreadsheet generation capabilities through blind pairwise comparisons, revealing that while models can produce functional workbooks, they often fail to align with domain-specific best practices and that user preferences vary significantly across different use cases.

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling · Thu, 12 Ma · cs.CL

Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

This study demonstrates that fine-tuning specialized language models on domain-specific clinical data significantly outperforms prompting-based approaches for detecting biased language, highlighting the critical need for specialty-specific adaptation to accurately capture context-dependent semantic shifts in medical documentation.

Isotta Landi, Eugenia Alleva, Nicole Bussola, Rebecca M. Cohen, Sarah Nowlin, Leslee J. Shaw, Alexander W. Charney, Kimberly B. Glazer · Thu, 12 Ma · cs.CL

SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

The paper introduces SENS-ASR, a streaming automatic speech recognition approach that improves transcription quality under low-latency constraints by injecting semantic information extracted from past frame embeddings via a context module trained through knowledge distillation from a fine-tuned language model.

Youness Dkhissi (LIUM), Valentin Vielzeuf (LIUM), Elys Allesiardo (LIUM), Anthony Larcher (LIUM) · Thu, 12 Ma · cs.CL

Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

This study introduces TOBA-LM, a 1.2-billion-parameter trilingual language model for Indonesian, Batak, and Minangkabau that integrates an adaptive Engram Memory mechanism to achieve significantly faster training convergence and reduced computational costs compared to conventional transformer architectures.

Hokky Situngkir, Kevin Siringoringo, Andhika Bernard Lumbantobing · Thu, 12 Ma · cs.CL

GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

The GATech team's approach to the AbjadGenEval shared task used a fine-tuned multilingual E5-large encoder with simple mean pooling to achieve an F1 score of 0.75 for detecting AI-generated Arabic text, finding that this stable baseline outperformed more complex pooling strategies, likely due to data limitations and a distinct length difference between human-written and machine-generated texts.

Ahmed Khaled Khamis · Thu, 12 Ma · cs.CL

Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment

This paper introduces Personalized Group Relative Policy Optimization (P-GRPO), a novel framework that improves alignment with diverse individual preferences by decoupling advantage estimation from batch statistics and normalizing rewards against preference-group-specific histories, thereby overcoming the limitations of standard GRPO in handling heterogeneous user signals.

Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani · Thu, 12 Ma · cs.LG

Measuring and Eliminating Refusals in Military Large Language Models

This paper introduces a novel gold-standard dataset developed by US military veterans to quantify excessive safety refusals in military Large Language Models, demonstrating that while specialized fine-tuning can significantly reduce these refusals, achieving zero refusals and maximum accuracy requires deeper, end-to-end specialization.

Jack FitzGerald, Dylan Bates, Aristotelis Lazaridis, Aman Sharma, Vincent Lu, Brian King, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Joseph Madigan, Jeremy McLaurin, Luke Kerbs, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman · Thu, 12 Ma · cs.CL

A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

This paper presents GCSD, a principle-driven adaptive policy system that leverages a large-scale real-world and simulated dataset, along with four specialized modules, to overcome the limitations of existing LLMs in delivering scalable, personalized, and therapeutically effective group cognitive stimulation dialogues for elderly people with cognitive impairment.

Jiyue Jiang, Yanyu Chen, Pengan Chen, Kai Liu, Jingqi Zhou, Zheyong Zhu, He Hu, Fei Ma, Qi Tian, Chuan Wu · Thu, 12 Ma · cs.CL

TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

The paper introduces TriageSim, a framework that generates controlled, multi-turn synthetic nurse-patient triage conversations and audio from structured electronic health records, which are validated for linguistic and medical fidelity and demonstrated to support conversational triage classification.

Dipankar Srirag, Quoc Dung Nguyen, Aditya Joshi, Padmanesan Narasimhan, Salil Kanhere · Thu, 12 Ma · cs.CL