Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

This study evaluates small open-source language models for clinical question answering on consumer hardware. It finds that while Llama 3.2 offers the best balance of accuracy and reliability, high prompt consistency does not guarantee correctness, and certain prompt styles, as well as domain-specific pretraining without instruction tuning, can severely degrade performance.

Shravani Hariprasad · 2026-03-05 · 🤖 cs.AI

How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

This study presents a large-scale audit of 10 commercial LLMs, revealing significant variation in citation hallucination rates across models and domains. It demonstrates that prompt-induced fabrication can be effectively mitigated through multi-model consensus, within-prompt repetition, and a lightweight bibliographic classifier that detects phantom citations without external database queries.

MZ Naser · 2026-03-05 · 💬 cs.CL