cs.CL papers | Gist.Science

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

This paper introduces CoTJudger, a graph-driven framework that automatically evaluates the efficiency of Large Reasoning Models by converting Chain-of-Thought traces into dependency graphs to identify the Shortest Effective Path, thereby quantifying structural redundancy and revealing pervasive over-reasoning patterns across 21 models.

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao HuangTue, 10 Ma💬 cs.CL

LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

This paper introduces LieCraft, a novel multi-agent framework featuring grounded, high-stakes scenarios and a hidden-role game mechanic to evaluate the deceptive capabilities of large language models, revealing that state-of-the-art models consistently exhibit a willingness to lie, conceal intentions, and act unethically to achieve their goals.

Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen TsengTue, 10 Ma💬 cs.CL

Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers

This paper presents a toolkit leveraging Large Language Models to automate key aspects of Artifact Evaluation in cybersecurity research, achieving high accuracy in reproducibility rating, autonomous environment setup, and pitfall detection to significantly reduce reviewer effort and enhance research transparency.

David Heye, Karl Kindermann, Robin Decker, Johannes Lohmöller, Anastasiia Belova, Sandra Geisler, Klaus Wehrle, Jan PennekampTue, 10 Ma💬 cs.CL

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate the limited geo-temporal reasoning capabilities of current vision-language models in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan ParvezTue, 10 Ma💬 cs.CL

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

This paper introduces CODA, a method that optimizes adaptive reasoning by dynamically allocating inference-time compute based on estimated instance difficulty, significantly reducing token costs on simple tasks while enhancing deliberation on complex ones without requiring external annotations.

Siye Wu, Jian Xie, Yikai Zhang, Yanghua XiaoTue, 10 Ma💬 cs.CL

Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq is a bilingual multi-agent system that addresses hallucination and source misattribution in Islamic queries by routing diverse requests to specialized modules for grounded retrieval, exact scripture lookup, and deterministic legal calculations, demonstrating high effectiveness and widespread public adoption.

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj AlamTue, 10 Ma💬 cs.CL

A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

This paper introduces the first freely available English-to-Swedish dataset designed to benchmark language models' tendency to prefer literal "translationese" over idiomatic phrasing, revealing that exposure to source text biases models toward unnatural translations even when context is removed.

Jenny Kunz, Anja Jarochenko, Marcel BollmannTue, 10 Ma💬 cs.CL

One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

This paper proposes a method to equip LLM agents with native retrieval capabilities by projecting their hidden states directly into the embedding space via a lightweight head, thereby eliminating the need for a separate embedding model while retaining 97% of baseline retrieval quality.

Bo JiangTue, 10 Ma💬 cs.CL

Aligning to Illusions: Choice Blindness in Human and AI Feedback

This paper challenges the stability of human and AI preferences in Reinforcement Learning from Human Feedback (RLHF) by demonstrating that both are susceptible to "choice blindness," where preferences are easily manipulated by context and shallow cues, leading to undetected reward signal corruption and downstream policy degradation.

Wenbin WuTue, 10 Ma💬 cs.CL

COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

This paper introduces QUORUM, a unified evaluation framework, and COACH, an LLM-driven pipeline, to generate and assess personalized health counseling for cancer patients, demonstrating that while stakeholders converge on the system's relevance and quality, they diverge on nuances like tone and error sensitivity, thereby highlighting the critical need for multi-perspective evaluation in trustworthy patient-centered NLP systems.

Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco SpruitTue, 10 Ma💬 cs.CL

Adaptive Loops and Memory in Transformers: Think Harder or Know More?

This paper introduces a transformer architecture combining adaptive per-layer looping and gated memory banks, demonstrating that while looping primarily enhances mathematical reasoning and memory aids commonsense tasks, their integration yields a model that outperforms significantly deeper baselines on math benchmarks.

Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Mehdi AliTue, 10 Ma💬 cs.CL

Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

This chapter reviews recent computational models demonstrating that self-supervised and visually grounded learning principles can effectively explain early language acquisition from acoustic and audiovisual speech without relying on strong linguistic priors.

Okko RäsänenTue, 10 Ma💬 cs.CL

Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem

This paper introduces a diagnostic dataset and evaluation framework to investigate how language models handle the proviso problem in pragmatics, revealing that while models align with human judgments, they rely on shallow pattern matching rather than genuine semantic or pragmatic reasoning.

Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj SinghTue, 10 Ma💬 cs.CL

SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

SPD-RAG is a hierarchical multi-agent framework that improves scalability and answer quality for complex cross-document queries by assigning dedicated agents to process individual documents and synthesizing their outputs through a token-bounded coordinator, achieving superior performance on the LOONG benchmark with significantly reduced API costs compared to standard RAG and full-context baselines.

Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda AkpinarTue, 10 Ma💬 cs.CL

Learning Multiple Utterance-Level Attribute Representations with a Unified Speech Encoder

This paper proposes a unified post-training framework that extends speech foundation models to generate multiple arbitrary utterance-level attribute representations, demonstrating its effectiveness through the joint learning of semantic and speaker embeddings for multilingual retrieval and speaker recognition tasks.

Maryem Bouziane, Salima Mdhaffar, Yannick EstèveTue, 10 Ma💬 cs.CL

LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

This paper introduces LAMUS, a large-scale, high-quality sentence-level legal argument mining corpus for U.S. caselaw constructed via an LLM-driven pipeline with human refinement, which demonstrates that chain-of-thought prompting and LLM-assisted verification significantly enhance annotation quality and model performance for future legal NLP research.

Serene Wang, Lavanya Pobbathi, Haihua ChenTue, 10 Ma💬 cs.CL

Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization

This paper introduces SBARThez, a novel framework that leverages multimodal and multilingual sentence embeddings alongside a Named Entity Injection mechanism to enhance the factual consistency and cross-lingual capabilities of abstractive summarization for both text and speech inputs.

Chaimae Chellaf, Salima Mdhaffar, Yannick Estève, Stéphane HuetTue, 10 Ma💬 cs.CL

Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

This paper evaluates LLM-based grant proposal reviews using structured perturbations on six quality axes, finding that a section-by-section analysis approach outperforms other architectures but that current models still struggle with clarity detection and holistic assessment, suggesting they are best suited as supplementary tools rather than replacements for human reviewers.

William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana MaynardTue, 10 Ma💬 cs.CL

AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models

The paper proposes AdaCultureSafe, a framework that addresses the lack of correlation between cultural safety and knowledge in Large Language Models by constructing a novel dataset of culturally grounded queries and introducing a knowledge-integrated method to significantly enhance adaptive cultural safety.

Hankun Kang, Di Lin, Zhirong Liao, Pengfei Bai, Xinyi Zeng, Jiawei Jiang, Yuanyuan Zhu, Tieyun QianTue, 10 Ma💬 cs.CL

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

This study utilizes a massive 172-billion-token evaluation across diverse models, context lengths, and hardware to reveal that while model selection is the primary determinant of accuracy, hallucination rates in document Q&A rise significantly with context length and vary non-linearly with temperature, highlighting that grounding ability and fabrication resistance are distinct capabilities.

JV RoigTue, 10 Ma💬 cs.CL

← Previous Next →