Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
This paper presents a case study on meta-evaluating long-form QA benchmarks using ScholarQA-CS2. It finds that while human pairwise preferences work well for system-level comparisons, they fall short for nuanced metric-level assessment; addressing this gap requires expert annotators and explicit annotations that mitigate subjectivity and raise evaluation standards for deep-research systems.
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman · 2026-03-10 · cs.CL