Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq is a bilingual multi-agent system that addresses hallucination and source misattribution in Islamic question answering by routing diverse requests to specialized modules for grounded retrieval, exact scripture lookup, and deterministic legal calculations; the authors report strong effectiveness and broad public adoption.

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam · Tue, 10 Ma · cs.CL
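
A minimal Python sketch of the routing pattern the abstract describes: one entry point classifies a query and dispatches it to a retrieval, quotation, or calculation handler. The intent labels, handler names, and keyword rules are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Answer:
    text: str
    sources: list

def grounded_retrieval(query: str) -> Answer:
    # Placeholder: retrieve passages from a vetted corpus and cite them.
    return Answer(text=f"[retrieved answer for: {query}]", sources=["corpus:..."])

def scripture_lookup(query: str) -> Answer:
    # Placeholder: exact verse/hadith lookup so the text is quoted, not generated.
    return Answer(text=f"[exact quotation for: {query}]", sources=["scripture:..."])

def legal_calculation(query: str) -> Answer:
    # Placeholder: deterministic computation (e.g. inheritance shares), no LLM involved.
    return Answer(text=f"[computed result for: {query}]", sources=["rule-based calculator"])

HANDLERS: Dict[str, Callable[[str], Answer]] = {
    "general": grounded_retrieval,
    "quotation": scripture_lookup,
    "calculation": legal_calculation,
}

def classify_intent(query: str) -> str:
    # Placeholder for a learned classifier; keyword rules keep the sketch runnable.
    if any(w in query.lower() for w in ("inheritance", "zakat", "share")):
        return "calculation"
    if any(w in query.lower() for w in ("verse", "surah", "hadith")):
        return "quotation"
    return "general"

def answer(query: str) -> Answer:
    return HANDLERS[classify_intent(query)](query)

print(answer("How is zakat on savings calculated?").text)
```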

Drift-to-Action Controllers: Budgeted Interventions with Online Risk Certificates

The paper introduces Drift2Act, a controller that reframes distribution drift monitoring as constrained decision-making by combining sensing with online risk certificates to dynamically select cost-effective interventions or safety-preserving escalations, thereby achieving near-zero safety violations and rapid recovery under realistic resource constraints.

Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh · Tue, 10 Ma · cs.LG
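
A small sketch of the "drift monitoring as constrained decision-making" idea from the abstract: compute an online upper confidence bound on the error rate and pick the cheapest intervention that is expected to restore safety within the remaining budget, falling back to escalation otherwise. The intervention table, costs, and the Hoeffding-style certificate are assumptions for illustration, not the paper's algorithm.

```python
import math

INTERVENTIONS = [  # (name, cost, expected post-intervention error)
    ("do_nothing", 0.0, None),
    ("recalibrate", 1.0, 0.08),
    ("partial_retrain", 5.0, 0.04),
    ("escalate_to_human", 2.0, 0.0),  # safety-preserving fallback
]

def risk_upper_bound(errors: list, delta: float = 0.05) -> float:
    """Hoeffding upper confidence bound on the current error rate."""
    n = len(errors)
    if n == 0:
        return 1.0
    mean = sum(errors) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def choose_action(errors, budget_left, risk_threshold=0.10):
    if risk_upper_bound(errors) <= risk_threshold:
        return "do_nothing"
    # Cheapest affordable intervention expected to bring risk back under threshold.
    candidates = [
        (cost, name) for name, cost, post_err in INTERVENTIONS
        if post_err is not None and post_err <= risk_threshold and cost <= budget_left
    ]
    return min(candidates)[1] if candidates else "escalate_to_human"

# Example: a burst of recent errors pushes the certified risk above threshold.
recent_errors = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(choose_action(recent_errors, budget_left=3.0))  # -> "recalibrate"
```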

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

The paper introduces OfficeQA Pro, a challenging enterprise benchmark using a massive corpus of U.S. Treasury Bulletins to demonstrate that current frontier AI agents struggle significantly with grounded, multi-document reasoning, achieving low accuracy even with direct document access and benefiting notably from structured document representations.

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen · Tue, 10 Ma · cs.CL
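
A schematic sketch of the evaluation loop a benchmark like this implies: each question carries a gold answer and the documents needed to support it, the agent under test returns an answer plus citations, and the harness scores exact-match accuracy and citation grounding. Field names, the stub agent, and the toy example are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    gold_answer: str
    relevant_docs: set

def stub_agent(question: str, corpus: dict) -> tuple[str, set]:
    # Placeholder for the agent under test; returns (answer, cited document ids).
    return "unknown", set()

def evaluate(examples, corpus):
    em, grounded = 0, 0
    for ex in examples:
        answer, cited = stub_agent(ex.question, corpus)
        em += int(answer.strip().lower() == ex.gold_answer.strip().lower())
        grounded += int(ex.relevant_docs <= cited)   # all gold documents were cited
    n = len(examples)
    return {"exact_match": em / n, "grounding": grounded / n}

corpus = {"bulletin_1999_q3": "…", "bulletin_2001_q1": "…"}
examples = [Example("Total X reported in FY1999?", "1.2 billion", {"bulletin_1999_q3"})]
print(evaluate(examples, corpus))
```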

How Far Can Unsupervised RLVR Scale LLM Training?

This paper provides a comprehensive theoretical and empirical analysis of unsupervised reinforcement learning with verifiable rewards (URLVR), revealing that intrinsic reward methods are fundamentally limited by a confidence-correctness alignment ceiling that causes model collapse, while suggesting that external rewards grounded in computational asymmetries may offer a scalable alternative.

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding · Tue, 10 Ma · cs.LG
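
A toy sketch of the intrinsic-reward setup the abstract discusses, assuming a self-consistency style reward: each sampled answer is rewarded by its agreement with the majority vote over the other samples. The numbers are made up to illustrate the confidence-correctness ceiling: once the model is confidently wrong, the verifier-free reward keeps reinforcing the wrong answer.

```python
from collections import Counter

def intrinsic_reward(samples: list[str]) -> dict[str, float]:
    """Reward each distinct answer by the fraction of samples that agree with it."""
    counts = Counter(samples)
    n = len(samples)
    return {ans: c / n for ans, c in counts.items()}

# Early in training: the correct answer "42" also happens to be the majority
# answer, so the intrinsic reward points in the right direction.
print(intrinsic_reward(["42", "42", "41", "42", "40"]))

# After collapse: the model is confident but wrong, and the intrinsic reward
# still favors the incorrect majority answer "41" -- confidence and correctness
# have decoupled, which only an external verifier would catch.
print(intrinsic_reward(["41", "41", "41", "41", "42"]))
```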

Mindstorms in Natural Language-Based Societies of Mind

This paper proposes Natural Language-Based Societies of Mind (NLSOMs), a modular framework where large multimodal neural networks communicate via natural language to solve complex AI tasks more effectively than single models, while also exploring the emerging social, economic, and structural challenges of scaling these heterogeneous societies to include billions of agents.

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanic, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber · Thu, 12 Ma · cs.CL
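
A minimal sketch of a natural-language "mindstorm" as described above: several agents exchange plain-text messages about a shared task over a few rounds, conditioning on the growing transcript. The stub agents below stand in for the large multimodal models in the paper; names and round counts are assumptions.

```python
from typing import Callable, List

Agent = Callable[[str, List[str]], str]

def make_stub_agent(name: str) -> Agent:
    def respond(task: str, transcript: List[str]) -> str:
        # Placeholder: a real agent would be an LLM/VLM call conditioned on
        # the task and the shared natural-language transcript.
        return f"{name}: my view on '{task}' given {len(transcript)} prior messages"
    return respond

def mindstorm(task: str, agents: List[Agent], rounds: int = 2) -> List[str]:
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            transcript.append(agent(task, transcript))
    return transcript

society = [make_stub_agent(n) for n in ("vision_expert", "planner", "critic")]
for message in mindstorm("caption this image", society):
    print(message)
```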

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

The paper introduces EoRA, a fine-tuning-free method that utilizes eigenspace low-rank approximation and an optimized CUDA kernel to significantly recover the accuracy of compressed LLMs while offering flexible trade-offs between performance and computational overhead.

Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen · Thu, 12 Ma · cs.CL
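
A numpy sketch of the general idea: approximate the compression residual W - W_compressed with a low-rank correction so that W_compressed + A @ B is closer to the original weights. This uses a plain truncated SVD of the residual; EoRA's actual eigenspace projection (which weights the residual by activation statistics) and its optimized CUDA kernel are beyond this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W_compressed = np.round(W * 4) / 4          # stand-in for quantization/pruning

residual = W - W_compressed
U, S, Vt = np.linalg.svd(residual, full_matrices=False)

rank = 16                                    # rank sets the accuracy/overhead trade-off
A = U[:, :rank] * S[:rank]                   # (256, 16)
B = Vt[:rank, :]                             # (16, 256)

err_before = np.linalg.norm(residual)
err_after = np.linalg.norm(W - (W_compressed + A @ B))
print(f"residual norm: {err_before:.3f} -> {err_after:.3f} with rank-{rank} compensation")
```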

ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

This paper introduces the ThinkPatterns-21k dataset to systematically analyze how different thinking patterns affect Large Language Models, revealing that while unstructured monologues benefit models of all sizes, structured thinking aids smaller models but can degrade the performance of larger ones.

Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo · Thu, 12 Ma · cs.CL
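
An illustrative sketch of how the same (question, answer) pair can be wrapped with different thinking patterns before supervised fine-tuning, contrasting an unstructured monologue with a structured decomposition. The pattern names and templates are assumptions for illustration, not the dataset's exact schema.

```python
def unstructured_monologue(question: str, answer: str) -> str:
    return (f"Question: {question}\n"
            f"<think>Let me reason about this freely...</think>\n"
            f"Answer: {answer}")

def structured_decomposition(question: str, answer: str) -> str:
    return (f"Question: {question}\n"
            f"<think>\n"
            f"1. Break the problem into sub-problems.\n"
            f"2. Solve each sub-problem.\n"
            f"3. Combine the partial results.\n"
            f"</think>\n"
            f"Answer: {answer}")

PATTERNS = {"monologue": unstructured_monologue, "decomposition": structured_decomposition}

for name, template in PATTERNS.items():
    print(f"--- {name} ---")
    print(template("What is 17 * 24?", "408"))
```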

BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

This paper introduces "BiasCause," a framework and benchmark of 1,788 manually validated questions designed to evaluate how large language models employ causal reasoning when addressing social biases, revealing that models frequently exhibit biased or "mistaken-biased" reasoning while also identifying specific strategies they use to avoid such biases.

Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda · Thu, 12 Ma · cs.CL
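
A small sketch of the kind of scoring such a benchmark implies: each model response to a bias-probing question is assigned a reasoning category (unbiased, biased, mistaken-biased, avoided) and the categories are tallied. The judge below is a keyword stub standing in for the manual or model-based judgment; the category names mirror the abstract but the rest is assumed.

```python
from collections import Counter

def judge(response: str) -> str:
    # Placeholder for the human/LLM judgment that assigns one of:
    # "unbiased", "biased", "mistaken_biased", "avoided".
    return "unbiased" if "depends on the individual" in response else "biased"

responses = [
    "It depends on the individual, not the group.",
    "Group X is naturally better at this task.",
]
print(Counter(judge(r) for r in responses))
```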

AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

This paper introduces AgentA/B, a novel system that leverages autonomous LLM agents with diverse personas to automatically simulate scalable, interactive user behaviors for web A/B testing, effectively addressing the limitations of traditional methods by emulating human-like interactions without relying on large-scale live traffic.

Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headden, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang, Dakuo Wang · Thu, 12 Ma · cs.CL
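
A toy sketch of the simulation loop the abstract describes: agents with different personas visit one of two page variants, and per-variant behavior is aggregated as in a conventional A/B test. The persona list, the conversion model, and the metric are illustrative assumptions; in the paper the sessions are driven by interactive LLM agents rather than fixed probabilities.

```python
import random

random.seed(7)
PERSONAS = ["bargain_hunter", "brand_loyalist", "first_time_visitor"]

def simulate_session(persona: str, variant: str) -> bool:
    """Placeholder for an LLM agent browsing the page; returns whether it 'converted'."""
    base = {"bargain_hunter": 0.30, "brand_loyalist": 0.50, "first_time_visitor": 0.15}[persona]
    uplift = 0.10 if variant == "B" else 0.0   # assume variant B surfaces discounts
    return random.random() < base + uplift

def run_experiment(n_sessions: int = 1000) -> dict:
    results = {"A": [], "B": []}
    for _ in range(n_sessions):
        persona = random.choice(PERSONAS)
        variant = random.choice(["A", "B"])
        results[variant].append(simulate_session(persona, variant))
    return {v: sum(r) / len(r) for v, r in results.items()}

print(run_experiment())  # e.g. conversion rate per variant, {'A': ..., 'B': ...}
```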

Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

This systematic review introduces the emerging interdisciplinary field of LLM Psychometrics, which applies psychometric theories and instruments to develop comprehensive evaluation frameworks for measuring human-like psychological constructs in large language models, ultimately guiding the creation of more robust, human-centered AI systems.

Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song · Thu, 12 Ma · cs.CL
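
A minimal sketch of the basic workflow this line of work surveys: administer a psychometric instrument to a model item by item and aggregate the responses into a construct score, handling reverse-keyed items. The two items, the 1-5 scale, and the stub "model" are illustrative, not drawn from any validated inventory.

```python
ITEMS = [
    # (trait, prompt, reverse_keyed)
    ("extraversion", "I enjoy starting conversations. (1-5)", False),
    ("extraversion", "I prefer to stay in the background. (1-5)", True),
]

def ask_model(prompt: str) -> int:
    # Placeholder for an LLM call that returns a 1-5 Likert rating.
    return {"I enjoy starting conversations. (1-5)": 4,
            "I prefer to stay in the background. (1-5)": 2}[prompt]

def score(items) -> dict:
    totals, counts = {}, {}
    for trait, prompt, reverse in items:
        rating = ask_model(prompt)
        if reverse:
            rating = 6 - rating          # reverse-key on a 1-5 scale
        totals[trait] = totals.get(trait, 0) + rating
        counts[trait] = counts.get(trait, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

print(score(ITEMS))  # e.g. {'extraversion': 4.0}
```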