Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq is a bilingual multi-agent system that addresses hallucination and source misattribution in Islamic question answering by routing diverse requests to specialized modules for grounded retrieval, exact scripture lookup, and deterministic legal calculations; the authors report strong effectiveness and broad public adoption.

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam · Tue, 10 Ma · cs.CL
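
A minimal Python sketch of the routing pattern the abstract describes: one entry point classifies a query and dispatches it to a retrieval, quotation, or calculation handler. The intent labels, handler names, and keyword rules are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Answer:
    text: str
    sources: list

def grounded_retrieval(query: str) -> Answer:
    # Placeholder: retrieve passages from a vetted corpus and cite them.
    return Answer(text=f"[retrieved answer for: {query}]", sources=["corpus:..."])

def scripture_lookup(query: str) -> Answer:
    # Placeholder: exact verse/hadith lookup so the text is quoted, not generated.
    return Answer(text=f"[exact quotation for: {query}]", sources=["scripture:..."])

def legal_calculation(query: str) -> Answer:
    # Placeholder: deterministic computation (e.g. inheritance shares), no LLM involved.
    return Answer(text=f"[computed result for: {query}]", sources=["rule-based calculator"])

HANDLERS: Dict[str, Callable[[str], Answer]] = {
    "general": grounded_retrieval,
    "quotation": scripture_lookup,
    "calculation": legal_calculation,
}

def classify_intent(query: str) -> str:
    # Placeholder for a learned classifier; keyword rules keep the sketch runnable.
    if any(w in query.lower() for w in ("inheritance", "zakat", "share")):
        return "calculation"
    if any(w in query.lower() for w in ("verse", "surah", "hadith")):
        return "quotation"
    return "general"

def answer(query: str) -> Answer:
    return HANDLERS[classify_intent(query)](query)

print(answer("How is zakat on savings calculated?").text)
```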

Drift-to-Action Controllers: Budgeted Interventions with Online Risk Certificates

The paper introduces Drift2Act, a controller that reframes distribution drift monitoring as constrained decision-making by combining sensing with online risk certificates to dynamically select cost-effective interventions or safety-preserving escalations, thereby achieving near-zero safety violations and rapid recovery under realistic resource constraints.

Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh · Tue, 10 Ma · cs.LG
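
A small sketch of the "drift monitoring as constrained decision-making" idea from the abstract: compute an online upper confidence bound on the error rate and pick the cheapest intervention that is expected to restore safety within the remaining budget, falling back to escalation otherwise. The intervention table, costs, and the Hoeffding-style certificate are assumptions for illustration, not the paper's algorithm.

```python
import math

INTERVENTIONS = [  # (name, cost, expected post-intervention error)
    ("do_nothing", 0.0, None),
    ("recalibrate", 1.0, 0.08),
    ("partial_retrain", 5.0, 0.04),
    ("escalate_to_human", 2.0, 0.0),  # safety-preserving fallback
]

def risk_upper_bound(errors: list, delta: float = 0.05) -> float:
    """Hoeffding upper confidence bound on the current error rate."""
    n = len(errors)
    if n == 0:
        return 1.0
    mean = sum(errors) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def choose_action(errors, budget_left, risk_threshold=0.10):
    if risk_upper_bound(errors) <= risk_threshold:
        return "do_nothing"
    # Cheapest affordable intervention expected to bring risk back under threshold.
    candidates = [
        (cost, name) for name, cost, post_err in INTERVENTIONS
        if post_err is not None and post_err <= risk_threshold and cost <= budget_left
    ]
    return min(candidates)[1] if candidates else "escalate_to_human"

# Example: a burst of recent errors pushes the certified risk above threshold.
recent_errors = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(choose_action(recent_errors, budget_left=3.0))  # -> "recalibrate"
```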

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

The paper introduces OfficeQA Pro, a challenging enterprise benchmark using a massive corpus of U.S. Treasury Bulletins to demonstrate that current frontier AI agents struggle significantly with grounded, multi-document reasoning, achieving low accuracy even with direct document access and benefiting notably from structured document representations.

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen · Tue, 10 Ma · cs.CL
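
A schematic sketch of the evaluation loop a benchmark like this implies: each question carries a gold answer and the documents needed to support it, the agent under test returns an answer plus citations, and the harness scores exact-match accuracy and citation grounding. Field names, the stub agent, and the toy example are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    gold_answer: str
    relevant_docs: set

def stub_agent(question: str, corpus: dict) -> tuple[str, set]:
    # Placeholder for the agent under test; returns (answer, cited document ids).
    return "unknown", set()

def evaluate(examples, corpus):
    em, grounded = 0, 0
    for ex in examples:
        answer, cited = stub_agent(ex.question, corpus)
        em += int(answer.strip().lower() == ex.gold_answer.strip().lower())
        grounded += int(ex.relevant_docs <= cited)   # all gold documents were cited
    n = len(examples)
    return {"exact_match": em / n, "grounding": grounded / n}

corpus = {"bulletin_1999_q3": "…", "bulletin_2001_q1": "…"}
examples = [Example("Total X reported in FY1999?", "1.2 billion", {"bulletin_1999_q3"})]
print(evaluate(examples, corpus))
```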

How Far Can Unsupervised RLVR Scale LLM Training?

This paper provides a comprehensive theoretical and empirical analysis of unsupervised reinforcement learning with verifiable rewards (URLVR), revealing that intrinsic reward methods are fundamentally limited by a confidence-correctness alignment ceiling that causes model collapse, while suggesting that external rewards grounded in computational asymmetries may offer a scalable alternative.

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding · Tue, 10 Ma · cs.LG
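
A toy sketch of the intrinsic-reward setup the abstract discusses, assuming a self-consistency style reward: each sampled answer is rewarded by its agreement with the majority vote over the other samples. The numbers are made up to illustrate the confidence-correctness ceiling: once the model is confidently wrong, the verifier-free reward keeps reinforcing the wrong answer.

```python
from collections import Counter

def intrinsic_reward(samples: list[str]) -> dict[str, float]:
    """Reward each distinct answer by the fraction of samples that agree with it."""
    counts = Counter(samples)
    n = len(samples)
    return {ans: c / n for ans, c in counts.items()}

# Early in training: the correct answer "42" also happens to be the majority
# answer, so the intrinsic reward points in the right direction.
print(intrinsic_reward(["42", "42", "41", "42", "40"]))

# After collapse: the model is confident but wrong, and the intrinsic reward
# still favors the incorrect majority answer "41" -- confidence and correctness
# have decoupled, which only an external verifier would catch.
print(intrinsic_reward(["41", "41", "41", "41", "42"]))
```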

Mindstorms in Natural Language-Based Societies of Mind

This paper proposes Natural Language-Based Societies of Mind (NLSOMs), a modular framework where large multimodal neural networks communicate via natural language to solve complex AI tasks more effectively than single models, while also exploring the emerging social, economic, and structural challenges of scaling these heterogeneous societies to include billions of agents.

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanic, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber · Thu, 12 Ma · cs.CL
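
A minimal sketch of a natural-language "mindstorm" as described above: several agents exchange plain-text messages about a shared task over a few rounds, conditioning on the growing transcript. The stub agents below stand in for the large multimodal models in the paper; names and round counts are assumptions.

```python
from typing import Callable, List

Agent = Callable[[str, List[str]], str]

def make_stub_agent(name: str) -> Agent:
    def respond(task: str, transcript: List[str]) -> str:
        # Placeholder: a real agent would be an LLM/VLM call conditioned on
        # the task and the shared natural-language transcript.
        return f"{name}: my view on '{task}' given {len(transcript)} prior messages"
    return respond

def mindstorm(task: str, agents: List[Agent], rounds: int = 2) -> List[str]:
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            transcript.append(agent(task, transcript))
    return transcript

society = [make_stub_agent(n) for n in ("vision_expert", "planner", "critic")]
for message in mindstorm("caption this image", society):
    print(message)
```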

EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

The paper introduces EoRA, a fine-tuning-free method that utilizes eigenspace low-rank approximation and an optimized CUDA kernel to significantly recover the accuracy of compressed LLMs while offering flexible trade-offs between performance and computational overhead.

Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen · Thu, 12 Ma · cs.CL
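
A numpy sketch of the general idea: approximate the compression residual W - W_compressed with a low-rank correction so that W_compressed + A @ B is closer to the original weights. This uses a plain truncated SVD of the residual; EoRA's actual eigenspace projection (which weights the residual by activation statistics) and its optimized CUDA kernel are beyond this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W_compressed = np.round(W * 4) / 4          # stand-in for quantization/pruning

residual = W - W_compressed
U, S, Vt = np.linalg.svd(residual, full_matrices=False)

rank = 16                                    # rank sets the accuracy/overhead trade-off
A = U[:, :rank] * S[:rank]                   # (256, 16)
B = Vt[:rank, :]                             # (16, 256)

err_before = np.linalg.norm(residual)
err_after = np.linalg.norm(W - (W_compressed + A @ B))
print(f"residual norm: {err_before:.3f} -> {err_after:.3f} with rank-{rank} compensation")
```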

ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

This paper introduces the ThinkPatterns-21k dataset to systematically analyze how different thinking patterns affect Large Language Models, revealing that while unstructured monologues benefit models of all sizes, structured thinking aids smaller models but can degrade the performance of larger ones.

Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo · Thu, 12 Ma · cs.CL
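
An illustrative sketch of how the same (question, answer) pair can be wrapped with different thinking patterns before supervised fine-tuning, contrasting an unstructured monologue with a structured decomposition. The pattern names and templates are assumptions for illustration, not the dataset's exact schema.

```python
def unstructured_monologue(question: str, answer: str) -> str:
    return (f"Question: {question}\n"
            f"<think>Let me reason about this freely...</think>\n"
            f"Answer: {answer}")

def structured_decomposition(question: str, answer: str) -> str:
    return (f"Question: {question}\n"
            f"<think>\n"
            f"1. Break the problem into sub-problems.\n"
            f"2. Solve each sub-problem.\n"
            f"3. Combine the partial results.\n"
            f"</think>\n"
            f"Answer: {answer}")

PATTERNS = {"monologue": unstructured_monologue, "decomposition": structured_decomposition}

for name, template in PATTERNS.items():
    print(f"--- {name} ---")
    print(template("What is 17 * 24?", "408"))
```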

BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

This paper introduces "BiasCause," a framework and benchmark of 1,788 manually validated questions designed to evaluate how large language models employ causal reasoning when addressing social biases, revealing that models frequently exhibit biased or "mistaken-biased" reasoning while also identifying specific strategies they use to avoid such biases.

Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda · Thu, 12 Ma · cs.CL
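
A small sketch of the kind of scoring such a benchmark implies: each model response to a bias-probing question is assigned a reasoning category (unbiased, biased, mistaken-biased, avoided) and the categories are tallied. The judge below is a keyword stub standing in for the manual or model-based judgment; the category names mirror the abstract but the rest is assumed.

```python
from collections import Counter

def judge(response: str) -> str:
    # Placeholder for the human/LLM judgment that assigns one of:
    # "unbiased", "biased", "mistaken_biased", "avoided".
    return "unbiased" if "depends on the individual" in response else "biased"

responses = [
    "It depends on the individual, not the group.",
    "Group X is naturally better at this task.",
]
print(Counter(judge(r) for r in responses))
```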

AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

This paper introduces AgentA/B, a novel system that leverages autonomous LLM agents with diverse personas to automatically simulate scalable, interactive user behaviors for web A/B testing, effectively addressing the limitations of traditional methods by emulating human-like interactions without relying on large-scale live traffic.

Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headden, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang, Dakuo Wang · Thu, 12 Ma · cs.CL
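
A toy sketch of the simulation loop the abstract describes: agents with different personas visit one of two page variants, and per-variant behavior is aggregated as in a conventional A/B test. The persona list, the conversion model, and the metric are illustrative assumptions; in the paper the sessions are driven by interactive LLM agents rather than fixed probabilities.

```python
import random

random.seed(7)
PERSONAS = ["bargain_hunter", "brand_loyalist", "first_time_visitor"]

def simulate_session(persona: str, variant: str) -> bool:
    """Placeholder for an LLM agent browsing the page; returns whether it 'converted'."""
    base = {"bargain_hunter": 0.30, "brand_loyalist": 0.50, "first_time_visitor": 0.15}[persona]
    uplift = 0.10 if variant == "B" else 0.0   # assume variant B surfaces discounts
    return random.random() < base + uplift

def run_experiment(n_sessions: int = 1000) -> dict:
    results = {"A": [], "B": []}
    for _ in range(n_sessions):
        persona = random.choice(PERSONAS)
        variant = random.choice(["A", "B"])
        results[variant].append(simulate_session(persona, variant))
    return {v: sum(r) / len(r) for v, r in results.items()}

print(run_experiment())  # e.g. conversion rate per variant, {'A': ..., 'B': ...}
```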

Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

This systematic review introduces the emerging interdisciplinary field of LLM Psychometrics, which applies psychometric theories and instruments to develop comprehensive evaluation frameworks for measuring human-like psychological constructs in large language models, ultimately guiding the creation of more robust, human-centered AI systems.

Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song · Thu, 12 Ma · cs.CL
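
A minimal sketch of the basic workflow this line of work surveys: administer a psychometric instrument to a model item by item and aggregate the responses into a construct score, handling reverse-keyed items. The two items, the 1-5 scale, and the stub "model" are illustrative, not drawn from any validated inventory.

```python
ITEMS = [
    # (trait, prompt, reverse_keyed)
    ("extraversion", "I enjoy starting conversations. (1-5)", False),
    ("extraversion", "I prefer to stay in the background. (1-5)", True),
]

def ask_model(prompt: str) -> int:
    # Placeholder for an LLM call that returns a 1-5 Likert rating.
    return {"I enjoy starting conversations. (1-5)": 4,
            "I prefer to stay in the background. (1-5)": 2}[prompt]

def score(items) -> dict:
    totals, counts = {}, {}
    for trait, prompt, reverse in items:
        rating = ask_model(prompt)
        if reverse:
            rating = 6 - rating          # reverse-key on a 1-5 scale
        totals[trait] = totals.get(trait, 0) + rating
        counts[trait] = counts.get(trait, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}

print(score(ITEMS))  # e.g. {'extraversion': 4.0}
```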