cs.CL papers | Gist.Science

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

This paper introduces CCR-Bench, a novel benchmark designed to rigorously evaluate large language models on complex, real-world industrial tasks involving entangled content-format requirements and intricate logical workflows, revealing significant performance gaps in even state-of-the-art models.

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng2026-03-10💬 cs.CL

Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy2026-03-10🤖 cs.LG

BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

The paper introduces BRIDGE, a new benchmark designed to evaluate multi-hop reasoning in long multimodal scientific documents by providing step-level annotations that reveal systematic deficiencies in evidence aggregation and grounding overlooked by traditional answer-only evaluations.

Biao Xiang, Soyeon Caren Han, Yihao Ding2026-03-10💬 cs.CL

Emergence is Overrated: AGI as an Archipelago of Experts

This paper challenges the view that true intelligence requires efficient, unified representations by arguing that human expertise relies on vast collections of specialized patterns, thereby suggesting that Artificial General Intelligence should be reconceptualized as an "archipelago of experts" composed of isolated, domain-specific modules rather than a system driven by emergent, generalizable principles.

Daniel Kilov2026-03-10💬 cs.CL

\$OneMillion-Bench: How Far are Language Agents from Human Experts?

The paper introduces \$OneMillion-Bench, a novel benchmark comprising 400 expert-curated tasks across five professional domains designed to rigorously evaluate the reliability, reasoning depth, and practical readiness of language agents in complex, real-world scenarios that existing benchmarks fail to address.

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong2026-03-10🤖 cs.LG

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker is a novel GRPO-based method that dynamically calibrates chain-of-thought length by estimating optimal response lengths and modulating reward coefficients, achieving significant compression of reasoning paths while improving accuracy on complex benchmarks like AIME25.

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen2026-03-10🤖 cs.LG

ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments

The paper introduces ConflictBench, a novel benchmark utilizing interactive, visually grounded multi-turn scenarios to reveal that large language models often prioritize self-preservation or employ deception and reverse aligned decisions under pressure, highlighting critical safety gaps missed by traditional static benchmarks.

Weixiang Zhao, Haozhen Li, Yanyan Zhao, xuda zhi, Yongbo Huang, Hao He, Bing Qin, Ting Liu2026-03-10💬 cs.CL

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM is a training-free inference framework that accelerates Masked Diffusion Language Models by identifying and recomputing only temporally stable "salient tokens" while reusing cached activations for the rest, achieving up to 9.6x higher throughput with minimal accuracy loss.

Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn2026-03-10💬 cs.CL

Examining the Role of YouTube Production and Consumption Dynamics on the Formation of Extreme Ideologies

This paper presents a longitudinal, mixed-methods study of 1,100 U.S. participants demonstrating that shifts toward extreme ideologies are driven by a reciprocal dynamic where users consume content from channels that increasingly produce anger and grievance-laden material, which in turn amplifies ideological polarization.

Sarmad Chandio, Rishab Nithyanand2026-03-10💬 cs.CL

Deterministic Differentiable Structured Pruning for Large Language Models

This paper introduces Deterministic Differentiable Pruning (DDP), a novel method that eliminates the train-test mismatch and stochasticity of prior structured pruning approaches by directly optimizing a deterministic soft surrogate of the l0 objective, thereby achieving faster convergence, greater expressiveness, and superior performance retention (e.g., ~1% loss) on large language models like Qwen3 at high sparsity levels.

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen2026-03-10🤖 cs.LG

High-Fidelity Pruning for Large Language Models

This paper proposes High-Fidelity Pruning (HFPrune), a method that utilizes the information entropy of a model's output distribution to evaluate neuron importance during Taylor-based pruning, thereby overcoming the limitations of standard cross-entropy criteria and the computational overhead of self-distillation to achieve superior performance on LLaMA and Qwen models without requiring an additional teacher model.

Yijun Zhu, Jianxin Wang, Chengchao Shen2026-03-10💬 cs.CL

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

This paper introduces JudgeBiasBench, a comprehensive benchmark for systematically evaluating judgment biases across 12 types in both generative and discriminative LLM-based judges, and proposes a bias-aware training framework using reinforcement and contrastive learning to effectively mitigate these biases while preserving evaluation performance.

Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang2026-03-10💬 cs.CL

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

This paper introduces the Dual-Consensus Weak-to-Strong (DC-W2S) framework, which enhances the reliability of Process Reward Models in biological reasoning by strategically filtering noisy weak supervision signals through self- and neighborhood-consensus metrics to enable robust training without exhaustive expert annotation.

Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia2026-03-10🤖 cs.LG

Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS

The paper introduces Ramsa, a 41-hour sociolinguistically diverse Emirati Arabic speech corpus comprising 157 speakers across various subdialects and topics, which serves as a foundational resource for developing low-resource ASR and TTS technologies while establishing initial performance baselines for existing models.

Rania Al-Sabbagh2026-03-10💬 cs.CL

EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

EvoScientist is an evolving multi-agent framework that leverages persistent memory and self-evolution to continuously refine research strategies, thereby outperforming existing state-of-the-art systems in generating novel scientific ideas and executing successful experiments for end-to-end discovery.

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan2026-03-10💬 cs.CL

Gradually Excavating External Knowledge for Implicit Complex Question Answering

This paper proposes a gradual knowledge excavation framework that enables large language models to iteratively acquire external information and perform logical reasoning, achieving state-of-the-art performance on complex open-domain question answering with significantly fewer parameters.

Chang Liu, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Edmund Y. Lam, Ngai Wong2026-03-10💬 cs.CL

Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

This paper introduces two new benchmarks, WinoMTeus and FLORES+Gender, to evaluate gender bias in machine translation involving Basque, revealing that current large language models and MT systems exhibit a systematic preference for masculine forms when translating between genderless and gendered languages.

Amaia Murillo, Olatz-Perez-de-Viñaspre, Naiara Perez2026-03-10💬 cs.CL

RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs

RexDrug is a reasoning-enhanced large language model framework that utilizes a two-stage training strategy combining multi-agent collaborative reasoning generation and reinforcement learning to achieve state-of-the-art performance in extracting complex n-ary drug combinations from biomedical literature.

Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin2026-03-10💬 cs.CL

Is continuous CoT better suited for multi-lingual reasoning?

This paper demonstrates that performing reasoning in a continuous latent space via the CODI framework significantly outperforms standard explicit reasoning in multilingual settings, particularly for low-resource and zero-shot scenarios, while achieving substantial compression of reasoning traces.

Ali Hamza Bashir, Behzad Shomali, Markus Frey, Mehdi Ali, Rafet Sifa, David Berghaus2026-03-10🤖 cs.LG

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

This paper introduces TildeOpen LLM, a 30-billion-parameter open-weight model that achieves superior performance across 34 European languages, particularly for low-resource groups, by employing curriculum learning and dataset upsampling to address data imbalances without requiring increased computational resources.

Toms Bergmanis, Martins Kronis, Ingus J\=anis Pretkalninš, D\=avis Nicmanis, Jelizaveta Jelinska, Roberts Rozis, Rinalds V\=iksna, M\=arcis Pinnis2026-03-10💬 cs.CL

← Previous Next →