Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models

This paper introduces MicroCoder-GRPO, an enhanced Group Relative Policy Optimization framework featuring innovations like conditional truncation masking and diversity-driven temperature selection, alongside a challenging new dataset and robust evaluator, to overcome training bottlenecks in modern coding models and achieve significant performance gains on LiveCodeBench v6.

Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou, Li Dong, Nigel Collier, Furu Wei · Tue, 10 Ma · cs.LG
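The diversity-driven temperature selection named above could plausibly be sketched as picking the temperature whose samples are most diverse. This is a minimal illustration, not the paper's method: the `sample_fn` hook and the unique-fraction diversity proxy are assumptions.

```python
def pick_temperature(sample_fn, temperatures, k=8):
    """Pick the sampling temperature whose k generations are most diverse.

    Diversity is proxied by the fraction of unique outputs; `sample_fn`
    is a hypothetical hook `(temperature, k) -> list of outputs` standing
    in for the model's sampler. Both choices are illustrative assumptions.
    """
    def diversity(outputs):
        return len(set(outputs)) / len(outputs)
    # Evaluate each candidate temperature and keep the most diverse one.
    return max(temperatures, key=lambda t: diversity(sample_fn(t, k)))
```

In an RL rollout loop, this selection would run per prompt (or per batch) before sampling the group of responses that GRPO compares.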

Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

This paper introduces MicroCoder, a high-quality dataset of curated, recent, and challenging competitive programming problems processed through a four-stage framework with automatic difficulty filtering, which significantly boosts coding model performance on unseen hard tasks compared to existing baselines.

Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun, Qiufeng Yin, Ying Xin, Scarlett Li, Lei Cui, Nigel Collier, Furu Wei · Tue, 10 Ma · cs.LG

Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

This study evaluates seven state-of-the-art large language models in the underrepresented Nepali cultural context using a Dual-Metric Bias Assessment framework, revealing that while explicit agreement with biased statements is measurable, implicit generative bias is distinct, follows a non-linear relationship with temperature, and is poorly predicted by agreement metrics, thereby highlighting the critical need for culturally grounded datasets and evaluation strategies.

Ashish Pandey, Tek Raj Chhetri · Tue, 10 Ma · cs.CL

Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

This paper introduces the AEPC-QA benchmark to evaluate 51 large language models on Quebec insurance regulations. While inference-time reasoning and retrieval-augmented generation can significantly boost accuracy, RAG's tendency to cause context distraction, together with generalist models outperforming specialized French fine-tuned ones, highlights critical stability challenges for autonomous deployment.

David Beauchemin, Richard Khoury · Tue, 10 Ma · cs.CL

AI Steerability 360: A Toolkit for Steering Large Language Models

The paper introduces AI Steerability 360, an open-source, Hugging Face-native Python toolkit that provides a unified interface for composing, evaluating, and comparing diverse large language model steering methods across input, structural, state, and output control surfaces.

Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney · Tue, 10 Ma · cs.CL

SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

The paper introduces SynPlanResearch-R1, a framework that synthesizes tool-use trajectories to encourage deeper exploration during supervised fine-tuning, thereby overcoming the limitations of reinforcement learning with verifiable rewards and significantly improving research agent performance across multiple benchmarks.

Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang · Tue, 10 Ma · cs.CL

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

This paper introduces CCR-Bench, a novel benchmark designed to rigorously evaluate large language models on complex, real-world industrial tasks involving entangled content-format requirements and intricate logical workflows, revealing significant performance gaps in even state-of-the-art models.

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng · Tue, 10 Ma · cs.CL

Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

This paper introduces a particle filtering framework to rigorously analyze the accuracy-cost tradeoffs of parallel inference methods in large language models, establishing theoretical guarantees and identifying fundamental limits while demonstrating that sampling error alone does not fully predict final model accuracy.

Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy · Tue, 10 Ma · cs.LG
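The particle-filtering view of parallel inference described above can be illustrated with multinomial resampling: parallel samples act as particles, a scorer assigns weights, and resampling concentrates compute on high-weight continuations. This is a toy sketch; the scoring function is assumed external, and the paper's actual framework and guarantees are more elaborate.

```python
import random

def resample(candidates, weights, k, rng=None):
    """Multinomial resampling: draw k candidates with probability
    proportional to their weights. Candidates with zero weight are
    never drawn; high-weight candidates are drawn repeatedly, which
    is how resampling focuses later compute on promising particles."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(candidates, weights=probs, k=k)
```

In a full reject-resample-repeat loop, the survivors would be extended by further sampling and re-scored, iterating until a budget is exhausted.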

$OneMillion-Bench: How Far are Language Agents from Human Experts?

The paper introduces $OneMillion-Bench, a novel benchmark comprising 400 expert-curated tasks across five professional domains designed to rigorously evaluate the reliability, reasoning depth, and practical readiness of language agents in complex, real-world scenarios that existing benchmarks fail to address.

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong · Tue, 10 Ma · cs.LG

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker is a novel GRPO-based method that dynamically calibrates chain-of-thought length by estimating optimal response lengths and modulating reward coefficients, achieving significant compression of reasoning paths while improving accuracy on complex benchmarks like AIME25.

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen · Tue, 10 Ma · cs.LG
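The reward-coefficient modulation described above can be sketched as scaling a correctness reward by how far a response's length strays from an estimated optimal length. This is a minimal illustration under stated assumptions: the linear penalty shape and the `alpha` coefficient are placeholders, not the paper's actual calibration schedule.

```python
def length_calibrated_reward(base_reward, length, target_length, alpha=0.5):
    """Scale a correctness reward by the relative deviation of the
    response length from an estimated optimal length. Responses at
    the target keep the full reward; over- or under-long responses
    are damped, never rewarded negatively (clamped at zero)."""
    deviation = abs(length / target_length - 1.0)
    return base_reward * max(0.0, 1.0 - alpha * deviation)
```

Inside a GRPO loop, this scaled reward would replace the raw correctness signal when computing group-relative advantages, nudging the policy toward the calibrated chain-of-thought length.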

ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments

The paper introduces ConflictBench, a novel benchmark utilizing interactive, visually grounded multi-turn scenarios to reveal that large language models often prioritize self-preservation or employ deception and reverse aligned decisions under pressure, highlighting critical safety gaps missed by traditional static benchmarks.

Weixiang Zhao, Haozhen Li, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Bing Qin, Ting Liu · Tue, 10 Ma · cs.CL

Deterministic Differentiable Structured Pruning for Large Language Models

This paper introduces Deterministic Differentiable Pruning (DDP), a novel method that eliminates the train-test mismatch and stochasticity of prior structured pruning approaches by directly optimizing a deterministic soft surrogate of the ℓ0 objective, thereby achieving faster convergence, greater expressiveness, and superior performance retention (e.g., ~1% loss) on large language models like Qwen3 at high sparsity levels.

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen · Tue, 10 Ma · cs.LG
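A deterministic soft surrogate of the ℓ0 objective can be illustrated with a steep sigmoid gate: each pruning score maps to a soft gate in (0, 1), and the gate sum approximates the number of retained units with no stochastic sampling, so training and test behavior match. The sigmoid parameterization here is an illustrative choice; the paper's exact surrogate may differ.

```python
import math

def soft_l0(scores, temperature=0.1):
    """Deterministic differentiable surrogate of the l0 norm.

    Each score passes through a steep sigmoid (temperature controls
    steepness): strongly negative scores give gates near 0 (pruned),
    strongly positive scores give gates near 1 (kept). Summing the
    gates yields a smooth, everywhere-differentiable approximation
    of the count of nonzero (kept) units.
    """
    gates = [1.0 / (1.0 + math.exp(-s / temperature)) for s in scores]
    return gates, sum(gates)
```

During pruning, the gate sum would enter the loss as a sparsity penalty, and annealing the temperature sharpens the gates toward hard 0/1 keep decisions.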