High-Fidelity Pruning for Large Language Models

This paper proposes High-Fidelity Pruning (HFPrune), a method that utilizes the information entropy of a model's output distribution to evaluate neuron importance during Taylor-based pruning, thereby overcoming the limitations of standard cross-entropy criteria and the computational overhead of self-distillation to achieve superior performance on LLaMA and Qwen models without requiring an additional teacher model.
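
One plausible reading of an entropy-aware Taylor criterion is sketched below: score each weight by |w · ∂L/∂w|, with the loss modulated by the entropy of the output distribution. The weighting scheme and all names here are our assumptions; the paper's exact HFPrune formulation is not reproduced.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_taylor_importance(model, batch):
    """Hypothetical sketch: modulate first-order Taylor importance
    scores by the entropy of the model's output distribution."""
    logits = model(batch["input_ids"]).logits          # (B, T, V)
    probs = F.softmax(logits, dim=-1)
    # Mean per-token entropy of the output distribution.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()

    # Standard next-token cross-entropy, used only to obtain gradients.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
    )
    # Scale the objective by the (detached) entropy signal -- one
    # plausible reading of "entropy-aware"; the paper may weight differently.
    (entropy.detach() * loss).backward()

    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            # First-order Taylor importance: |weight * gradient|.
            scores[name] = (p * p.grad).abs().detach()
    return scores
```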

Yijun Zhu, Jianxin Wang, Chengchao Shen · Tue, 10 Ma · cs.CL

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

This paper introduces JudgeBiasBench, a comprehensive benchmark for systematically evaluating judgment biases across 12 types in both generative and discriminative LLM-based judges, and proposes a bias-aware training framework using reinforcement and contrastive learning to effectively mitigate these biases while preserving evaluation performance.
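
The 12 bias types are not enumerated in this summary; as one concrete illustration, position bias in a pairwise judge can be measured with a simple order-swap probe. This is a sketch under our own assumptions, not the benchmark's protocol:

```python
def positional_bias_rate(judge, prompts, resp_a, resp_b):
    """Illustrative probe for one common judge bias (position bias):
    query the judge twice with the candidate order swapped and count
    how often its verdict flips with order alone.
    `judge(prompt, first, second)` is a hypothetical callable that
    returns "first" or "second"."""
    flips = 0
    for p, a, b in zip(prompts, resp_a, resp_b):
        v1 = judge(p, a, b)   # response A shown first
        v2 = judge(p, b, a)   # response B shown first
        # A consistent judge prefers the same *response* in both orders.
        pick1 = a if v1 == "first" else b
        pick2 = b if v2 == "first" else a
        flips += pick1 != pick2
    return flips / len(prompts)
```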

Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang · Tue, 10 Ma · cs.CL

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

This paper introduces the Dual-Consensus Weak-to-Strong (DC-W2S) framework, which enhances the reliability of Process Reward Models in biological reasoning by strategically filtering noisy weak supervision signals through self- and neighborhood-consensus metrics to enable robust training without exhaustive expert annotation.
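
A rough sketch of how dual-consensus filtering could look, taking self-consensus as agreement among repeated weak-model samples and neighborhood consensus as label agreement among embedding-space neighbors. Both metrics here are our assumptions, not the paper's definitions:

```python
import numpy as np

def dual_consensus_filter(weak_votes, embeddings, k=5,
                          self_thresh=0.8, nbr_thresh=0.6):
    """Hypothetical sketch: keep a weak step-label only if (1) repeated
    samples from the weak supervisor agree with each other and (2) the
    majority label agrees with the labels of the k nearest neighbors
    in embedding space.

    weak_votes: (N, S) array of S sampled 0/1 labels per example.
    embeddings: (N, D) array of example embeddings.
    """
    maj = (weak_votes.mean(axis=1) >= 0.5).astype(int)   # majority label
    self_cons = np.maximum(weak_votes.mean(1), 1 - weak_votes.mean(1))

    # Cosine-similarity neighborhoods.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    nbr_cons = (maj[nbrs] == maj[:, None]).mean(axis=1)

    keep = (self_cons >= self_thresh) & (nbr_cons >= nbr_thresh)
    return keep, maj
```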

Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia · Tue, 10 Ma · cs.LG

EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

EvoScientist is a multi-agent framework that uses persistent memory and self-evolution to continuously refine its research strategies, outperforming existing state-of-the-art systems in both generating novel scientific ideas and executing successful experiments for end-to-end discovery.
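
A schematic of the memory-plus-evolution loop, as we read the framework description; this is not the authors' code, and `propose` and `run_experiment` stand in for LLM-agent calls:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Persistent store of past strategies and their outcomes."""
    records: list = field(default_factory=list)

    def add(self, strategy, score):
        self.records.append((strategy, score))

    def best(self, n=3):
        return sorted(self.records, key=lambda r: -r[1])[:n]

def evolve(propose, run_experiment, memory, generations=10):
    """Schematic self-evolution loop: each generation conditions new
    research strategies on the best-scoring entries in persistent
    memory, so successful strategies compound across generations."""
    for _ in range(generations):
        strategy = propose(context=memory.best())  # refine from past successes
        score = run_experiment(strategy)           # execute and evaluate
        memory.add(strategy, score)
    return memory.best(1)[0]
```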

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan · Tue, 10 Ma · cs.CL

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

This paper introduces TildeOpen LLM, a 30-billion-parameter open-weight model that achieves superior performance across 34 European languages, particularly for low-resource language groups, by employing curriculum learning and dataset upsampling to address data imbalances without requiring increased computational resources.
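
Upsampling low-resource languages is commonly done with temperature-scaled sampling; below is a small illustrative sketch of that standard recipe, not necessarily the schedule the paper uses:

```python
def upsample_weights(token_counts, alpha=0.3):
    """Temperature-based upsampling: sample each language with
    probability proportional to count**alpha (0 < alpha < 1), which
    lifts low-resource languages relative to their raw share.

    token_counts: dict mapping language -> token count.
    """
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Example: a language holding about 1% of the tokens receives roughly
# a 20% sampling share at alpha=0.3.
print(upsample_weights({"en": 9_000_000, "lv": 90_000}))
```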

Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalninš, Dāvis Nicmanis, Jelizaveta Jelinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis · Tue, 10 Ma · cs.CL

Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code

This paper introduces CoPaLink, an automated approach that enhances bioinformatics workflow reproducibility by integrating Named Entity Recognition and entity linking to connect tool mentions in scientific papers with their corresponding implementations in executable workflow code.
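
A toy sketch of the mention-to-code linking step, assuming a lexicon-based tagger and fuzzy identifier matching; CoPaLink's actual NER and entity-linking models are not reproduced here:

```python
import difflib
import re

def link_tools(paper_text, workflow_code, tool_lexicon):
    """Schematic pipeline: find tool mentions in the paper with a
    lexicon-based tagger, then link each mention to the most similar
    identifier appearing in the executable workflow code."""
    mentions = [t for t in tool_lexicon
                if re.search(rf"\b{re.escape(t)}\b", paper_text, re.I)]
    identifiers = list(set(re.findall(r"[A-Za-z_][A-Za-z0-9_.-]+",
                                      workflow_code)))

    links = {}
    for m in mentions:
        match = difflib.get_close_matches(m.lower(),
                                          [i.lower() for i in identifiers],
                                          n=1, cutoff=0.8)
        links[m] = match[0] if match else None   # None = unlinked mention
    return links
```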

Clémence Sebe, Olivier Ferret, Aurélie Névéol, Mahdi Esmailoghli, Ulf Leser, Sarah Cohen-Boulakia · Tue, 10 Ma · cs.CL

Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks

This paper introduces the Cross-Lingual Transfer Matrix (CLTM) to systematically quantify language-dependent performance variations in paralinguistic tasks like gender identification and speaker verification, revealing that despite their acoustic nature, these tasks exhibit distinct cross-lingual transfer patterns when using multilingual HuBERT-based encoders.
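
The transfer matrix itself is straightforward to frame: entry (i, j) scores a probe trained on language i and evaluated on language j, on top of frozen multilingual speech features. A sketch with placeholder `fit` and `evaluate` routines, under our own reading of the construction:

```python
import numpy as np

def transfer_matrix(train_sets, test_sets, fit, evaluate):
    """Build a cross-lingual transfer matrix: M[i, j] is the score of
    a probe trained on language i and tested on language j.
    `fit` and `evaluate` are placeholders for the actual training and
    scoring routines over frozen encoder features."""
    langs = sorted(train_sets)
    M = np.zeros((len(langs), len(langs)))
    for i, src in enumerate(langs):
        probe = fit(train_sets[src])                   # train on source
        for j, tgt in enumerate(langs):
            M[i, j] = evaluate(probe, test_sets[tgt])  # test on target
    return langs, M
```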

Pol Buitrago, Oriol Pareras, Federico Costa, Javier Hernando · Tue, 10 Ma · cs.CL

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

This study draws on a 172-billion-token evaluation across diverse models, context lengths, and hardware platforms to show that while model selection is the primary determinant of accuracy, hallucination rates in document Q&A rise significantly with context length and vary non-linearly with temperature, indicating that grounding ability and fabrication resistance are distinct capabilities.
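
The evaluation grid such a study implies can be sketched as a sweep over temperature and context length; the generation and grading interfaces below are hypothetical placeholders, not the study's harness:

```python
def hallucination_sweep(model, docs, qa_pairs, grade,
                        temperatures=(0.0, 0.5, 1.0),
                        context_lengths=(4_096, 16_384, 65_536)):
    """For each (temperature, context length) cell, ask document-grounded
    questions and record the fraction of answers that `grade` marks as
    fabricated. `model.generate` and `grade` are assumed interfaces."""
    results = {}
    for temp in temperatures:
        for ctx in context_lengths:
            fabricated = 0
            for doc, (question, reference) in zip(docs, qa_pairs):
                # Truncate the document to the context budget
                # (token-level in practice; character-level here).
                answer = model.generate(doc[:ctx], question,
                                        temperature=temp)
                fabricated += grade(answer, reference) == "hallucinated"
            results[(temp, ctx)] = fabricated / len(qa_pairs)
    return results
```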

JV Roig · Tue, 10 Ma · cs.CL