High-Fidelity Pruning for Large Language Models

This paper proposes High-Fidelity Pruning (HFPrune), a method that utilizes the information entropy of a model's output distribution to evaluate neuron importance during Taylor-based pruning, thereby overcoming the limitations of standard cross-entropy criteria and the computational overhead of self-distillation to achieve superior performance on LLaMA and Qwen models without requiring an additional teacher model.
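
One plausible reading of an entropy-aware Taylor criterion is sketched below: score each weight by |w · ∂L/∂w|, with the loss modulated by the entropy of the output distribution. The weighting scheme and all names here are our assumptions; the paper's exact HFPrune formulation is not reproduced.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_taylor_importance(model, batch):
    """Hypothetical sketch: modulate first-order Taylor importance
    scores by the entropy of the model's output distribution."""
    logits = model(batch["input_ids"]).logits          # (B, T, V)
    probs = F.softmax(logits, dim=-1)
    # Mean per-token entropy of the output distribution.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()

    # Standard next-token cross-entropy, used only to obtain gradients.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
    )
    # Scale the objective by the (detached) entropy signal -- one
    # plausible reading of "entropy-aware"; the paper may weight differently.
    (entropy.detach() * loss).backward()

    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            # First-order Taylor importance: |weight * gradient|.
            scores[name] = (p * p.grad).abs().detach()
    return scores
```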

Yijun Zhu, Jianxin Wang, Chengchao Shen · Tue, 10 Ma · cs.CL

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

This paper introduces JudgeBiasBench, a comprehensive benchmark for systematically evaluating judgment biases across 12 types in both generative and discriminative LLM-based judges, and proposes a bias-aware training framework using reinforcement and contrastive learning to effectively mitigate these biases while preserving evaluation performance.
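
The 12 bias types are not enumerated in this summary; as one concrete illustration, position bias in a pairwise judge can be measured with a simple order-swap probe. This is a sketch under our own assumptions, not the benchmark's protocol:

```python
def positional_bias_rate(judge, prompts, resp_a, resp_b):
    """Illustrative probe for one common judge bias (position bias):
    query the judge twice with the candidate order swapped and count
    how often its verdict flips with order alone.
    `judge(prompt, first, second)` is a hypothetical callable that
    returns "first" or "second"."""
    flips = 0
    for p, a, b in zip(prompts, resp_a, resp_b):
        v1 = judge(p, a, b)   # response A shown first
        v2 = judge(p, b, a)   # response B shown first
        # A consistent judge prefers the same *response* in both orders.
        pick1 = a if v1 == "first" else b
        pick2 = b if v2 == "first" else a
        flips += pick1 != pick2
    return flips / len(prompts)
```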

Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang · Tue, 10 Ma · cs.CL

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

This paper introduces the Dual-Consensus Weak-to-Strong (DC-W2S) framework, which enhances the reliability of Process Reward Models in biological reasoning by strategically filtering noisy weak supervision signals through self- and neighborhood-consensus metrics to enable robust training without exhaustive expert annotation.
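
A rough sketch of how dual-consensus filtering could look, taking self-consensus as agreement among repeated weak-model samples and neighborhood consensus as label agreement among embedding-space neighbors. Both metrics here are our assumptions, not the paper's definitions:

```python
import numpy as np

def dual_consensus_filter(weak_votes, embeddings, k=5,
                          self_thresh=0.8, nbr_thresh=0.6):
    """Hypothetical sketch: keep a weak step-label only if (1) repeated
    samples from the weak supervisor agree with each other and (2) the
    majority label agrees with the labels of the k nearest neighbors
    in embedding space.

    weak_votes: (N, S) array of S sampled 0/1 labels per example.
    embeddings: (N, D) array of example embeddings.
    """
    maj = (weak_votes.mean(axis=1) >= 0.5).astype(int)   # majority label
    self_cons = np.maximum(weak_votes.mean(1), 1 - weak_votes.mean(1))

    # Cosine-similarity neighborhoods.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    nbr_cons = (maj[nbrs] == maj[:, None]).mean(axis=1)

    keep = (self_cons >= self_thresh) & (nbr_cons >= nbr_thresh)
    return keep, maj
```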

Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia · Tue, 10 Ma · cs.LG

EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

EvoScientist is a multi-agent framework that uses persistent memory and self-evolution to continuously refine its research strategies, outperforming existing state-of-the-art systems in both generating novel scientific ideas and executing successful experiments for end-to-end discovery.
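
A schematic of the memory-plus-evolution loop, as we read the framework description; this is not the authors' code, and `propose` and `run_experiment` stand in for LLM-agent calls:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Persistent store of past strategies and their outcomes."""
    records: list = field(default_factory=list)

    def add(self, strategy, score):
        self.records.append((strategy, score))

    def best(self, n=3):
        return sorted(self.records, key=lambda r: -r[1])[:n]

def evolve(propose, run_experiment, memory, generations=10):
    """Schematic self-evolution loop: each generation conditions new
    research strategies on the best-scoring entries in persistent
    memory, so successful strategies compound across generations."""
    for _ in range(generations):
        strategy = propose(context=memory.best())  # refine from past successes
        score = run_experiment(strategy)           # execute and evaluate
        memory.add(strategy, score)
    return memory.best(1)[0]
```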

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan · Tue, 10 Ma · cs.CL

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

This paper introduces TildeOpen LLM, a 30-billion-parameter open-weight model that achieves superior performance across 34 European languages, particularly for low-resource language groups, by employing curriculum learning and dataset upsampling to address data imbalances without requiring increased computational resources.
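
Upsampling low-resource languages is commonly done with temperature-scaled sampling; below is a small illustrative sketch of that standard recipe, not necessarily the schedule the paper uses:

```python
def upsample_weights(token_counts, alpha=0.3):
    """Temperature-based upsampling: sample each language with
    probability proportional to count**alpha (0 < alpha < 1), which
    lifts low-resource languages relative to their raw share.

    token_counts: dict mapping language -> token count.
    """
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Example: a language holding about 1% of the tokens receives roughly
# a 20% sampling share at alpha=0.3.
print(upsample_weights({"en": 9_000_000, "lv": 90_000}))
```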

Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalninš, Dāvis Nicmanis, Jelizaveta Jelinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis · Tue, 10 Ma · cs.CL

Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code

This paper introduces CoPaLink, an automated approach that enhances bioinformatics workflow reproducibility by integrating Named Entity Recognition and entity linking to connect tool mentions in scientific papers with their corresponding implementations in executable workflow code.
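
A toy sketch of the mention-to-code linking step, assuming a lexicon-based tagger and fuzzy identifier matching; CoPaLink's actual NER and entity-linking models are not reproduced here:

```python
import difflib
import re

def link_tools(paper_text, workflow_code, tool_lexicon):
    """Schematic pipeline: find tool mentions in the paper with a
    lexicon-based tagger, then link each mention to the most similar
    identifier appearing in the executable workflow code."""
    mentions = [t for t in tool_lexicon
                if re.search(rf"\b{re.escape(t)}\b", paper_text, re.I)]
    identifiers = list(set(re.findall(r"[A-Za-z_][A-Za-z0-9_.-]+",
                                      workflow_code)))

    links = {}
    for m in mentions:
        match = difflib.get_close_matches(m.lower(),
                                          [i.lower() for i in identifiers],
                                          n=1, cutoff=0.8)
        links[m] = match[0] if match else None   # None = unlinked mention
    return links
```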

Clémence Sebe, Olivier Ferret, Aurélie Névéol, Mahdi Esmailoghli, Ulf Leser, Sarah Cohen-Boulakia · Tue, 10 Ma · cs.CL

Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks

This paper introduces the Cross-Lingual Transfer Matrix (CLTM) to systematically quantify language-dependent performance variations in paralinguistic tasks like gender identification and speaker verification, revealing that despite their acoustic nature, these tasks exhibit distinct cross-lingual transfer patterns when using multilingual HuBERT-based encoders.
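
The transfer matrix itself is straightforward to frame: entry (i, j) scores a probe trained on language i and evaluated on language j, on top of frozen multilingual speech features. A sketch with placeholder `fit` and `evaluate` routines, under our own reading of the construction:

```python
import numpy as np

def transfer_matrix(train_sets, test_sets, fit, evaluate):
    """Build a cross-lingual transfer matrix: M[i, j] is the score of
    a probe trained on language i and tested on language j.
    `fit` and `evaluate` are placeholders for the actual training and
    scoring routines over frozen encoder features."""
    langs = sorted(train_sets)
    M = np.zeros((len(langs), len(langs)))
    for i, src in enumerate(langs):
        probe = fit(train_sets[src])                   # train on source
        for j, tgt in enumerate(langs):
            M[i, j] = evaluate(probe, test_sets[tgt])  # test on target
    return langs, M
```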

Pol Buitrago, Oriol Pareras, Federico Costa, Javier Hernando · Tue, 10 Ma · cs.CL

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

This study draws on a 172-billion-token evaluation across diverse models, context lengths, and hardware platforms to show that while model selection is the primary determinant of accuracy, hallucination rates in document Q&A rise significantly with context length and vary non-linearly with temperature, indicating that grounding ability and fabrication resistance are distinct capabilities.
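
The evaluation grid such a study implies can be sketched as a sweep over temperature and context length; the generation and grading interfaces below are hypothetical placeholders, not the study's harness:

```python
def hallucination_sweep(model, docs, qa_pairs, grade,
                        temperatures=(0.0, 0.5, 1.0),
                        context_lengths=(4_096, 16_384, 65_536)):
    """For each (temperature, context length) cell, ask document-grounded
    questions and record the fraction of answers that `grade` marks as
    fabricated. `model.generate` and `grade` are assumed interfaces."""
    results = {}
    for temp in temperatures:
        for ctx in context_lengths:
            fabricated = 0
            for doc, (question, reference) in zip(docs, qa_pairs):
                # Truncate the document to the context budget
                # (token-level in practice; character-level here).
                answer = model.generate(doc[:ctx], question,
                                        temperature=temp)
                fabricated += grade(answer, reference) == "hallucinated"
            results[(temp, ctx)] = fabricated / len(qa_pairs)
    return results
```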

JV Roig · Tue, 10 Ma · cs.CL