cs.CL papers | Gist.Science

Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models

This paper presents a fully open-source, locally deployable pipeline using the Qwen2.5-72B model to accurately extract and link longitudinal tumor burden data from radiology reports in compliance with RECIST criteria, demonstrating that privacy-preserving open-source large language models can achieve clinically meaningful performance in oncology.

Luc Builtjes, Alessa Hering | Wed, 11 Ma | 💬 cs.CL

Surgical Repair of Collapsed Attention Heads in ALiBi Transformers

This paper identifies a systematic attention collapse pathology in BLOOM models caused by ALiBi positional encoding and introduces a surgical reinitialization technique that successfully recovers nearly all functional attention heads, demonstrating that pretrained attention configurations are suboptimal local minima.

Palmer Schallon | Wed, 11 Ma | 💬 cs.CL

Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models

This paper guides political scientists in choosing NLP strategies by demonstrating that fine-tuning general models like ModernBERT often suffices for high-frequency tasks, reserving the need for specialized, domain-specific models for rare event categories where performance gaps are most pronounced.

Shreyas Meher | Wed, 11 Ma | 💬 cs.CL

ALARM: Audio-Language Alignment for Reasoning Models

The paper introduces ALARM, a 4B-parameter audio-language model that employs a self-rephrasing strategy to align self-generated reasoning traces with auditory inputs and fuses multiple audio encoders, achieving state-of-the-art open-source performance on audio-reasoning benchmarks while preserving textual capabilities.

Petr Grinberg, Hassan Shahmohammadi | Wed, 11 Ma | 💬 cs.CL

Modelling the Diachronic Emergence of Phoneme Frequency Distributions

This paper demonstrates that key statistical regularities in phoneme frequency distributions, such as exponential-tailed patterns and the inverse relationship between inventory size and relative entropy, can emerge naturally from a stochastic model of diachronic sound change incorporating functional load and a stabilizing preference for inventory size, rather than requiring explicit optimization mechanisms.

Fermín Moscoso del Prado Martín, Suchir Salhan | Wed, 11 Ma | 💬 cs.CL

Self-hosted Lecture-to-Quiz: Local LLM MCQ Generation with Deterministic Quality Control

This paper presents an end-to-end, self-hosted pipeline that converts lecture PDFs into multiple-choice questions using a local LLM and deterministic quality control, ensuring privacy and accountability while releasing a validated 24-question dataset with a detailed warning taxonomy for educational use.

Seine A. Shintani | Wed, 11 Ma | 💻 cs

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

This paper proposes "LLM as a Meta-Judge," a scalable framework that generates synthetic evaluation datasets through controlled semantic degradation to validate NLP metrics, demonstrating that this approach achieves high alignment with human benchmarks and offers a viable, cost-effective alternative to expensive human annotations.

Lukáš Eigler, Jindřich Libovický, David Hurych | Wed, 11 Ma | 💬 cs.CL

Reward Prediction with Factorized World States

This paper introduces StateFactory, a method that leverages language models to transform unstructured observations into hierarchical, factorized world states, enabling accurate zero-shot reward prediction via semantic similarity and significantly improving agent planning performance across diverse domains.

Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao, Kai Zhang, Pascale Fung | Wed, 11 Ma | 💬 cs.CL

Quantifying and extending the coverage of spatial categorization data sets

This paper demonstrates that large language models can effectively align with human spatial categorization labels to guide the strategic expansion of the Topological Relations Picture Series (TRPS), resulting in a new dataset with 42 scenes that offers superior coverage of spatial relations compared to previous extensions.

Wanchun Li, Alexandra Carstensen, Yang Xu, Terry Regier, Charles Kemp | Wed, 11 Ma | 💬 cs.CL

LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression

LooComp is a lightweight, encoder-only Transformer framework that employs a margin-based leave-one-out strategy to efficiently compress retrieval contexts by identifying and retaining only query-critical sentences, thereby achieving high compression ratios without sacrificing question-answering performance.

Thao Do, Dinh Phu Tran, An Vo, Seon Kwon Kim, Daeyoung Kim | Wed, 11 Ma | 💬 cs.CL

SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models

The paper proposes SPAR-K, a modality-aware early exit framework that accelerates interleaved spoken language model inference by employing a scheduled alternating-depth strategy for speech tokens, achieving significant reductions in decoding depth while preserving question-answering accuracy and perceptual quality without auxiliary overhead.

Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi Lee | Wed, 11 Ma | 💬 cs.CL

DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval

The paper proposes DEO, a training-free method that optimizes query embeddings through decomposition and contrastive objectives to significantly improve negation-aware text and multimodal retrieval without requiring additional model fine-tuning or data.

Taegyeong Lee, Jiwon Park, Seunghyun Hwang, JooYoung Jang | Wed, 11 Ma | 💬 cs.CL

Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety

This paper introduces a "Bioalignment" framework to measure and mitigate LLM biases favoring synthetic solutions over biological ones, demonstrating that targeted fine-tuning on a curated corpus of biological literature significantly increases models' preference for bio-based approaches without compromising general capabilities.

Trent R Northen, Mingxun Wang | Wed, 11 Ma | 💬 cs.CL

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

This paper systematically diagnoses the performance gap between text and image inputs in multimodal LLMs, revealing that visual text primarily amplifies reading errors rather than reasoning failures, and proposes a self-distillation method that effectively bridges this gap by training models on their own text-based reasoning traces paired with image inputs.

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai | Wed, 11 Ma | 💬 cs.CL

Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning

This paper proposes a confidence-aware self-consistency framework that adaptively selects between single-path and multi-path reasoning based on features from a single trajectory, achieving comparable accuracy to multi-path baselines while reducing token usage by up to 80% without additional fine-tuning.

Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Marlin, Zhijun Yin | Wed, 11 Ma | 💬 cs.CL

Automated Thematic Analysis for Clinical Qualitative Data: Iterative Codebook Refinement with Full Provenance

This paper presents an automated thematic analysis framework that combines iterative codebook refinement with full provenance tracking to significantly improve the scalability, reproducibility, and expert alignment of qualitative clinical data analysis compared to existing baselines.

Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Joseph Skrovan, Mehak Beri, Hitakshi Modi, Andrew Well, Carlos M. Mery, Yan Zhang, Mia K. Markey, Ying Ding | Wed, 11 Ma | 💬 cs.CL

VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation

This paper presents an empirical study mapping the interactions between model characteristics and prompt engineering strategies for Verilog code generation, revealing which trends generalize across diverse language models and benchmarks through controlled experiments.

Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh Karri | Wed, 11 Ma | 💻 cs

SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation

The paper introduces SciTaRC, an expert-authored benchmark showing that current state-of-the-art AI models struggle with scientific tabular questions that require both deep language reasoning and complex computation. It attributes the failures to a universal "execution bottleneck": models fail to faithfully execute their plans even when the underlying strategy is correct.

Hexuan Wang, Yaxuan Ren, Srikar Bommireddypalli, Shuxian Chen, Adarsh Prabhudesai, Rongkun Zhou, Elina Baral, Philipp Koehn | Wed, 11 Ma | 💬 cs.CL

ConFu: Contemplate the Future for Better Speculative Sampling

This paper introduces ConFu, a novel speculative decoding framework that enhances draft model accuracy by enabling future anticipation through contemplate tokens and soft prompts, thereby achieving an 8–11% improvement in token acceptance rates and generation speed over state-of-the-art methods like EAGLE-3.

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun | Wed, 11 Ma | 💬 cs.CL

MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

The paper introduces MultiGraSCCo, a multilingual benchmark containing over 2,500 annotated personal identifiers across ten languages. It was created by applying culturally adapted machine translation to synthetic data, which supports the development and evaluation of anonymization systems while avoiding the privacy restrictions that apply to real patient data.

Ibrahim Baroud, Christoph Otto, Vera Czehmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, Roland Roller | Wed, 11 Ma | 💬 cs.CL
