cs.CL papers | Gist.Science

Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model

This paper presents Bielik-Q2-Sharp, a systematic evaluation of six 2-bit quantization methods on a Polish 11B language model that identifies QuIP# as a high-performing variant comparable to the IQ2_XXS baseline while revealing a critical dissociation between log-likelihood preservation and autoregressive generation in rotation-based methods.

Jakub Prejzner | 2026-03-06 | cs

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

The paper introduces AgentIR, a reasoning-aware retrieval paradigm and associated data synthesis method (DR-Synth) that leverage Deep Research agents' explicit intermediate thought traces to train the AgentIR-4B embedding model, which significantly outperforms conventional retrievers on the BrowseComp-Plus benchmark.

Zijian Chen, Xueguang Ma, Shengyao Zhuang + 3 more | 2026-03-06 | cs

SearchGym: A Modular Infrastructure for Cross-Platform Benchmarking and Hybrid Search Orchestration

This paper introduces SearchGym, a modular infrastructure that decouples data, embedding, and retrieval components to enable reproducible cross-platform benchmarking and hybrid search orchestration, revealing that optimal pipeline sequencing depends on filter strength and achieving a 70% Top-100 retrieval rate on the LitSearch benchmark.

Jerome Tze-Hou Hsu | 2026-03-06 | cs

FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents

This paper introduces FinRetrieval, a benchmark evaluating AI agents' ability to retrieve specific financial data from structured databases, revealing that tool availability (particularly structured APIs) is the primary driver of performance while highlighting nuanced impacts of reasoning modes and geographic naming conventions.

Eric Y. Kim, Jie Huang | 2026-03-06 | cs

Signal in the Noise: Decoding the Reality of Airline Service Quality with Large Language Models

This study validates a Large Language Model framework that analyzes over 16,000 unstructured TripAdvisor reviews to uncover critical service quality drivers and a stark post-2022 satisfaction decline for EgyptAir that traditional metrics failed to detect, demonstrating the model's superiority in transforming passenger feedback into actionable strategic intelligence.

Ahmed Dawoud, Osama El-Shamy, Ahmed Habashy | 2026-03-06 | cs

CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

The paper proposes CTRL-RAG, a novel reinforcement learning framework that utilizes a Contrastive Likelihood Reward to optimize the log-likelihood gap between responses with and without supporting evidence, thereby enhancing context faithfulness and mitigating hallucinations in Retrieval-Augmented Generation models.

Zhehao Tan, Yihan Jiao, Dan Yang + 8 more | 2026-03-06 | cs

Semantic Containment as a Fundamental Property of Emergent Misalignment

This paper demonstrates that emergent misalignment in fine-tuned language models arises from semantic triggers alone, causing models to spontaneously compartmentalize harmful behaviors even when trained exclusively on harmful data without any benign contrast.

Rohan Saxena | 2026-03-06 | cs

Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

This paper introduces the "Probing Memes" paradigm, which conceptualizes large language models as collections of cultural genes to replace traditional separate evaluations with an entangled framework that uses a Perception Matrix to analyze model-item interactions, revealing hidden capability structures and enabling population-based behavioral analysis across thousands of models and datasets.

Luzhou Peng, Zhengxin Yang, Honglu Ji + 6 more | 2026-03-06 | cs

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

This paper introduces the HUMAINE framework, which leverages a large-scale, demographically stratified dataset of 23,404 participants to reveal that human preferences for large language models vary significantly across age groups and evaluation dimensions, challenging the validity of current unrepresentative benchmarks.

Nora Petrova, Andrew Gordon, Enzo Blindow | 2026-03-06 | cs

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

This paper introduces SalamahBench, a comprehensive Arabic safety benchmark comprising over 8,000 prompts across 12 categories, to systematically evaluate and reveal significant safety alignment disparities among state-of-the-art Arabic Language Models while highlighting the necessity for specialized, category-aware safeguard mechanisms.

Omar Abdelnasser, Fatemah Alharbi, Khaled Khasawneh + 2 more | 2026-03-06 | cs

One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

The paper introduces DynaKV, a novel post-training framework that dynamically allocates token-wise compression rates based on semantic importance to significantly reduce KV cache memory while maintaining high generation quality, outperforming existing state-of-the-art methods.

Liming Lu, Kaixi Qiu, Jiayu Zhou + 6 more | 2026-03-06 | cs

Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

This paper proposes an additive N-order Markov chain approximation for Large Language Models to mitigate the curse of dimensionality by decomposing token dependencies into superposed historical contributions, thereby establishing an equivalence with step-wise memory functions and introducing the concept of information temperature for these chains.

O. V. Usatenko, S. S. Melnyk, G. M. Pritula | 2026-03-06 | cs

Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

This paper introduces the Inductive Conceptual Rating (ICR), a semiotic-hermeneutic qualitative metric that reveals large language models often achieve high lexical similarity but fail to capture the contextually grounded, emergent meaning of human-generated text summaries, advocating for interpretive evaluation frameworks over traditional statistical metrics.

Natalie Perez, Sreyoshi Bhaduri, Aman Chadha | 2026-03-06 | cs

Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks

This paper proposes RoBERTa-OTA, an ontology-guided architecture that integrates RoBERTa embeddings with Graph Convolutional Networks to significantly improve multiclass hate speech detection accuracy and efficiency by combining contextual language understanding with structured domain knowledge.

Mahmoud Abusaqer, Jamil Saquer | 2026-03-06 | cs

The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

This paper introduces "Dual Tuning," a framework that quantifies the performance gains of reasoning versus direct answering to establish a "Thinking Boundary," thereby challenging the universal application of reasoning and providing data-driven guidance for resource-efficient, adaptive multimodal model training.

Ruobing Zheng, Tianqi Li, Jianing Li + 3 more | 2026-03-06 | cs

Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

This paper proposes a reliability-guided framework that leverages a multi-agent LLM pipeline to generate instance-level trust scores, which then inform a QUBO-based selection process to curate balanced, non-redundant subsets of weak framing signals for robust Arabic sentiment prediction.

Rabab Alkhalifa | 2026-03-06 | cs

Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

This study reveals that LLMs used as automated judges exhibit significant scoring inconsistencies across different models, temperatures, and repeated runs, challenging their reliability for enterprise workflows and highlighting the need for robust monitoring and hybrid evaluation strategies.

Fiona Lau | 2026-03-06 | cs

Context-Dependent Affordance Computation in Vision-Language Models

Through a large-scale study of Qwen-VL and LLaVA-1.5, this paper demonstrates that vision-language models exhibit significant context-dependent affordance drift, where both lexical and semantic outputs vary substantially based on agentic personas, suggesting a need for dynamic, query-dependent ontological projection in robotics rather than static world modeling.

Murad Farzulla | 2026-03-06 | cs

Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

This paper demonstrates that multi-agent clinical diagnosis systems leveraging mixed-vendor large language models significantly outperform single-vendor frameworks by pooling complementary inductive biases to overcome shared failure modes and achieve state-of-the-art diagnostic accuracy.

Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim + 1 more | 2026-03-06 | cs

Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation

This paper addresses the scarcity of high-quality maritime radio data by introducing a compliance-aware Self-Instruct framework enhanced with LoRA fine-tuning and a 26-filter verification pipeline to generate realistic, SMCP-compliant VHF dialogues for AI-assisted safety systems.

Gürsel Akdeniz, Emin Cagatay Nakilcioglu | 2026-03-06 | cs
