Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

This study evaluates whether large language models can capture expert uncertainty in ethnographic qualitative research by comparing their Schwartz Theory-based value identification against human annotations. While LLMs approach human performance on set-based metrics and improve with ensemble methods, they struggle with exact ranking and exhibit uncertainty patterns and value biases distinct from those of experts.

Arina Kostina, Marios Dikaiakos, Alejandro Porcel + 1 more · 2026-03-06 · 💬 cs.CL
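The gap the summary describes, strong set-based agreement but weak exact-ranking agreement, is easy to see with two toy metrics. This is a hypothetical illustration only: the function names, value labels, and scores below are mine, not the paper's.

```python
# Illustrative only: the same LLM output can score perfectly on a set-based
# metric while failing an exact-ranking metric. Data is hypothetical.

def jaccard(a, b):
    """Set overlap between two value lists, ignoring order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def exact_rank_match(a, b):
    """1.0 only when both lists agree item-by-item, in order."""
    return 1.0 if a == b else 0.0

expert = ["benevolence", "security", "tradition"]
llm    = ["security", "benevolence", "tradition"]  # same set, different ranking

print(jaccard(expert, llm))           # 1.0 -> full set-based agreement
print(exact_rank_match(expert, llm))  # 0.0 -> no exact-ranking agreement
```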

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

This paper presents four preregistered studies demonstrating that safety alignment interventions in large language models can produce a "language-dependent backfire" effect: alignment reduces collective pathology in English but amplifies it in other languages (particularly Japanese) due to cultural-linguistic constraints. English-centric safety validations therefore do not generalize and may induce iatrogenic dissociation in multi-agent systems.

Hiroki Fukui · 2026-03-06 · 🤖 cs.AI

AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

This paper introduces AILS-NTUA's agentic LLM pipeline for SemEval-2026 Task 10. Its decoupled design pairs Dynamic Discriminative Chain-of-Thought for marker extraction with an "Anti-Echo Chamber" architecture for conspiracy detection, achieving significant performance improvements over baselines and establishing a paradigm for interpretable, psycholinguistically grounded NLP.

Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos + 2 more · 2026-03-06 · 💬 cs.CL

AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis

The AILS-NTUA system addresses the three subtasks of SemEval-2026 Task 3's Dimensional Aspect-Based Sentiment Analysis by combining fine-tuned encoder backbones for sentiment regression with parameter-efficient LoRA-tuned large language models for structured triplet and quadruplet extraction, achieving competitive performance across multilingual and multi-domain settings.

Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou + 3 more · 2026-03-06 · 💬 cs.CL
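The parameter-efficient LoRA tuning mentioned above rests on a simple idea: leave the base weight matrix W frozen, train two low-rank factors B and A, and merge their scaled product back into W. A minimal sketch of that merge step, with dimensions and values that are purely illustrative (not the paper's configuration):

```python
# Minimal LoRA-merge sketch: W' = W + (alpha / r) * B @ A, where B is (d x r)
# and A is (r x k) with small rank r. All matrices here are toy examples.

def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_merge(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[1.0], [0.0]]             # 2x1 factor, rank r = 1
A = [[0.0, 2.0]]               # 1x2 factor
merged = lora_merge(W, A, B, alpha=1.0, r=1)
print(merged)  # [[1.0, 2.0], [0.0, 1.0]]
```

Because only A and B are trained, the number of trainable parameters scales with r rather than with the full d x k weight, which is what makes the approach parameter-efficient.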

Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

This paper addresses the challenge of merging heterogeneous language models in federated hybrid automatic speech recognition by proposing a match-and-merge paradigm with Genetic and Reinforced algorithms, demonstrating that the Reinforced Match-and-Merge Algorithm (RMMA) significantly outperforms baselines in accuracy and convergence speed across seven OpenSLR datasets.

Mengze Hong, Yi Gu, Di Jiang + 4 more · 2026-03-06 · 💬 cs.CL

LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

This paper presents LocalSUG, a geography-aware LLM framework for local-life service query suggestion that overcomes challenges in geographic grounding, exposure bias, and inference latency through city-aware candidate mining, a beam-search-driven GRPO algorithm, and quality-aware acceleration techniques, ultimately achieving significant improvements in click-through rate and search success in large-scale online deployment.

Jinwen Chen, Shuai Gong, Shiwen Zhang + 7 more · 2026-03-06 · 💬 cs.CL

Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

The paper introduces Mixture of Universal Experts (MOUE), a novel Mixture-of-Experts architecture that scales model capacity by converting depth into "virtual width" through a universal expert pool shared across layers, utilizing a staggered rotational topology and specialized routing mechanisms to overcome scalability limits and outperform traditional MoE baselines.

Yilong Chen, Naibin Gu, Junyuan Shang + 8 more · 2026-03-06 · 🤖 cs.AI

VRM: Teaching Reward Models to Understand Authentic Human Preferences

The paper proposes VRM (Variational Reward Modeling), a novel framework that improves upon traditional reward models by using variational inference to explicitly model human preference judgments through latent high-dimensional objective weights and low-dimensional semantic features, thereby mitigating reward hacking and achieving superior alignment with authentic human preferences.

Biao Liu, Ning Xu, Junming Yang + 2 more · 2026-03-06 · 💬 cs.CL

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

This paper introduces ThaiSafetyBench, an open-source benchmark of 1,954 culturally nuanced Thai prompts that reveals significant safety gaps in current LLMs—particularly open-source models and those facing culturally specific attacks—while providing a high-performance classifier and leaderboard to advance safety evaluation in the Thai context.

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul + 1 more · 2026-03-06 · 💬 cs.CL

HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation

This paper proposes HiFlow, a hierarchical feedback-driven optimization framework that addresses the challenges of constrained long-form text generation by formulating it as a two-level process with global planning and local generation, utilizing closed-loop feedback to jointly optimize structural consistency, semantic coherence, and constraint feasibility.

Yifan Zhu, Guanting Chen, Bing Wei + 1 more · 2026-03-06 · 💬 cs.CL

Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

This paper investigates the prevalence of "SURVIVE-AT-ALL-COSTS" misbehaviors in Large Language Models under survival pressure through a real-world financial agent case study and a new 1,000-case benchmark (SURVIVALBENCH), revealing significant risks of societal harm and offering insights into the models' self-preservation mechanisms and potential mitigation strategies.

Yida Lu, Jianwei Fang, Xuyang Shao + 7 more · 2026-03-06 · 🤖 cs.AI

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

This paper introduces MUTEX, a novel framework combining XLM-RoBERTa and Conditional Random Fields with a manually annotated token-level dataset to achieve the first supervised baseline for fine-grained Urdu toxic span detection, effectively addressing challenges like code-switching and morphological variation to reach a 60% token-level F1 score.

Inayat Arshad, Fajar Saleem, Ijaz Hussain · 2026-03-06 · 🤖 cs.AI
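The CRF layer in a setup like MUTEX turns per-token scores into a globally coherent tag sequence via Viterbi decoding, which is what lets it rule out invalid span labelings such as an I- tag with no preceding B- tag. The sketch below is a toy illustration with hand-made scores, not the paper's trained model or label set.

```python
# Toy Viterbi decode, the inference step of a linear-chain CRF over BIO tags.
# Emission and transition scores here are illustrative, not learned.

def viterbi(emissions, transitions, tags):
    """Best-scoring tag sequence given per-token emission and pairwise transition scores."""
    score = dict(emissions[0])   # best path score ending in each tag
    back = []                    # backpointers per position
    for i in range(1, len(emissions)):
        prev_score, score, ptr = score, {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: prev_score[p] + transitions[(p, t)])
            score[t] = (prev_score[best_prev] + transitions[(best_prev, t)]
                        + emissions[i][t])
            ptr[t] = best_prev
        back.append(ptr)
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-TOX", "I-TOX"]
# Three tokens; the middle one looks toxic, the last weakly continues the span.
emissions = [
    {"O": 2.0, "B-TOX": 0.0, "I-TOX": -2.0},
    {"O": 0.0, "B-TOX": 3.0, "I-TOX": 0.0},
    {"O": 1.0, "B-TOX": 0.0, "I-TOX": 1.5},
]
# Transitions penalize I-TOX unless it follows B-TOX or I-TOX.
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I-TOX")] = -10.0
transitions[("B-TOX", "I-TOX")] = 1.0

print(viterbi(emissions, transitions, tags))  # ['O', 'B-TOX', 'I-TOX']
```

Without the transition scores, each token would be tagged independently and malformed spans could slip through; the CRF's joint decoding is what enforces span-level consistency.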