Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

This paper proposes a dynamic knowledge fusion framework for multi-domain dialogue state tracking that addresses challenges in modeling dialogue history and data scarcity by using a contrastive learning-based encoder to select relevant slots and leveraging their structured information as contextual prompts to improve tracking accuracy and generalization.

Haoxiang Su, Ruiyu Fang, Liting Jiang, Xiaomeng Huang, Shuangyong Song · Thu, 12 Ma · cs.CL
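The summary above mentions selecting relevant slots by comparing encoder representations. As an illustration only (the paper's actual encoder and training objective are not described here), a minimal sketch of similarity-based slot selection, with made-up slot names and toy embeddings, might look like:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_relevant_slots(context_emb, slot_embs, top_k=2):
    # Rank candidate slots by similarity to the dialogue-context embedding
    # and keep the top-k as candidates for the contextual prompt.
    scored = sorted(
        ((name, cosine(context_emb, emb)) for name, emb in slot_embs.items()),
        key=lambda t: t[1],
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Toy embeddings: slot names and vectors are illustrative, not from the paper.
context = [0.9, 0.1, 0.0]
slots = {
    "hotel-area": [1.0, 0.0, 0.0],
    "train-day": [0.0, 1.0, 0.0],
    "restaurant-food": [0.8, 0.2, 0.1],
}
print(select_relevant_slots(context, slots))  # ['hotel-area', 'restaurant-food']
```

A contrastively trained encoder would be expected to produce embeddings where relevant slots score high and distractors score low under exactly this kind of similarity ranking.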

Aligning Large Language Models with Searcher Preferences

This paper introduces SearchLLM, the first large language model designed for open-ended generative search on platforms like RedNote, which utilizes a hierarchical multi-dimensional reward system and Gated Aggregation Strategy with GRPO to balance safety, factual grounding, and user alignment, resulting in measurable improvements in generation quality and user engagement.

Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong · Thu, 12 Ma · cs.CL
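The summary mentions a "Gated Aggregation Strategy" balancing safety, factual grounding, and user alignment. The paper's actual formulation is not given here; one hypothetical reading (a hard safety gate multiplying a weighted sum of the remaining reward dimensions, with all names and weights invented for illustration) could be sketched as:

```python
def gated_reward(safety, dims, weights):
    # Hypothetical gated aggregation: an unsafe generation scores zero
    # regardless of its quality on the other reward dimensions, while a
    # safe one receives the weighted sum of those dimensions.
    gate = 1.0 if safety >= 0.5 else 0.0
    return gate * sum(weights[k] * dims[k] for k in dims)

weights = {"factual": 0.5, "alignment": 0.5}

# Safe, well-grounded answer: reward flows through the gate.
print(gated_reward(0.9, {"factual": 0.8, "alignment": 0.6}, weights))  # ≈ 0.7

# Unsafe answer: gated to zero even though other dimensions are high.
print(gated_reward(0.2, {"factual": 0.9, "alignment": 0.9}, weights))  # 0.0
```

The design point such a gate illustrates is that safety acts as a veto rather than one more additive term that quality can trade off against.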

Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

The paper introduces PULSE, a medical reasoning agent that integrates a domain-tuned large language model with scientific literature retrieval to achieve expert-competitive diagnostic accuracy across varying disease incidences, while demonstrating both its potential to enhance physician decision-making and the risks of automation bias in collaborative workflows.

Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li · Thu, 12 Ma · cs.CL

VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

The paper introduces VERI-DPO, an evidence-aware alignment framework that leverages claim verification to mine preference pairs for Direct Preference Optimization, significantly reducing unsupported claims and improving the faithfulness of clinical summaries while maintaining informative length.

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin · Thu, 12 Ma · cs.CL
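The DPO objective that the mined preference pairs feed into is standard and publicly documented: the policy is pushed to widen its log-probability margin for the preferred (here, evidence-supported) response over the dispreferred one, relative to a frozen reference model. A minimal per-pair sketch (log-probabilities here are toy values, not from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss for one preference pair:
    #   -log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )
    # logp_* are the policy's sequence log-probs for the preferred (w) and
    # dispreferred (l) responses; ref_logp_* are the reference model's.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the faithful summary relative to the reference:
# positive margin, loss below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # ≈ 0.513

# No preference signal yet: zero margin gives the maximum-entropy loss log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # ≈ 0.693
```

In VERI-DPO's setting, the claim-verification step is what decides which summary of a pair is `w` and which is `l`; the optimization itself is this off-the-shelf objective.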

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

The paper introduces IH-Challenge, a reinforcement learning dataset designed to enhance instruction hierarchy robustness in frontier LLMs, which significantly improves their ability to prioritize instructions against conflicts and adversarial attacks while maintaining helpfulness and minimizing capability regression.

Chuan Guo (Michael Pokorny), Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao · Thu, 12 Ma · cs.AI

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

This paper introduces Group Relative Reward Rescaling (GR³), a novel reinforcement learning method that effectively mitigates length inflation in large language models by reframing length control as a multiplicative rescaling paradigm, thereby achieving lossless optimization and superior performance compared to existing baselines without compromising downstream capabilities.

Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu · Thu, 12 Ma · cs.LG
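The group-relative baseline that GR³ builds on is the well-known GRPO normalization: rewards for a group of sampled responses to the same prompt are standardized against the group mean and standard deviation. A sketch of that standard step, plus one *hypothetical* multiplicative length rescaling in the spirit the summary describes (the paper's actual rescaling function is not specified here):

```python
import math

def group_relative_advantages(rewards):
    # GRPO-style normalization: each response is scored relative to its
    # peers in the same sampled group, (r - mean) / (std + eps).
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def length_rescaled_advantages(rewards, lengths, target_len):
    # Hypothetical multiplicative rescaling: overshooting a target length
    # scales the reward down before group normalization, rather than
    # subtracting an additive length penalty.
    scaled = [r * min(1.0, target_len / l) for r, l in zip(rewards, lengths)]
    return group_relative_advantages(scaled)

# Two correct and two incorrect responses in a group of four.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1.0, -1.0, 1.0, -1.0]
```

The multiplicative form keeps a zero-reward response at zero (no spurious gradient from length alone), which is one intuition for why rescaling can control length without distorting the task reward.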

Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

This paper empirically demonstrates that contrary to the hypothesis that moral reasoning alignment requires diversity-seeking algorithms, standard reward-maximizing RLVR methods are equally or more effective because high-reward moral responses exhibit a concentrated distribution in semantic space similar to logical reasoning tasks.

Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang, Zhiyuan Feng, Yaodong Yang, Xiaoyuan Yi, Xing Xie · Thu, 12 Ma · cs.AI

Emulating Clinician Cognition via Self-Evolving Deep Clinical Research

The paper introduces DxEvolve, a self-evolving diagnostic agent that emulates clinician cognition through an interactive deep clinical research workflow, autonomously requisitioning examinations and externalizing experience to achieve superior diagnostic accuracy and governed continual improvement compared to existing AI models.

Ruiyang Ren, Yuhao Wang, Yunsen Liang, Lan Luo, Jing Liu, Haifeng Wang, Cong Feng, Yinan Zhang, Chunyan Miao, Ji-Rong Wen, Wayne Xin Zhao · Thu, 12 Ma · cs.AI

EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution

This paper introduces EvoSchema, a comprehensive benchmark featuring a novel taxonomy of ten schema perturbation types to evaluate and enhance the robustness of text-to-SQL models against real-world database schema evolution, revealing that table-level changes significantly impact performance and demonstrating that training on diverse schema designs improves model resilience.

Tianshu Zhang, Kun Qian, Siddhartha Sahai, Yuan Tian, Shaddy Garg, Huan Sun, Yunyao Li · Thu, 12 Ma · cs.CL
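The summary describes a taxonomy of schema perturbation types for stress-testing text-to-SQL models. As a concrete illustration of one such perturbation (a column rename; the schema and names below are invented, and the benchmark's ten actual types are not enumerated here):

```python
def rename_column(schema, table, old, new):
    # One illustrative perturbation: rename a column. A text-to-SQL model
    # trained against the original schema must still ground the same
    # natural-language question in the renamed column.
    schema[table] = [new if c == old else c for c in schema[table]]
    return schema

# Toy schema, represented as table -> list of column names.
schema = {"singer": ["singer_id", "name", "country"]}
print(rename_column(schema, "singer", "country", "nation"))
# {'singer': ['singer_id', 'name', 'nation']}
```

Table-level perturbations (splitting or merging tables) change join paths as well as names, which is consistent with the paper's finding that they hurt performance more than column-level edits.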