HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

The paper introduces HypoSpace, a diagnostic suite that evaluates large language models as set-valued hypothesis generators in underdetermined scientific domains by measuring validity, uniqueness, and recovery to reveal mode collapse that traditional correctness-only metrics fail to detect.
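
A minimal sketch of what such set-valued scoring might look like, assuming simplified definitions (validity as membership in an enumerable admissible set, uniqueness as the deduplication rate among valid outputs, recovery as coverage of the admissible set); the paper's exact formulations may differ:

```python
# Hedged sketch: set-level metrics over generated hypotheses. The definitions
# below are illustrative assumptions, not HypoSpace's exact formulas.

def hypothesis_set_metrics(generated: list[str], admissible: set[str]) -> dict[str, float]:
    valid = [h for h in generated if h in admissible]   # consistent with the observations
    distinct_valid = set(valid)
    return {
        # share of generations that are admissible at all
        "validity": len(valid) / len(generated) if generated else 0.0,
        # share of valid generations that are not duplicates (low => mode collapse)
        "uniqueness": len(distinct_valid) / len(valid) if valid else 0.0,
        # share of the full admissible set the model actually produced
        "recovery": len(distinct_valid) / len(admissible) if admissible else 0.0,
    }

# A model that keeps repeating one correct hypothesis scores well on validity
# but poorly on uniqueness and recovery, which is what exposes the collapse.
print(hypothesis_set_metrics(["h1", "h1", "h1", "h4"], {"h1", "h2", "h3"}))
```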

Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Anirudh Goyal, Yew-Soon Ong, Dianbo Liu · Tue, 10 Ma · cs.CL

FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

The paper introduces FOR-Prompting, a model-agnostic, asymmetric prompting protocol that enhances reasoning and iterative refinement across diverse tasks by structuring interactions between a Defender, a Questioner, and an optional Host, enabling even small models to achieve performance comparable to or better than standard baselines without requiring training or access to model internals.
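
A rough sketch of the protocol's control flow, assuming a generic `llm(prompt) -> str` callable and illustrative role prompts (the paper's actual prompts and stopping rule are not reproduced here):

```python
# Hedged sketch of a FOR-Prompting-style objection/revision loop. Role prompts,
# the stop condition, and `llm` are illustrative assumptions.

def for_prompting(task: str, llm, max_rounds: int = 3) -> str:
    answer = llm(f"You are the Defender. Solve the task:\n{task}")
    for _ in range(max_rounds):
        objection = llm(
            "You are the Questioner. Raise one concrete objection to the answer, "
            f"or reply NO OBJECTION.\nTask: {task}\nAnswer: {answer}"
        )
        if "NO OBJECTION" in objection.upper():
            break  # Questioner is satisfied; no further revision needed
        answer = llm(
            "You are the Defender. Revise the answer to address the objection.\n"
            f"Task: {task}\nAnswer: {answer}\nObjection: {objection}"
        )
    return answer  # an optional Host could summarize or adjudicate here
```

On one plausible reading, the asymmetry is that the Questioner never proposes answers, only objections, which keeps its role cheap enough for small models.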

He Zhang, Anzhou Zhang, Jian Dai · Tue, 10 Ma · cs.CL

A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models

This paper introduces MeRF, a method that enhances reinforcement finetuning of large reasoning models by injecting reward specifications directly into prompts as "motivation," thereby leveraging in-context learning to align generation with optimization objectives and achieve substantial performance gains over standard RLVR baselines.
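
In code, the core idea reduces to prompt construction; the reward wording below is an illustrative assumption rather than the paper's template:

```python
# Hedged sketch: surface the RLVR reward specification in-context as "motivation".

REWARD_SPEC = (
    "Reward: +1 if the final answer in \\boxed{} is correct, "
    "-1 if it is wrong, 0 if no \\boxed{} answer is given."
)

def merf_prompt(question: str) -> str:
    # The policy sees the objective it is being optimized for, so in-context
    # learning can pull generation toward what the verifier will actually score.
    return f"{REWARD_SPEC}\n\nQuestion: {question}\nThink step by step."
```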

Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao · Tue, 10 Ma · cs.CL

SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving

SwingArena is a competitive evaluation framework that simulates real-world software development workflows by pairing LLMs as patch submitters and reviewers within a CI-driven pipeline, utilizing a retrieval-augmented code generation module to effectively solve long-context GitHub issues across multiple programming languages.
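
One way the submitter/reviewer duel might be wired up, with `retrieve_context` and `run_ci` as hypothetical stand-ins for the retrieval module and the CI pipeline:

```python
# Hedged sketch of a single SwingArena-style round; all function names and the
# win condition are assumptions for illustration.

def swing_round(issue, repo, submitter, reviewer, retrieve_context, run_ci):
    context = retrieve_context(repo, issue)        # retrieval-augmented generation
    patch = submitter(
        f"Issue:\n{issue}\n\nRelevant code:\n{context}\n\nWrite a patch."
    )
    tests = reviewer(
        f"Issue:\n{issue}\n\nPatch under review:\n{patch}\n\nWrite tests it must pass."
    )
    passed = run_ci(repo, patch, tests)            # CI arbitrates the match
    return "submitter" if passed else "reviewer"   # round winner
```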

Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang, Hui Shen, Zhongwei Wan, Jianbo Dai, Taiqiang Wu, He Xiao, Chaofan Tao, Z. Morley Mao, Ying Sheng, Zhijiang Guo, Hongxia Yang, Bei Yu, Lingpeng Kong, Quanquan Gu, Ngai Wong · Tue, 10 Ma · cs.CL

HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

This paper proposes HaLoRA, a hardware-aware low-rank adaptation method that enhances the robustness of Large Language Models deployed on hybrid Compute-in-Memory architectures by training noise-resilient LoRA branches on SRAM while storing pretrained weights on noisy RRAM, achieving significant energy efficiency without compromising accuracy and improving performance by up to 22.7 points.
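
A toy PyTorch rendering of that split, assuming a multiplicative Gaussian model for RRAM noise (the paper's device model and training objective are not reproduced here):

```python
import torch
import torch.nn as nn

# Hedged sketch: the frozen base weight sees simulated RRAM noise during
# training, while the trainable low-rank branch is kept noise-free (SRAM).
# The noise model and hyperparameters are assumptions, not HaLoRA's.

class NoisyLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, noise_std: float = 0.02):
        super().__init__()
        self.base = base.requires_grad_(False)      # pretrained weights on RRAM
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.noise_std = noise_std

    def forward(self, x):
        w = self.base.weight
        if self.training:                           # expose the branch to device noise
            w = w + torch.randn_like(w) * self.noise_std * w.abs()
        y = nn.functional.linear(x, w, self.base.bias)
        return y + x @ self.A.T @ self.B.T          # noise-free LoRA path (SRAM)
```

Training against the noisy base weight is what would push the LoRA branch to compensate for device non-idealities rather than overfit to a clean forward pass.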

Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Wendong Xu, Chufan Shi, Zhengwu Liu, Ngai Wong · Tue, 10 Ma · cs.CL

Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck

This paper introduces a Discrete Key-Value Bottleneck (DKVB) for encoder-only small language models that enables efficient continual learning by alleviating catastrophic forgetting through localized updates and task-independent initialization, achieving competitive performance with lower computational costs even in challenging single-head scenarios.
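
A compact sketch of the bottleneck itself, with frozen random keys and trainable values; the dimensions and the nearest-key rule are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hedged sketch: discrete key-value bottleneck over a frozen encoder's outputs.

class DiscreteKVBottleneck(nn.Module):
    def __init__(self, dim: int = 768, num_pairs: int = 4096, num_classes: int = 2):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_pairs, dim), requires_grad=False)
        self.values = nn.Parameter(torch.zeros(num_pairs, num_classes))  # trainable

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Snap each representation to its nearest key; gradients then update
        # only the selected values, so learning stays localized and earlier
        # tasks are largely untouched (less catastrophic forgetting).
        idx = torch.cdist(h, self.keys).argmin(dim=-1)   # (batch,)
        return self.values[idx]                          # (batch, num_classes)
```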

Andor Diera, Lukas Galke, Fabian Karl, Ansgar Scherp · Tue, 10 Ma · cs.CL