Improving Search Agent with One Line of Code

This paper introduces Search Agent Policy Optimization (SAPO), a method that resolves catastrophic training instability in tool-based agentic reinforcement learning by applying a conditional token-level KL constraint to prevent importance-sampling distribution drift, achieving significant performance gains with only a single-line modification to standard GRPO.

Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang · Thu, 12 Ma · 🤖 cs.LG
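
The summary is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of a GRPO-style token loss with a conditional KL penalty that activates only where the importance ratio has drifted; the function name, drift threshold, and KL coefficient are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def grpo_loss_with_conditional_kl(logp_new, logp_old, logp_ref, advantages,
                                  clip_eps=0.2, kl_coef=0.1, drift_thresh=2.0):
    """All inputs are per-token tensors of shape (batch, seq_len): log-probs
    of the sampled tokens under the current, behavior, and reference
    policies, plus group-normalized advantages."""
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages)

    kl = logp_new - logp_ref                                # per-token k1 KL estimate

    # Conceptually the "one line": penalize KL only on tokens whose
    # importance ratio signals distribution drift.
    pg_loss = pg_loss + kl_coef * kl * (ratio > drift_thresh).float()

    return pg_loss.mean()
```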

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

The paper introduces CLIPO, a method that integrates contrastive learning into policy optimization to generalize Reinforcement Learning with Verifiable Rewards (RLVR) by capturing invariant structures across correct reasoning paths, thereby mitigating hallucinations and improving the generalization and robustness of large language models.

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang · Thu, 12 Ma · 🤖 cs.LG
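
One way to read "contrastive learning over correct reasoning paths" is an InfoNCE-style term added to the RLVR policy loss. The pairing scheme and embedding source below are assumptions; the summary does not specify how CLIPO defines positives and negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_term(anchor, positives, negatives, tau=0.07):
    """anchor: (D,) embedding of one correct reasoning path; positives:
    (P, D) other correct paths for the same problem; negatives: (N, D)
    incorrect paths. Multi-positive InfoNCE loss."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_sim = (pos @ a) / tau          # (P,) similarities to correct paths
    neg_sim = (neg @ a) / tau          # (N,) similarities to incorrect paths
    # Pull correct paths together, push incorrect ones away.
    logits = torch.cat([pos_sim, neg_sim])
    return -torch.logsumexp(pos_sim, 0) + torch.logsumexp(logits, 0)

# total_loss = rlvr_policy_loss + lambda_c * contrastive_term(...)  # assumed combination
```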

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

This paper proposes a unified framework for the generation-recognition asymmetry in formal language theory by identifying six distinct dimensions of divergence, challenging the oversimplified view that generation is inherently easy while parsing is hard, and exploring the implications of these operational differences for fields ranging from compiler design to large language models.

Romain Peyrichou · Thu, 12 Ma · 💬 cs.CL
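
A toy illustration of one dimension of the asymmetry (my own example, not one from the paper): sampling a string from a context-free grammar is a cheap top-down walk, while general CFG recognition via CYK costs O(n³).

```python
import random

GRAMMAR = {  # toy grammar in Chomsky normal form: S -> A B | a, A -> a, B -> b
    "S": [["A", "B"], ["a"]],
    "A": [["a"]],
    "B": [["b"]],
}

def generate(symbol="S"):
    """Generation: one random derivation, linear in the derivation size."""
    if symbol not in GRAMMAR:          # terminal symbol
        return symbol
    rhs = random.choice(GRAMMAR[symbol])
    return "".join(generate(s) for s in rhs)

def recognize(w):
    """Recognition via CYK: cubic in len(w) for a grammar in CNF."""
    n = len(w)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][i] = {lhs for lhs, rules in GRAMMAR.items()
                       for r in rules if r == [ch]}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rules in GRAMMAR.items():
                    for r in rules:
                        if len(r) == 2 and r[0] in table[i][k] and r[1] in table[k + 1][j]:
                            table[i][j].add(lhs)
    return "S" in table[0][n - 1]

print(generate(), recognize("ab"), recognize("ba"))  # e.g. 'ab' True False
```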

ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

This paper proposes ReMix, a novel Mixture-of-LoRAs framework that employs non-learnable routing weights and a Reinforce Leave-One-Out (RLOO) gradient estimator to prevent routing imbalance, ensuring that all active LoRAs contribute equally; the method significantly outperforms state-of-the-art parameter-efficient finetuning methods.

Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong · Thu, 12 Ma · 🤖 cs.LG
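
A hypothetical sketch of the two ingredients the summary names: a linear layer whose LoRA experts are mixed with fixed (non-learnable) routing weights, and a Reinforce Leave-One-Out advantage. How ReMix actually wires these together is an assumption here; the fixed-uniform routing is one possible instantiation.

```python
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)          # frozen backbone weight
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        # Non-learnable routing: a fixed mix registered as a buffer, not a Parameter.
        self.register_buffer("route", torch.full((n_experts,), 1.0 / n_experts))

    def forward(self, x):                               # x: (batch, d_in)
        # Sum of routed LoRA deltas: sum_e route[e] * B_e @ A_e @ x
        delta = torch.einsum("e,eor,eri,bi->bo", self.route, self.B, self.A, x)
        return self.base(x) + delta

def rloo_advantages(rewards):
    """rewards: (K,) for K > 1 samples of one prompt; each sample's baseline
    is the mean reward of the other K-1 samples (leave-one-out)."""
    k = rewards.numel()
    return (rewards - rewards.mean()) * k / (k - 1)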

OpenClaw-RL: Train Any Agent Simply by Talking

OpenClaw-RL is an asynchronous framework that enables a single agent policy to continuously improve across diverse interaction domains (such as personal conversations, terminal sessions, and GUI tasks) by learning from universal next-state signals through both scalar rewards and token-level directional advantages derived via Hindsight-Guided On-Policy Distillation.

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang · Thu, 12 Ma · 💬 cs.CL
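
A heavily hedged sketch of one way to combine a trajectory-level scalar advantage with a token-level "directional" distillation signal from a hindsight-conditioned teacher. Both the combination rule and the teacher construction are assumptions; the summary does not specify them.

```python
import torch

def mixed_token_advantages(scalar_adv, logp_student, logp_teacher, beta=0.5):
    """scalar_adv: () per-trajectory advantage; logp_*: (T,) per-token
    log-probs of the taken actions under the student policy and a
    hindsight-guided teacher (e.g., the same model conditioned on the
    observed outcome)."""
    # Positive where the teacher prefers the taken token more than the student did.
    direction = logp_teacher - logp_student
    # Broadcast the scalar over tokens and add the directional term.
    return scalar_adv + beta * direction
```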

Video-Based Reward Modeling for Computer-Use Agents

This paper introduces the Execution Video Reward Model (ExeVRM), a scalable and model-agnostic framework that leverages a new 53k video-task-reward dataset and spatiotemporal token pruning to accurately assess computer-using agent trajectories from execution videos, outperforming leading proprietary models in task success prediction.

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao · Thu, 12 Ma · 💬 cs.CL
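
A hypothetical sketch of spatiotemporal token pruning: keep only the patch tokens that change most between consecutive frames, since static screen regions carry little evidence about task progress. The motion score and keep ratio are assumptions, not ExeVRM's actual criterion.

```python
import torch

def prune_video_tokens(frames, keep_ratio=0.25):
    """frames: (T, N, D) patch embeddings for T frames of N tokens each.
    Returns (T-1, K, D): per frame, the K tokens with the largest
    frame-to-frame change."""
    diff = (frames[1:] - frames[:-1]).norm(dim=-1)      # (T-1, N) motion score
    k = max(1, int(keep_ratio * frames.shape[1]))
    idx = diff.topk(k, dim=-1).indices                  # (T-1, k) kept positions
    return torch.gather(frames[1:], 1,
                        idx.unsqueeze(-1).expand(-1, -1, frames.shape[2]))
```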

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

This paper introduces Adaptive Activation Cancellation (AAC), a real-time, training-free inference framework that mitigates hallucinations in large language models by identifying and suppressing hallucination-associated neural activations as structured interference, thereby improving factual accuracy across multiple model scales without degrading general capabilities or fluency.

Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge · Thu, 12 Ma · 💬 cs.CL
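
The inference-time mechanics of "suppressing hallucination-associated activations" can be sketched with a standard PyTorch forward hook. How AAC identifies the flagged channels adaptively is not shown here; the hook body, dampening factor, and layer path are assumptions.

```python
def make_cancellation_hook(channel_ids, alpha=1.0):
    """Returns a forward hook that dampens the flagged activation channels:
    alpha=1.0 cancels them entirely, smaller alpha only attenuates."""
    def hook(module, inputs, output):
        output = output.clone()                   # avoid in-place autograd issues
        output[..., channel_ids] *= (1.0 - alpha)
        return output                             # returned tensor replaces the output
    return hook

# Usage (hypothetical layer path and channel ids; varies by architecture):
# layer = model.model.layers[17].mlp
# handle = layer.register_forward_hook(make_cancellation_hook([101, 452, 983]))
# ... run generation, then handle.remove()
```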

Sabiá-4 Technical Report

This technical report introduces Sabiá-4 and Sabiazinho-4, a new generation of Brazilian Portuguese language models featuring a 128K-token context window and specialized training in legal and agentic tasks, which demonstrate superior cost-performance and capabilities in legal drafting, dialogue, and tool use compared to previous generations.

Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau, Celio Larcher, Ramon Pires, Rodrigo Nogueira · Thu, 12 Ma · 💬 cs.CL

Large language models can disambiguate opioid slang on social media

This paper demonstrates that large language models significantly outperform traditional lexicon-based strategies at disambiguating opioid slang and identifying relevant social media posts across lexicon-based, lexicon-free, and emergent-slang scenarios, thereby enhancing monitoring of the opioid overdose crisis.

Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang, Keith Humphreys, Anna Lembke, Johannes C. Eichstaedt, Russ B. Altman · Thu, 12 Ma · 💬 cs.CL
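
An illustrative sketch of the two strategies being compared: context-blind lexicon matching versus asking an LLM to resolve the context. The lexicon entries, prompt wording, and `query_llm` helper are hypothetical, not the paper's materials.

```python
import re

LEXICON = {"roxy", "perc", "blues"}   # toy slang lexicon (assumed entries)

def lexicon_flag(post):
    """Lexicon strategy: fires on any matching term, so 'singing the blues'
    and 'blues' used as oxycodone slang look identical."""
    tokens = set(re.findall(r"[a-z']+", post.lower()))
    return bool(tokens & LEXICON)

def llm_flag(post, query_llm):
    """LLM strategy: the model sees the whole post and can disambiguate."""
    prompt = (
        "Does the following social media post refer to opioid use or misuse? "
        "Answer yes or no.\n\nPost: " + post
    )
    return query_llm(prompt).strip().lower().startswith("yes")
```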