Improving Search Agent with One Line of Code
This paper introduces Search Agent Policy Optimization (SAPO), a method that resolves catastrophic training instability in tool-based agentic reinforcement learning. By applying a conditional token-level KL constraint that prevents Importance Sampling Distribution Drift, SAPO achieves significant performance gains while modifying only a single line of code in standard GRPO.
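To make the idea concrete, here is a minimal sketch of what a conditional token-level KL term added to a GRPO-style per-token loss could look like. The specific condition used below (penalizing only tokens whose importance-sampling ratio has drifted beyond a threshold) and all names and parameters are illustrative assumptions, not the paper's actual criterion or code:

```python
import numpy as np

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    kl_coef=0.1, drift_threshold=2.0, clip_eps=0.2):
    """GRPO-style per-token loss with an illustrative *conditional* KL penalty.

    The drift condition and constants here are assumptions for
    illustration; they are not SAPO's exact formulation.
    """
    ratio = np.exp(logp_new - logp_old)               # importance-sampling ratio
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    pg_loss = -np.minimum(ratio * advantages, clipped * advantages)

    # Simple token-level KL estimate against the reference policy.
    kl = logp_new - logp_ref

    # The "one line": apply the KL penalty only on tokens whose
    # importance ratio has drifted beyond the threshold.
    mask = (ratio > drift_threshold) | (ratio < 1.0 / drift_threshold)

    return pg_loss + kl_coef * mask * kl
```

Tokens with a stable importance ratio are left with the plain clipped policy-gradient loss, so the constraint only activates where distribution drift would otherwise destabilize training.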