Hindsight Credit Assignment for Long-Horizon LLM Agents
The paper introduces HCAPO, a framework for improving long-horizon LLM agents. It uses hindsight reasoning to refine step-level Q-values and applies a multi-scale advantage mechanism to address sparse rewards. The authors report that it significantly outperforms state-of-the-art methods such as GRPO on the WebShop and ALFWorld benchmarks.
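
To make the idea of combining advantages at more than one scale concrete, here is a minimal sketch. It is not the paper's formulation. The group-relative normalization follows the standard GRPO recipe (each trajectory's return minus the group mean, divided by the group standard deviation). The function `multi_scale_advantages`, the `step_weight` parameter, and the additive rule for mixing the two scales are all hypothetical, chosen only for illustration. The step-level Q-values are taken as given inputs, since the summary does not say how hindsight reasoning produces them.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style trajectory advantage: normalize returns within one group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def multi_scale_advantages(returns, step_q_values, step_weight=0.5):
    """Hypothetical mix of a trajectory-level and a step-level signal.

    returns:        one scalar return per trajectory in the group
    step_q_values:  one list of per-step Q-value estimates per trajectory
                    (assumed to be supplied by some hindsight procedure)
    step_weight:    illustrative mixing coefficient, not from the paper
    """
    traj_adv = group_relative_advantages(returns)
    combined = []
    for adv, q_steps in zip(traj_adv, step_q_values):
        # Center each step's Q-value on the trajectory's own mean Q-value,
        # so the step-level term only redistributes credit among steps.
        baseline = mean(q_steps)
        combined.append([adv + step_weight * (q - baseline) for q in q_steps])
    return combined


# Four rollouts of the same task with a sparse 0/1 terminal reward.
returns = [1.0, 0.0, 0.0, 1.0]
step_q_values = [[0.2, 0.8], [0.1, 0.1], [0.5, 0.3], [0.9, 0.7]]
advantages = multi_scale_advantages(returns, step_q_values)
```

With a sparse terminal reward, the trajectory-level term alone gives every step in a rollout the same advantage. In this sketch the step-level term is what lets two steps of the same successful rollout receive different credit.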