Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
This paper introduces VIP, a Variance-Informed Predictive allocation strategy for online reinforcement learning with verifiable rewards. VIP dynamically distributes the rollout budget across training prompts, using Gaussian process-based variance estimation to choose an allocation that minimizes gradient variance and thereby improves sampling efficiency.
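
To make the idea of variance-informed allocation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes binary verifiable rewards, so that a prompt with predicted success probability p has reward variance p(1 - p), and it swaps in a classical Neyman-style rule that gives each prompt rollouts in proportion to its predicted reward standard deviation. The function name `allocate_rollouts`, the `min_per_prompt` floor, and the input probabilities are all hypothetical; in VIP the predictions would come from the Gaussian process model, which is not reproduced here.

```python
import math


def allocate_rollouts(success_probs, total_budget, min_per_prompt=1):
    """Split a rollout budget across prompts by predicted reward std.

    Assumption (not from the paper): rewards are Bernoulli, so the
    per-prompt variance is p * (1 - p), and the budget is split
    proportionally to the standard deviation (Neyman-style allocation).
    """
    n = len(success_probs)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget is smaller than the per-prompt minimum")

    # Predicted reward standard deviation for each prompt.
    stds = [math.sqrt(p * (1.0 - p)) for p in success_probs]
    total_std = sum(stds)

    # Budget left after every prompt receives its guaranteed minimum.
    spare = total_budget - n * min_per_prompt

    # Fall back to a uniform split when every prompt looks deterministic.
    if total_std == 0.0:
        weights = [1.0 / n] * n
    else:
        weights = [s / total_std for s in stds]

    # Real-valued ideal shares, rounded down to whole rollouts.
    ideal = [w * spare for w in weights]
    alloc = [min_per_prompt + math.floor(x) for x in ideal]

    # Hand out rollouts lost to rounding, largest fractional part first.
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: ideal[i] - math.floor(ideal[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical predicted success probabilities for four prompts.
    print(allocate_rollouts([0.5, 0.5, 0.0, 1.0], total_budget=12))
```

Under these assumptions, prompts whose outcome is predicted to be nearly certain (p close to 0 or 1) receive only the minimum number of rollouts, while the budget concentrates on prompts with uncertain outcomes, where additional samples reduce the variance of the estimate the most.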