Designing Experiments to Compare Multi-armed Bandit Algorithms
This paper proposes "Artificial Replay," a new experimental design that reuses rewards recorded from a single policy trajectory to compare multi-armed bandit algorithms. The approach yields unbiased, low-variance comparisons at low cost, significantly reducing the number of user interactions required relative to traditional independent restarts.
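To make the idea of reusing logged rewards concrete, here is a minimal sketch of a replay-style evaluator in the spirit of classical offline replay for bandits. It is an illustration of the general technique, not the paper's exact "Artificial Replay" procedure; the `EpsilonGreedy` algorithm and the `(arm, reward)` log format are assumptions chosen for the example.

```python
import random

class EpsilonGreedy:
    """Toy bandit algorithm used only to exercise the evaluator."""
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore with probability epsilon, otherwise pick the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def replay_evaluate(algorithm, logged_events):
    """Feed logged (arm, reward) events to the algorithm, counting a
    reward only when the algorithm's choice matches the logged arm."""
    total, matched = 0.0, 0
    for logged_arm, reward in logged_events:
        chosen = algorithm.select_arm()
        if chosen == logged_arm:
            algorithm.update(chosen, reward)
            total += reward
            matched += 1
    return total, matched
```

Because every candidate algorithm is evaluated against the same fixed log, comparing two algorithms requires no additional live user interactions; only the one trajectory that produced the log is ever collected.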