Relating Reinforcement Learning to Dynamic Programming-Based Planning
This paper bridges the gap between dynamic programming-based planning and reinforcement learning. It develops a derandomized variant of RL, mathematically analyzes the conditions under which the two formulations are equivalent despite their differing conventions (cost minimization versus reward maximization, and termination at a goal state versus infinite-horizon discounting), and advocates optimizing the true cost of a solution rather than arbitrary surrogate parameters.
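One of the equivalences in question, between cost minimization and reward maximization, can be sketched concretely. The toy MDP below (my own illustrative example, not taken from the paper) runs value iteration twice on the same deterministic transitions: once minimizing costs, once maximizing the negated costs as rewards. The resulting value functions are negatives of each other and the optimal policies coincide.

```python
import numpy as np

# Hypothetical 3-state deterministic MDP; state 2 is an absorbing goal.
n_states, n_actions = 3, 2
# next_state[s, a]: successor of state s under action a (toy transitions)
next_state = np.array([[1, 2],
                       [2, 0],
                       [2, 2]])
# cost[s, a]: stage cost; the goal state is free to remain in
cost = np.array([[1.0, 5.0],
                 [1.0, 2.0],
                 [0.0, 0.0]])
gamma = 0.95  # discount factor for the infinite-horizon formulation

def value_iteration(stage, extremize, argext, iters=500):
    """Generic value iteration: extremize is np.min for costs, np.max for rewards."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = stage + gamma * V[next_state]  # Q[s, a] = stage value + discounted successor value
        V = extremize(Q, axis=1)
    policy = argext(stage + gamma * V[next_state], axis=1)
    return V, policy

# Cost-minimization view vs. reward-maximization view with r = -c.
V_cost, pi_cost = value_iteration(cost, np.min, np.argmin)
V_rew, pi_rew = value_iteration(-cost, np.max, np.argmax)

print(np.allclose(V_rew, -V_cost))      # value functions are sign-flipped copies
print(np.array_equal(pi_cost, pi_rew))  # identical optimal policies
```

The identity rests on max_a(-Q) = -min_a(Q) and argmax_a(-Q) = argmin_a(Q), which holds for any finite MDP; the subtler equivalence the paper analyzes, between goal termination and infinite-horizon discounting, does not reduce to a sign flip.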