Imagine you are trying to teach a robot how to navigate a maze to find a treasure. There are two main schools of thought on how to do this, and this paper is like a translator trying to get them to speak the same language.
The Two Schools of Thought
1. The Engineer's Approach (Planning)
Think of this as a GPS navigation system.
- How it works: You give the GPS a perfect map of the city. It knows exactly where the roads are, where the traffic lights are, and how long every street takes. It calculates the absolute shortest, fastest route before the car even starts moving.
- The Goal: Minimize "cost" (time, gas, money).
- The Vibe: Logical, precise, and deterministic. If you take the same route twice, you get the exact same result.
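The "GPS with a perfect map" idea can be sketched in a few lines. Below is a minimal, illustrative Dijkstra's-algorithm implementation on a tiny grid maze; the maze layout and the cost of 1 per step are assumptions for the example, not values from the paper.

```python
# The "Engineer's Approach": given a full map, compute the cheapest route
# before moving. Classic Dijkstra on a grid, each step costing 1 unit.
import heapq

def dijkstra(maze, start, goal):
    """Return the minimum number of steps from start to goal (or None)."""
    rows, cols = len(maze), len(maze[0])
    dist = {start: 0}
    frontier = [(0, start)]
    while frontier:
        d, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry, skip it
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] == 0:
                nd = d + 1  # every step costs 1 (time/energy)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(frontier, (nd, (nr, nc)))
    return None

maze = [            # 0 = open floor, 1 = wall
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(dijkstra(maze, (0, 0), (2, 0)))  # → 6, the shortest route around the wall
```

Note the deterministic vibe: run it twice with the same map and you get the exact same answer.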
2. The Biologist's Approach (Reinforcement Learning - RL)
Think of this as training a dog.
- How it works: You don't give the dog a map. You just put it in the maze. If it hits a wall, it gets a "zap" (negative reward). If it reaches the treasure, it gets a treat (positive reward). Over thousands of tries, the dog learns which turns lead to treats and which lead to zaps.
- The Goal: Maximize "reward" (treats).
- The Vibe: Experimental, messy, and stochastic (random). The dog might take a wrong turn today but learn from it tomorrow. It often uses "discounting," which is like telling the dog, "A treat you get right now is worth more than a treat you might get 10 minutes from now."
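The "dog training" loop can also be sketched concretely. Here is a minimal tabular Q-learning example on a one-dimensional corridor; the corridor length, the reward values, and the hyperparameters (learning rate, discount, exploration rate) are illustrative assumptions, not values from the paper.

```python
# The "Biologist's Approach": no map, just trial, error, and rewards.
import random

random.seed(0)
N = 5                                # states 0..4, the "treat" is at state 4
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != N - 1:
        # epsilon-greedy: mostly exploit what we know, sometimes explore
        if random.random() < eps:
            a = random.choice((-1, +1))
        else:
            a = max((-1, +1), key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), N - 1)
        r = 1.0 if s2 == N - 1 else 0.0          # treat only at the end
        best_next = max(Q[(s2, -1)], Q[(s2, +1)])
        # the core update: nudge Q toward (reward + discounted future value)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After thousands of tries, the greedy policy walks straight to the treat.
policy = [max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(N - 1)]
print(policy)  # → [1, 1, 1, 1]: always step toward the treat
```

Notice the `gamma` term: that is the discounting discussed above, shrinking the value of treats that are many steps away.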
The Problem
For a long time, these two groups (Engineers and Biologists) didn't talk much. They used different math, different goals, and different assumptions. The paper argues that they are actually trying to solve the same problem, just with different tools. The authors want to bridge the gap so we can use the best of both worlds.
The Three Big Ideas in the Paper
1. The "Derandomized" Robot (Making RL behave like a Planner)
The authors created a special version of the "dog training" method where the robot is super disciplined.
- The Analogy: Imagine a dog that is forced to try every possible path in the maze exactly once, in a specific order, without getting distracted. It's not guessing; it's methodically exploring.
- The Result: They found that if you remove the randomness from RL, it behaves almost exactly like the classic "GPS" algorithms (like Dijkstra's algorithm). It's just as fast and finds the same perfect path. This proves that at their core, both methods are doing the same math.
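One way to see this equivalence in miniature: if you strip Q-learning of its randomness (no exploration noise, learning rate of 1, sweeping every state in a fixed order), the update collapses into textbook value iteration, which computes the same shortest-path costs Dijkstra would. The 4-state chain below is an illustrative assumption, not the paper's setup.

```python
# "Derandomized RL": a systematic sweep with no guessing is just value
# iteration, converging to the same cost-to-go that Dijkstra reports.
N = 4                        # states 0..3, goal at state 3
V = [0.0] * N                # cost-to-go estimates, initially zero

def neighbors(s):
    # deterministic moves: step left or right, clipped to the chain
    return [max(s - 1, 0), min(s + 1, N - 1)]

for sweep in range(N):       # N sweeps suffice on a chain of N states
    for s in range(N):
        if s == N - 1:
            continue         # the goal costs nothing
        # Bellman update: one step of cost 1, then the cheapest neighbor
        V[s] = 1 + min(V[s2] for s2 in neighbors(s))

print(V)  # → [3.0, 2.0, 1.0, 0.0]: exact shortest-path distances to the goal
```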
2. The Danger of "Discounting" (The "I'll do it tomorrow" trap)
In standard RL, we use a "discount factor." This is like saying, "Future rewards are worth less than immediate ones."
- The Analogy: Imagine you are trying to lose weight. If you have a discount factor, your brain might say, "Eating a salad today is great, but eating a salad next year doesn't matter as much." So, you might choose to eat a donut today because the "future health" reward feels too far away to care about.
- The Paper's Warning: In a maze, this can be dangerous. With heavy discounting, the reward for reaching the exit shrinks toward zero the further away it is, so the robot can get stuck in a loop (a cycle): circling forever can look almost as good as actually finding the way out. The paper argues that for robotics and engineering, we should stop using these arbitrary discounts and focus on "True Cost" (actual time or energy). If the goal is to reach the exit, the robot should care about the exit, not just the next step.
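A back-of-the-envelope calculation makes the trap concrete. The numbers here are illustrative assumptions: a goal worth 100 that is 50 steps away, versus a small looping reward of 1 per step forever.

```python
# The discounting trap: a big but distant goal vs. a tiny immediate loop.
gamma = 0.9

# Discounted value of the distant exit: gamma^50 * 100
goal_value = gamma ** 50 * 100

# Discounted value of looping forever for +1 per step: geometric series
# 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)
loop_value = 1 / (1 - gamma)

print(goal_value)  # ≈ 0.52 -- the exit is nearly invisible to the robot
print(loop_value)  # ≈ 10.0 -- running in circles "wins"
```

With these numbers the loop looks roughly twenty times more valuable than the exit, which is exactly the pathology the paper warns about.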
3. The "Reset Button" (Episodes vs. One-Shot)
RL usually works in "episodes." The robot tries to solve the maze, hits the goal, gets teleported back to the start, and tries again.
- The Analogy: It's like playing a video game level over and over.
- The Finding: The paper shows that if you set up the "teleporting" and the "bonus points" correctly, this endless loop of playing the game is mathematically equivalent to solving the maze just once to get to the goal. This means we can use the powerful "game-playing" AI techniques to solve single-shot engineering problems, provided we tune the rules right.
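A tiny sketch of why the bookkeeping works out: if every step costs -1 and the episode ends (the "teleport" happens) exactly when the goal is reached, then one episode's total reward is just minus the path length, so maximizing episodic reward is the same as minimizing one-shot travel cost. The 5-step corridor is an illustrative assumption.

```python
# Episodes vs. one-shot: with -1 per step and termination at the goal,
# the episodic return IS the (negated) single-shot cost.

def run_episode(policy, start=0, goal=5):
    """Follow `policy` (state -> step) until the goal; return total reward."""
    s, total = start, 0
    while s != goal:
        s = max(0, min(goal, s + policy(s)))
        total -= 1          # every step costs one unit of time/energy
    return total            # episode ends here: this is where the reset happens

always_right = lambda s: +1
print(run_episode(always_right))  # → -5: minus the shortest-path length
```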
The Experiments: The Race
The authors ran thousands of simulations on grid-based mazes (like a giant chessboard).
- The Contenders: They pitted the "GPS" (Value Iteration/Dijkstra) against the "Dog Trainer" (Q-Learning).
- The Outcome:
- Speed: The "GPS" (Planning) was almost always much faster (sometimes 100x faster) than the "Dog Trainer" (RL). This makes sense; the GPS has the map, while the dog has to learn by trial and error.
- The Sweet Spot: However, the "Dog Trainer" could still find the right path if you tuned its "exploration rate" (how often it tries a random move instead of sticking to what it knows) and its "learning rate" (how fast it updates its memory) just right.
- Stochasticity: When they added "fog" to the maze (making the robot's movements slightly random, like a slippery floor), the GPS still worked well, but the Dog Trainer struggled more, needing even more careful tuning to avoid getting lost.
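The "slippery floor" is easy to model: with some probability a commanded step goes the wrong way. Planning copes by simply averaging over outcomes, as in the expected-cost iteration below. The chain length and slip probability are illustrative assumptions, not the paper's experimental settings.

```python
# Stochastic "fog": commanding a step right succeeds with prob 1 - slip,
# otherwise the robot slides left. Value iteration averages over outcomes.
N, slip = 5, 0.2             # states 0..4, goal at state 4
V = [0.0] * N                # expected steps-to-goal, initially zero

for _ in range(100):         # iterate the expected-cost update to convergence
    for s in range(N - 1):
        right = min(s + 1, N - 1)
        left = max(s - 1, 0)
        V[s] = 1 + (1 - slip) * V[right] + slip * V[left]

print([round(v, 2) for v in V])  # expected steps-to-goal rise with slip
```

Because each update is an expectation, the planner's answer is still exact; a trial-and-error learner has to estimate those averages from noisy samples, which is why the paper found it needed more careful tuning here.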
The Bottom Line
This paper is a call for honesty in AI design.
- Don't fake it: Don't use "rewards" and "discounts" just to make the algorithm work. Use real-world costs (time, energy, distance).
- Know your tool: If you have a map (a model of the world), use the fast, deterministic planning methods. If you don't have a map and have to learn by doing, use RL, but be aware that it will be slower and requires careful tuning.
- They are cousins: Ultimately, Planning and RL are just different flavors of the same mathematical recipe (Dynamic Programming). By understanding their similarities, we can build better, more reliable robots that don't just "guess" their way to the goal, but actually understand the cost of their actions.