MDP Planning as Policy Inference

This paper reframes episodic Markov decision process planning as Bayesian inference over policies, introducing a variational sequential Monte Carlo method to approximate the posterior distribution of optimal behaviors and enable stochastic control through posterior predictive sampling rather than entropy regularization.

Original author: David Tolpin

Published 2026-04-14 · Author reviewed

This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot how to navigate a tricky maze to find treasure.

The Old Way: "The Entropy Chef"

Most modern AI methods (like Soft Actor-Critic, which the paper compares against) work a bit like a chef who is afraid of being too predictable. To make the robot explore, the chef adds a secret ingredient called "Entropy" (randomness) to the recipe.

  • The Problem: The chef doesn't actually know why the robot should be random. They just stir some amount of "chaos" into the recipe and, if the robot gets stuck, tweak that amount. It's a guess-and-check game: the robot learns a single, fuzzy strategy that is "okay" at many things but maybe not perfect at the best thing. (The standard formula behind this chaos dial is sketched just below.)
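For readers who want the math behind the "chaos dial": entropy-regularized methods like Soft Actor-Critic maximize expected reward plus an entropy bonus, with a temperature α controlling how much randomness is injected. This is the standard entropy-regularized objective, not a formula quoted from this paper:

```latex
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t} r(s_t, a_t) \;+\; \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Here H is the entropy of the policy at state s_t; tuning the "chaos" means tuning α by hand or with an auxiliary objective.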

The New Way: "The Council of Experts"

This paper proposes a different approach called MDP Planning as Policy Inference. Instead of adding random chaos, the authors treat the robot's strategy as a mystery to be solved.

Think of it like this (a runnable toy sketch of the whole procedure follows this list):

  1. The Council: Imagine a room full of 100 different experts (particles). Each expert has a slightly different idea of how to solve the maze.
  2. The Test: We send all 100 experts into the maze at the same time.
  3. The Twist (The "Shared Reality"): In the old way, if two experts walked into a slippery patch of the maze, they might slip in different directions just because of bad luck. This makes it hard to tell who is actually smart and who just got unlucky.
    • The Innovation: The authors say, "Let's make the maze behave the same way for everyone." If Expert A and Expert B both step on the same slippery tile, they both slip in the exact same direction. This ensures that if one expert does better than the other, it's because they made a better choice, not because the universe was nicer to them.
  4. The Voting: After the run, we look at who got the most treasure. We don't just pick the single "winner" and fire everyone else. Instead, we keep the whole group, but we give more "voting power" to the experts who did well.
  5. The Decision: When the robot needs to take a step in the real world, it doesn't follow one rigid plan. Instead, it randomly picks one expert from the council (weighted by their past success) and asks, "What would you do?"
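Here is a minimal, self-contained Python sketch of that loop. Everything in it is a toy stand-in: the 5-state chain MDP, the logits-table policies, and the softmax-style weighting are illustrative assumptions, not the paper's exact variational SMC updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 10   # particles ("experts") and episode length
S, A = 5, 2      # toy chain MDP: 5 states, 2 actions

# 1. The council: each particle is a candidate policy, here just a
#    table of action logits per state.
particles = rng.normal(size=(N, S, A))

# 3. Shared reality: one stream of environment noise per episode,
#    reused for every particle (common random numbers).
env_noise = rng.uniform(size=T)

def step(state, action, u):
    """Toy slippery dynamics: with probability 0.2 the move flips.
    The uniform draw `u` is the same for every particle at time t."""
    move = 1 if action == 1 else -1
    if u < 0.2:
        move = -move
    nxt = min(max(state + move, 0), S - 1)
    return nxt, (1.0 if nxt == S - 1 else 0.0)  # treasure at the far end

def episode_return(logits):
    state, total = 0, 0.0
    for t in range(T):
        action = int(np.argmax(logits[state]))  # this expert's choice
        state, reward = step(state, action, env_noise[t])
        total += reward
    return total

# 2 and 4. The test and the voting: run everyone under identical noise,
# then give more voting power to the experts with higher return.
returns = np.array([episode_return(p) for p in particles])
weights = np.exp(returns - returns.max())
weights /= weights.sum()

# 5. The decision: sample one expert by weight (posterior predictive
# sampling) and ask what it would do in the current state.
expert = particles[rng.choice(N, p=weights)]
print("action now:", int(np.argmax(expert[0])))
```

In the full method the weighting, resampling, and policy updates repeat across many episodes; this sketch shows a single weigh-and-sample round.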

Why is this better? (The Metaphors)

1. Uncertainty vs. Randomness

  • Old Way: The robot is random because we told it to be random (like a spinning top).
  • New Way: The robot is random because it is unsure.
    • Analogy: Imagine you are at a fork in the road.
      • If you are sure the left path leads to gold, you go left 100% of the time.
      • If you are unsure (maybe the left path is gold, maybe the right is), you might flip a coin.
    • In this new method, the robot only acts randomly when it genuinely doesn't know which path is best. If it knows the answer, it becomes 100% decisive. This is called Thompson Sampling (illustrated in the snippet just below).
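A few-line illustration of that fork-in-the-road behavior (the probabilities are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def choose(p_left_is_gold):
    """Thompson sampling at the fork: draw a hypothesis from the
    posterior belief, then act as if that hypothesis were true."""
    return "left" if rng.uniform() < p_left_is_gold else "right"

print([choose(0.99) for _ in range(5)])  # confident: almost always 'left'
print([choose(0.50) for _ in range(5)])  # unsure: roughly a coin flip
```

The randomness is not bolted on; it shrinks automatically as the posterior concentrates on one answer.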

2. The "Slippery" Problem
The paper solves a specific headache in AI training. Imagine training a driver on a video game where the car sometimes slides.

  • If you train 100 drivers and they all slide differently every time, you can't tell if Driver A is a better driver than Driver B, or if Driver A just got lucky with the road conditions.
  • The authors' method forces the "road conditions" (the slide) to be identical for all 100 drivers at the same moment. Now, the only difference is their driving skill. This makes the learning much faster and more accurate. (The toy comparison after this list puts numbers on the effect.)
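A toy numerical version of the driver comparison (the lap-time model and noise scale are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def lap_time(skill, road_noise):
    # Lower is better; `road_noise` models the random slides.
    return 60.0 - skill + road_noise

skill_a, skill_b = 2.0, 1.0   # driver A is truly 1 second faster

# Independent noise: each driver gets their own road conditions.
indep = [lap_time(skill_b, rng.normal(scale=5.0)) -
         lap_time(skill_a, rng.normal(scale=5.0)) for _ in range(10_000)]

# Common random numbers: both drivers face the same conditions.
shared = []
for _ in range(10_000):
    u = rng.normal(scale=5.0)
    shared.append(lap_time(skill_b, u) - lap_time(skill_a, u))

print(np.mean(indep), np.std(indep))    # gap near 1, but std near 7
print(np.mean(shared), np.std(shared))  # gap exactly 1, std exactly 0
```

With independent noise the true one-second gap is drowned out by luck; with shared noise the luck cancels and every single comparison ranks the drivers correctly.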

The Results: What Happened?

The authors tested this on games like Blackjack and Grid Worlds (mazes).

  • In the Maze: The old method (SAC) sometimes made the robot wander near the walls just to be "random" and explore. The new method (Policy Inference) kept the robot focused on the goal. It only wandered when it was truly confused.
  • In Blackjack: The new method found a better strategy than the old method without needing to tweak the "randomness" settings as much.
  • The Catch: The new method is very sensitive to how much "reward" (money) you give for winning. If the reward is huge, the robot becomes too confident too fast. If the reward is small, it stays unsure longer. The old method is a bit more forgiving of these numbers.

The Bottom Line

This paper teaches us that instead of forcing an AI to be random to help it learn, we should let the AI figure out its own uncertainty.

By treating the AI's strategy as a group of competing hypotheses and testing them fairly (using the same "weather" for everyone), the AI learns to be decisive when it knows the answer and cautiously random when it doesn't. It's the difference between a robot that is "confused by design" and a robot that is "honestly unsure."
