The Big Picture: Learning a Video Game Without a Save Button
Imagine you are trying to learn how to play a very difficult video game.
- The Goal: You want to get the highest score possible.
- The Rules: The game has a specific structure (the "Linear Qπ Realizability" assumption), meaning there's a hidden pattern to the points you can get, but you don't know the pattern yet.
- The Problem: In the past, algorithms that could learn this pattern were either too slow (like trying to solve a math problem that takes a billion years) or unfair (they cheated by using a "Save/Load" feature).
The "Save/Load" Cheat:
Imagine you are playing a game, and you reach a tricky level. You die. In a normal game, you have to restart from the beginning. But in the "cheat" version (called a simulator or generative model), you can hit "Load Game" and instantly be back at that tricky spot to try again.
- Old algorithms relied on this cheat. They would get to a hard spot, save the game, try 1,000 different moves, and then move on.
- The Catch: In the real world (and in this paper's setting), you cannot save and reload. You start at a random place, play through, and when you die, you start over from a different random place. You might never see that specific tricky spot again.
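The gap between the two access models can be sketched as two toy interfaces. This is an illustration, not code from the paper; the class and method names (`GenerativeModel`, `OnlineEnv`, `query`, `reset`, `step`) are made up for the sketch:

```python
class GenerativeModel:
    """Simulator access: you can 'load' any state and try a move there."""
    def __init__(self, transition, reward):
        self.transition, self.reward = transition, reward

    def query(self, state, action):
        # Jump straight to any (state, action) pair -- the "Save/Load" cheat.
        return self.transition(state, action), self.reward(state, action)


class OnlineEnv:
    """Online access: only reset-to-start and step-forward are allowed."""
    def __init__(self, transition, reward, initial_state):
        self._t, self._r, self._s0 = transition, reward, initial_state
        self._state = initial_state

    def reset(self):
        # When you "die", you go back to the start -- no reloading mid-level.
        self._state = self._s0
        return self._state

    def step(self, action):
        reward = self._r(self._state, action)
        self._state = self._t(self._state, action)
        return self._state, reward
```

With only `OnlineEnv`, reaching a tricky state again means replaying the whole episode from the start, which is exactly the difficulty this paper tackles.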
The Solution: "Frozen Policy Iteration" (FPI)
The authors propose a new way to learn called Frozen Policy Iteration. Think of it as a smart explorer who knows when to stop guessing and start trusting what they already know.
Here is how it works, broken down into three simple steps:
1. The "High-Confidence" Map
Imagine you are exploring a dark cave. You have a flashlight (your data).
- When you shine the light on a rock and see it clearly, you mark it on your map as "Safe/Known."
- When you shine the light and it's still dark, you mark it as "Unknown/Dangerous."
The algorithm only trusts the "Safe" parts of the map. If a spot is "Safe," it assumes it knows the best move there. If a spot is "Dark," it admits it doesn't know and tries something new to learn more.
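The known/unknown split can be sketched with a simple visit-count rule. Note this is a deliberate simplification: the paper's algorithm builds confidence sets over linear features rather than counting visits, but the "trust a spot only after enough data" logic is the same, and the names and threshold below are illustrative:

```python
from collections import defaultdict


class ConfidenceMap:
    """Mark a (state, action) pair as 'known' once enough data is seen.

    Count-based stand-in for the paper's linear confidence sets: the
    classification logic (known vs. unknown) is what matters here.
    """
    def __init__(self, threshold=10):
        self.counts = defaultdict(int)
        self.threshold = threshold  # illustrative: visits needed to trust a spot

    def observe(self, state, action):
        self.counts[(state, action)] += 1

    def is_known(self, state, action):
        return self.counts[(state, action)] >= self.threshold
```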
2. The "Freezing" Trick (The Core Innovation)
This is the magic part.
In old methods, every time you learned something new, you would re-calculate the entire map from scratch. This caused chaos: spots you had marked "Safe" could suddenly look "Unsafe" again, because the whole map had shifted underneath them.
FPI does something different:
Once a spot on your map is "Safe" (you have enough data to be confident), you "Freeze" your decision for that spot.
- Metaphor: Imagine you are a teacher grading a test. Once you are 100% sure a student's answer is correct, you stamp it "APPROVED" and put it in a locked box. Even if you learn new things later, you don't go back and un-stamp that answer.
- Why this helps: Because you stop changing your mind about the "Safe" spots, the data you collect later remains consistent. You don't need to go back and re-test those spots (which would require the "Save/Load" cheat). You just keep moving forward, learning only about the "Dark" spots.
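The freeze-once rule above can be sketched in a few lines. Again a hypothetical sketch, not the paper's implementation; the class name `FrozenPolicy` and its methods are invented for illustration:

```python
class FrozenPolicy:
    """Once a state's action is committed, it is never revised.

    Mirrors the 'stamp it APPROVED and lock it in a box' idea: later
    data can add new entries but never overwrite frozen ones.
    """
    def __init__(self):
        self._frozen = {}  # state -> committed action

    def freeze(self, state, action):
        # Only the first confident decision counts; later calls are no-ops.
        self._frozen.setdefault(state, action)

    def act(self, state, explore_action):
        # In a frozen (known) state, trust the locked-in choice;
        # in an unknown state, fall back to exploration.
        return self._frozen.get(state, explore_action)
```

Because frozen decisions never change, data gathered under the frozen policy stays consistent with itself, which is what removes the need to revisit old states.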
3. The "One-Way Street"
Because the game has Deterministic Dynamics (if you press "Jump" at this exact spot, you always land in the same place), the algorithm creates a one-way street.
- You explore a new, dark area.
- Once you figure it out, you "freeze" it.
- You move to the next dark area.
- You never have to go back to the first area because the rules of the game guarantee that once you leave a "Frozen" area, you will always land in a place you've already mapped or a new place you need to explore.
Why is this a Big Deal?
- No Cheating: It works in the "real world" where you can't save and reload your game.
- Fast: It doesn't need to solve impossible math problems. It's computationally efficient (it runs fast on a normal computer).
- Smart: It achieves a "Regret Bound" (a measure of the total score you give up compared to playing optimally from the start) that is nearly optimal. It learns almost as fast as the theoretical limit allows.
The "Ablation" Experiment (The Proof)
The authors tested their idea on simple robot games (like balancing a pole on a cart).
- Team A (With Freezing): The robot learned quickly and got high scores.
- Team B (Without Freezing): The robot kept changing its mind about old spots, got confused, and learned much slower.
This proved that the "Freezing" mechanism is the secret sauce.
Summary Analogy
Imagine you are learning a new language by traveling through a city.
- Old Way: Every time you learn a new word, you go back to the beginning of the city and re-practice every word you've ever learned to make sure you haven't forgotten. This takes forever.
- Frozen Way: You learn a word. Once you are confident you know it, you freeze that knowledge. You lock it in your brain and move on to the next street. You never look back. Because the city is laid out in a straight line (deterministic), you know you won't need to re-learn the first street to understand the last one.
The Result: You learn the whole city much faster, without needing a time machine to go back and practice.
In a Nutshell
This paper introduces a clever algorithm that learns complex tasks efficiently by stopping the habit of second-guessing itself. Once it's sure of something, it locks that knowledge in place ("Freezes" it) and moves forward, allowing it to learn effectively without needing a "Save Game" button.