Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

This paper extends theoretical guarantees for offline reinforcement learning with pessimism to parameterized policies over large or continuous action spaces. It addresses the challenge of contextual coupling through a novel connection between mirror descent and natural policy gradient, thereby unifying offline RL with imitation learning.

Xiang Li, Yuheng Zhang, Nan Jiang

Published 2026-03-04

Imagine you are trying to teach a robot to play a complex video game, like a racing simulator or a strategy game. But there's a catch: you cannot let the robot play the game anymore. You only have a giant hard drive full of recordings of a human expert playing the game in the past. Your goal is to teach the robot to play better than the human, using only those old recordings. This is called Offline Reinforcement Learning.

For a long time, the math behind this was tricky. Most successful theories worked like this:

  1. The Critic: A "judge" looks at the recordings and says, "If you do this move in this situation, you'll get a high score."
  2. The Actor: The robot's brain. In old theories, the robot's brain wasn't a separate, flexible thing. It was just a mirror reflecting the judge's advice. If the judge said "Turn left," the robot turned left. If the judge said "Turn right," the robot turned right.
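The "mirror" the authors refer to is the state-wise mirror descent (softmax) update, in which the new policy at each state is the old one reweighted by the critic's advantage, independently of every other state. As a minimal toy sketch (not the paper's algorithm; the tabular setup, step size, and advantage values here are all illustrative assumptions):

```python
import numpy as np

def statewise_mirror_descent_step(pi, advantage, eta=1.0):
    """One state-wise mirror-descent (softmax) step:
    pi'(a|s) proportional to pi(a|s) * exp(eta * A(s, a)),
    computed separately for each state (each row)."""
    logits = np.log(pi) + eta * advantage                     # (n_states, n_actions)
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy example: 2 states, 3 actions, uniform starting policy,
# with made-up advantage estimates from the "judge" (critic).
pi = np.full((2, 3), 1 / 3)
adv = np.array([[1.0, 0.0, -1.0],
                [-1.0, 0.0, 1.0]])
new_pi = statewise_mirror_descent_step(pi, adv, eta=1.0)
```

After the step, each state's distribution shifts toward its highest-advantage action, exactly the "judge says turn left, robot turns left" behavior; note that nothing ties the rows together, which is the rigidity the next section criticizes.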

The Problem: The "Mirror" is Too Rigid

The authors of this paper point out a major flaw in this "mirror" approach.

  • Real life is continuous: In the real world (like driving a car), you don't just have "Left" or "Right." You have "Turn the wheel 12.4 degrees." The old mirror method struggled with these infinite possibilities.
  • The "Contextual Coupling" Trap: The old method treated every situation (state) as an isolated island. It said, "In this specific traffic jam, turn left." But in reality, the robot's brain is one single network (like a neural network) that connects all situations. If you tweak the brain to fix the traffic jam, it might accidentally break how it handles a highway. The old math didn't account for how changing one part of the brain affects every other part. They call this Contextual Coupling.

Think of it like tuning a piano. The old method tried to tune each key (state) independently. But if you tune the "C" key, the tension changes for the whole string, affecting the "D" and "E" keys. You can't tune them in isolation without breaking the harmony.
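This coupling is easy to see in any policy with shared parameters. In the hypothetical sketch below (the linear-softmax policy class and feature map are my own illustrative choices, not from the paper), a parameter nudge aimed only at one state also changes the policy at a completely different state:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy(theta, s):
    """Linear-softmax policy: one shared theta for ALL states.
    The feature map below is a made-up illustration."""
    feats = np.array([[1.0, s],
                      [0.5, -s],
                      [-1.0, 0.2 * s]])   # (n_actions, n_params)
    return softmax(feats @ theta)

theta = np.zeros(2)
before_s2 = policy(theta, s=2.0)          # uniform over 3 actions

# Nudge theta toward action 0's feature vector at state s=1 only
# (action 0's features at s=1 are [1.0, 1.0]).
theta = theta + 0.5 * np.array([1.0, 1.0])

after_s2 = policy(theta, s=2.0)
# The policy at the untouched state s=2 has changed too:
# shared parameters couple every state, like the piano strings above.
```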

The Solution: A New Way to Learn

The authors propose a new way to update the robot's brain (the "Actor") that respects this connection. They introduce two main strategies:

1. The "Least Squares" Approach (LSPU)

Imagine the robot is trying to learn a dance.

  • The Goal: The robot wants to move in a way that matches the "Advantage" (the extra points it would get by doing a specific move).
  • The Method: The robot looks at all the old recordings and asks, "What is the simplest mathematical formula that predicts the best moves?" It uses a technique called Least Squares Regression.
  • The Analogy: It's like drawing a straight line through a cloud of scattered dots on a graph. The robot tries to find the line (the policy) that fits the "good moves" best. If the judge (Critic) and the robot (Actor) speak the same language (mathematically compatible), this works perfectly.
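The "line through scattered dots" picture is ordinary least-squares regression. The paper's LSPU update is defined for general function classes, but as a hedged sketch of the core idea, here is a linear version: regress the critic's advantage targets onto shared policy features, instead of matching the critic state by state (the dataset, noise level, and linear policy class are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "offline dataset": feature vectors phi(s, a) for logged
# state-action pairs, plus noisy advantage estimates from the critic.
Phi = rng.normal(size=(200, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
adv = Phi @ true_w + 0.1 * rng.normal(size=200)

# Least-squares policy update: one shared parameter vector fit to
# ALL the data at once, so every state's evidence pulls on the same w.
w_hat, *_ = np.linalg.lstsq(Phi, adv, rcond=None)
```

Because one parameter vector is fit jointly across all logged situations, the update respects the coupling by construction; if the critic and the policy class are "compatible" in the paper's sense, this single regression recovers the right update.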

2. The "Distributionally Robust" Approach (DRPU)

Sometimes, the judge and the robot don't speak the same language. The judge might be very optimistic about certain moves that the robot's brain can't actually execute well.

  • The Problem: If the robot just blindly follows the judge, it might get tricked by bad data.
  • The Method: This approach is like a paranoid safety inspector. Instead of just looking at the average score, the robot asks: "What is the worst-case scenario if I make a mistake in my understanding of the data?" It prepares for the worst possible interpretation of the old recordings.
  • The Magic Connection: The paper discovers something surprising. If the robot's training data comes from the exact same expert it is trying to copy (no distribution shift), this "paranoid" method actually turns into Behavior Cloning. It's like the robot realizing, "Oh, I don't need to be a genius; I just need to perfectly mimic the expert." This unifies two different fields of AI (learning from scratch vs. copying experts).
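The "paranoid safety inspector" idea can be illustrated with a tiny distributionally-robust comparison (this crude uncertainty set is my own toy stand-in, not the paper's DRPU construction): instead of ranking candidate policies by their average score on the logged data, rank them by their score under the worst allowed reweighting of that data.

```python
import numpy as np

def robust_score(scores, k):
    """Worst-case mean when the adversary may concentrate the data
    distribution on the k lowest-scoring samples (a crude, illustrative
    uncertainty set, not the paper's actual construction)."""
    return np.sort(scores)[:k].mean()

# Two candidate policies evaluated on the same 4 logged episodes.
scores_a = np.array([1.0, 1.1, 0.9, 5.0])   # high average, rides one outlier
scores_b = np.array([1.4, 1.5, 1.3, 1.4])   # steady

avg_a, avg_b = scores_a.mean(), scores_b.mean()          # plain average prefers A
rob_a, rob_b = robust_score(scores_a, 2), robust_score(scores_b, 2)  # pessimism prefers B
```

The plain average is seduced by A's single lucky episode; the pessimistic score prefers the steady B, which is the safety the section describes. The behavior-cloning connection is a separate result in the paper: when the logged data comes from the very policy being imitated, the worst case collapses and the robust objective reduces to mimicking the expert.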

Why This Matters

  • It works for continuous actions: You can now use this for robots with smooth, continuous movements (like a drone flying or a car steering), not just games with simple buttons.
  • It's practical: It allows the robot to have its own "brain" (a neural network) that isn't just a slave to the judge. It can learn complex, independent strategies.
  • It's safe: By accounting for the "coupling" (how one part of the brain affects the rest), the robot doesn't break itself while trying to learn.

The Big Picture

This paper is like fixing the blueprint for teaching a robot from a history book.

  • Old Way: "Here is a list of instructions. Do exactly what the list says, key by key." (Fails when the instructions are too complex or the robot's brain is too connected).
  • New Way: "Here is a history book. Look at the patterns, understand the connections between moves, and build a brain that can generalize. If you get stuck, prepare for the worst, and if you have a perfect copy of the expert, just mimic them perfectly."

The authors have successfully bridged the gap between complex mathematical theory and the messy, continuous reality of real-world robotics and AI.
