Global Convergence of Average Reward Constrained MDPs with Neural Critic and General Policy Parameterization

This paper proposes a primal-dual natural actor-critic algorithm that pairs multi-layer neural-network critics with Neural Tangent Kernel theory. It establishes the first global-convergence and cumulative constraint-violation guarantees for infinite-horizon average-reward Constrained MDPs with general policy parameterization, overcoming the limitations of previous tabular or linear-critic approaches.

Anirudh Satheesh, Pankaj Kumar Barman, Washim Uddin Mondal, Vaneet Aggarwal

Published 2026-03-10

Imagine you are teaching a robot to drive a taxi through a busy city. You have two main goals:

  1. Maximize Profit: Get the most fares possible (Reward).
  2. Stay Safe: Never run a red light or hit a pedestrian (Constraint).

In the world of Artificial Intelligence, this is called a Constrained Markov Decision Process (CMDP). The tricky part is that the robot learns by trial and error, and the city traffic is unpredictable (this is "Markovian sampling").
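In symbols, the average-reward CMDP is the following optimization (a standard formulation; the paper's exact notation may differ):

```latex
\max_{\pi}\; J_r(\pi)
\quad \text{s.t.} \quad J_c(\pi) \ge 0,
\qquad \text{where} \quad
J_g(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} g(s_t, a_t)\right],
\quad g \in \{r, c\}.
```

Here r is the fare reward and c is the safety signal. "Markovian sampling" means both long-run averages must be estimated from a single unbroken stream of correlated experience, not from independent restarts.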

For a long time, the math behind teaching robots this way only worked if the robot had a very simple brain (like a basic lookup table) or a "linear" brain. But to drive a real car, the robot needs a Deep Neural Network—a complex, multi-layered brain capable of understanding nuance. Until now, no one could mathematically prove that a robot with this complex brain would actually learn to drive safely and efficiently without going crazy or breaking the rules.

This paper introduces a new algorithm called PDNAC-NC that finally solves this puzzle. Here is how it works, explained with simple analogies:

1. The Problem: The "Mixing Time" Trap

Imagine you are trying to learn the average speed of cars in a city. If you just watch one car for 5 minutes, you might get lucky and see a traffic jam, or unlucky and see a clear road. To get a true average, you usually need to wait a long time for the traffic to "mix" and settle into a pattern.

In old AI algorithms, researchers assumed you had a magical "Oracle" (a crystal ball) that told you exactly how long to wait for the traffic to mix before you could take a measurement. If you didn't wait long enough, your data was biased. If you waited too long, you wasted time.

  • The Paper's Fix: Instead of waiting for a crystal ball, the authors use a technique called Multi-Level Monte Carlo (MLMC).
  • The Analogy: Imagine you want to know the average height of people in a room. Instead of deciding in advance how long to measure, you flip coins to pick a batch size: measure 1 person, or 2, or 4, or 8, and so on, with the big batches chosen rarely. Then you combine the small-batch and big-batch averages with carefully chosen weights. On average this gives an accurate answer without ever needing to know how long the room takes to "mix," and it uses every single piece of data you collect rather than throwing away the "early" data.
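The doubling-batch idea above can be sketched in code. This is a toy illustration of the standard MLMC mean estimator on a simple two-state Markov chain, not the paper's actual gradient estimator; the chain, the truncation level, and all constants are our own choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def markov_step(s):
    # Toy two-state Markov chain: keep the current state with prob 0.7.
    return s if rng.random() < 0.7 else 1 - s

def mlmc_estimate(s0=0, max_level=8):
    # Draw a random level J with P(J = j) = 2^-j, then run the chain for
    # 2^J steps and combine short-run and long-run averages so the result
    # is (nearly) unbiased for the stationary mean -- without ever
    # needing to know the chain's mixing time.
    J = min(rng.geometric(0.5), max_level)  # truncation adds a tiny bias
    n = 2 ** J
    samples, s = [], s0
    for _ in range(n):
        s = markov_step(s)
        samples.append(s)
    avg = lambda m: sum(samples[:m]) / m
    # Telescoping estimator: f_1 + 2^J * (f_{2^J} - f_{2^{J-1}})
    return avg(1) + n * (avg(n) - avg(n // 2))

# Averaging many independent estimates recovers the stationary mean (0.5).
est = np.mean([mlmc_estimate() for _ in range(20000)])
print(round(est, 2))
```

The rare long batches correct the bias of the cheap short ones; the 2^J reweighting is exactly what makes the telescoping sum unbiased.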

2. The Problem: The "Complex Brain" (Neural Networks)

Deep neural networks are like black boxes. When you tweak them slightly, they can change their behavior in wild, unpredictable ways. This makes it hard to prove they will converge (settle down) to a good solution.

  • The Paper's Fix: They use a concept called the Neural Tangent Kernel (NTK).
  • The Analogy: Imagine a giant, complex trampoline. If you push it in the middle, it bends in a crazy shape. But, if you only push it very, very slightly near the center, the trampoline behaves almost like a flat, straight piece of wood. It's predictable and linear.
    The authors force the robot's brain to stay within a tiny "neighborhood" of its starting point. This keeps the brain "linear" enough for the math to work, while still being powerful enough to learn complex driving skills. They prove that as the brain gets wider (more neurons), this "linear approximation" becomes almost perfect.
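The trampoline picture can be made concrete. The sketch below is our own toy experiment, not the paper's construction: it builds a one-hidden-layer ReLU network with NTK-style 1/sqrt(width) output scaling, nudges its weights by the small amount typical of NTK-regime training, and measures how far the true output strays from the tangent-plane (linear) prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(W, a, x):
    # One-hidden-layer ReLU net with NTK-style 1/sqrt(width) scaling.
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(len(a))

def linearization_gap(width, dim=4, step=0.5):
    W = rng.standard_normal((width, dim))
    a = rng.standard_normal(width)
    x = rng.standard_normal(dim)
    # In the NTK regime each individual weight only moves O(1/sqrt(width)),
    # so the perturbation is scaled down accordingly.
    dW = step * rng.standard_normal(W.shape) / np.sqrt(width)
    f_new = forward(W + dW, a, x)                 # true perturbed output
    # First-order Taylor expansion around the initial weights:
    # d f / d w_i = a_i * 1[w_i . x > 0] * x / sqrt(width)
    grad_W = np.outer(a * (W @ x > 0), x) / np.sqrt(width)
    f_lin = forward(W, a, x) + np.sum(grad_W * dW)
    return abs(f_new - f_lin)

gaps = [np.mean([linearization_gap(m) for _ in range(200)])
        for m in (16, 4096)]
print(gaps)  # the wide network hugs its linearization far more closely
```

The narrow net (width 16) visibly deviates from its tangent plane, while the wide one (width 4096) barely does: the "trampoline" flattens as width grows.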

3. The Problem: The "Tug-of-War" (Primal-Dual)

The robot has to balance three roles: choosing actions (Actor), evaluating them (Critic), and pricing rule-breaking (Dual).

  • The Actor decides how to drive and wants to collect fares fast.
  • The Critic estimates how well the current strategy is doing, on both profit and safety, so the Actor knows which way to adjust.
  • The Dual Variable is the "price" of breaking a rule. If the robot breaks a rule often, the price goes up, and the Actor is forced to slow down.
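This tug-of-war can be shown numerically. The loop below is a deliberately tiny one-dimensional illustration of generic primal-dual updates, not the paper's PDNAC-NC algorithm: a scalar "speed" knob theta earns reward but must respect a speed limit, and the dual price lam rises whenever the limit is violated.

```python
# Toy primal-dual loop: reward J_r(theta) = theta - 0.5 * theta**2,
# constraint J_c(theta) = 0.5 - theta >= 0 (a "speed limit" of 0.5).
# Lagrangian: L(theta, lam) = J_r(theta) + lam * J_c(theta).
eta_theta, eta_lam = 0.05, 0.5
theta, lam = 0.0, 0.0
for _ in range(2000):
    reward_grad = 1.0 - theta       # d J_r / d theta
    constraint = 0.5 - theta        # J_c; negative means a violation
    constraint_grad = -1.0          # d J_c / d theta
    # Actor: gradient ascent on the Lagrangian.
    theta += eta_theta * (reward_grad + lam * constraint_grad)
    # Dual: the "price" of rule-breaking rises when J_c < 0,
    # and is projected back to zero when the constraint is slack.
    lam = max(0.0, lam - eta_lam * constraint)
print(round(theta, 2), round(lam, 2))  # settles at the limit: 0.5 0.5
```

The actor first accelerates past the limit, the price shoots up, and the two settle at the constrained optimum, where the speed sits exactly at the limit and the price equals the marginal reward of speeding.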

In "Average Reward" settings (like driving forever), the math is notoriously unstable because there is no discounting or "end game" to look forward to. The errors from the Actor, the Critic, and the Dual Variable can pile up and cause the whole system to crash.

  • The Paper's Fix: They developed a new way to track these errors.
  • The Analogy: Think of it like a tightrope walker (the Actor) holding a long pole (the Critic) while a wind gusts (the Constraints). If the walker leans too far one way, the pole swings. The authors created a new mathematical "safety harness" that tracks how much the walker, the pole, and the wind are all wobbling. They proved that even with the wind, the walker won't fall off the rope, provided the pole isn't too heavy and the harness is tight enough.

The Big Result

The paper proves that this new algorithm:

  1. Converges Globally: It doesn't just get "okay" at driving; it is mathematically guaranteed to find the best possible driving strategy over time.
  2. Respects Constraints: It guarantees that the robot won't break the rules too often (specifically, the violations drop as the robot learns more).
  3. Needs No Crystal Ball: It works without knowing the "mixing time" of the environment.
  4. Works with Deep Learning: It is the first time this level of safety and efficiency has been proven for robots using complex, multi-layered neural networks.

In summary: The authors built a new training method for AI agents that allows them to learn complex tasks (like driving or managing a power grid) while strictly obeying safety rules. They did this by using a clever sampling trick to avoid waiting for "perfect" data and by keeping the AI's brain in a "safe zone" where the math works, all while proving that the robot will eventually learn to be both efficient and safe.