Global Convergence of Average Reward Constrained MDPs with Neural Critic and General Policy Parameterization

This paper proposes a primal-dual natural actor-critic algorithm that pairs multi-layer neural-network critics with Neural Tangent Kernel theory. It establishes the first global-convergence and cumulative constraint-violation guarantees for infinite-horizon average-reward Constrained MDPs with general policy parameterization, overcoming the limitations of previous tabular or linear-critic approaches.

Anirudh Satheesh, Pankaj Kumar Barman, Washim Uddin Mondal, Vaneet Aggarwal

Published 2026-03-10

Imagine you are teaching a robot to drive a taxi through a busy city. You have two main goals:

  1. Maximize Profit: Get the most fares possible (Reward).
  2. Stay Safe: Never run a red light or hit a pedestrian (Constraint).

In the world of Artificial Intelligence, this is called a Constrained Markov Decision Process (CMDP). The tricky part is that the robot learns by trial and error, and the city traffic is unpredictable (this is "Markovian sampling").
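In symbols, the average-reward CMDP is the following optimization (a standard formulation; the paper's exact notation may differ):

```latex
\max_{\pi}\; J_r(\pi)
\quad \text{s.t.} \quad J_c(\pi) \ge 0,
\qquad \text{where} \quad
J_g(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} g(s_t, a_t)\right],
\quad g \in \{r, c\}.
```

Here r is the fare reward and c is the safety signal. "Markovian sampling" means both long-run averages must be estimated from a single unbroken stream of correlated experience, not from independent restarts.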

For a long time, the math behind teaching robots this way only worked if the robot had a very simple brain (like a basic lookup table) or a "linear" brain. But to drive a real car, the robot needs a Deep Neural Network—a complex, multi-layered brain capable of understanding nuance. Until now, no one could mathematically prove that a robot with this complex brain would actually learn to drive safely and efficiently without going crazy or breaking the rules.

This paper introduces a new algorithm called PDNAC-NC that finally solves this puzzle. Here is how it works, explained with simple analogies:

1. The Problem: The "Mixing Time" Trap

Imagine you are trying to learn the average speed of cars in a city. If you just watch one car for 5 minutes, you might get lucky and see a traffic jam, or unlucky and see a clear road. To get a true average, you usually need to wait a long time for the traffic to "mix" and settle into a pattern.

In old AI algorithms, researchers assumed you had a magical "Oracle" (a crystal ball) that told you exactly how long to wait for the traffic to mix before you could take a measurement. If you didn't wait long enough, your data was biased. If you waited too long, you wasted time.

  • The Paper's Fix: Instead of waiting for a crystal ball, the authors use a technique called Multi-Level Monte Carlo (MLMC).
  • The Analogy: Imagine you want to know the average height of people in a room. Instead of deciding in advance how long to measure, you flip coins to pick a batch size: measure 1 person, or 2, or 4, or 8, and so on, with the big batches chosen rarely. Then you combine the small-batch and big-batch averages with carefully chosen weights. On average this gives an accurate answer without ever needing to know how long the room takes to "mix," and it uses every single piece of data you collect rather than throwing away the "early" data.
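The doubling-batch idea above can be sketched in code. This is a toy illustration of the standard MLMC mean estimator on a simple two-state Markov chain, not the paper's actual gradient estimator; the chain, the truncation level, and all constants are our own choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def markov_step(s):
    # Toy two-state Markov chain: keep the current state with prob 0.7.
    return s if rng.random() < 0.7 else 1 - s

def mlmc_estimate(s0=0, max_level=8):
    # Draw a random level J with P(J = j) = 2^-j, then run the chain for
    # 2^J steps and combine short-run and long-run averages so the result
    # is (nearly) unbiased for the stationary mean -- without ever
    # needing to know the chain's mixing time.
    J = min(rng.geometric(0.5), max_level)  # truncation adds a tiny bias
    n = 2 ** J
    samples, s = [], s0
    for _ in range(n):
        s = markov_step(s)
        samples.append(s)
    avg = lambda m: sum(samples[:m]) / m
    # Telescoping estimator: f_1 + 2^J * (f_{2^J} - f_{2^{J-1}})
    return avg(1) + n * (avg(n) - avg(n // 2))

# Averaging many independent estimates recovers the stationary mean (0.5).
est = np.mean([mlmc_estimate() for _ in range(20000)])
print(round(est, 2))
```

The rare long batches correct the bias of the cheap short ones; the 2^J reweighting is exactly what makes the telescoping sum unbiased.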

2. The Problem: The "Complex Brain" (Neural Networks)

Deep neural networks are like black boxes. When you tweak them slightly, they can change their behavior in wild, unpredictable ways. This makes it hard to prove they will converge (settle down) to a good solution.

  • The Paper's Fix: They use a concept called the Neural Tangent Kernel (NTK).
  • The Analogy: Imagine a giant, complex trampoline. If you push it in the middle, it bends in a crazy shape. But, if you only push it very, very slightly near the center, the trampoline behaves almost like a flat, straight piece of wood. It's predictable and linear.
    The authors force the robot's brain to stay within a tiny "neighborhood" of its starting point. This keeps the brain "linear" enough for the math to work, while still being powerful enough to learn complex driving skills. They prove that as the brain gets wider (more neurons), this "linear approximation" becomes almost perfect.
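The trampoline picture can be made concrete. The sketch below is our own toy experiment, not the paper's construction: it builds a one-hidden-layer ReLU network with NTK-style 1/sqrt(width) output scaling, nudges its weights by the small amount typical of NTK-regime training, and measures how far the true output strays from the tangent-plane (linear) prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(W, a, x):
    # One-hidden-layer ReLU net with NTK-style 1/sqrt(width) scaling.
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(len(a))

def linearization_gap(width, dim=4, step=0.5):
    W = rng.standard_normal((width, dim))
    a = rng.standard_normal(width)
    x = rng.standard_normal(dim)
    # In the NTK regime each individual weight only moves O(1/sqrt(width)),
    # so the perturbation is scaled down accordingly.
    dW = step * rng.standard_normal(W.shape) / np.sqrt(width)
    f_new = forward(W + dW, a, x)                 # true perturbed output
    # First-order Taylor expansion around the initial weights:
    # d f / d w_i = a_i * 1[w_i . x > 0] * x / sqrt(width)
    grad_W = np.outer(a * (W @ x > 0), x) / np.sqrt(width)
    f_lin = forward(W, a, x) + np.sum(grad_W * dW)
    return abs(f_new - f_lin)

gaps = [np.mean([linearization_gap(m) for _ in range(200)])
        for m in (16, 4096)]
print(gaps)  # the wide network hugs its linearization far more closely
```

The narrow net (width 16) visibly deviates from its tangent plane, while the wide one (width 4096) barely does: the "trampoline" flattens as width grows.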

3. The Problem: The "Tug-of-War" (Primal-Dual)

The robot has to balance three roles: choosing actions (Actor), evaluating them (Critic), and pricing rule-breaking (Dual).

  • The Actor decides how to drive and wants to collect fares fast.
  • The Critic estimates how well the current strategy is doing, on both profit and safety, so the Actor knows which way to adjust.
  • The Dual Variable is the "price" of breaking a rule. If the robot breaks a rule often, the price goes up, and the Actor is forced to slow down.
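This tug-of-war can be shown numerically. The loop below is a deliberately tiny one-dimensional illustration of generic primal-dual updates, not the paper's PDNAC-NC algorithm: a scalar "speed" knob theta earns reward but must respect a speed limit, and the dual price lam rises whenever the limit is violated.

```python
# Toy primal-dual loop: reward J_r(theta) = theta - 0.5 * theta**2,
# constraint J_c(theta) = 0.5 - theta >= 0 (a "speed limit" of 0.5).
# Lagrangian: L(theta, lam) = J_r(theta) + lam * J_c(theta).
eta_theta, eta_lam = 0.05, 0.5
theta, lam = 0.0, 0.0
for _ in range(2000):
    reward_grad = 1.0 - theta       # d J_r / d theta
    constraint = 0.5 - theta        # J_c; negative means a violation
    constraint_grad = -1.0          # d J_c / d theta
    # Actor: gradient ascent on the Lagrangian.
    theta += eta_theta * (reward_grad + lam * constraint_grad)
    # Dual: the "price" of rule-breaking rises when J_c < 0,
    # and is projected back to zero when the constraint is slack.
    lam = max(0.0, lam - eta_lam * constraint)
print(round(theta, 2), round(lam, 2))  # settles at the limit: 0.5 0.5
```

The actor first accelerates past the limit, the price shoots up, and the two settle at the constrained optimum, where the speed sits exactly at the limit and the price equals the marginal reward of speeding.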

In "Average Reward" settings (like driving forever), the math is notoriously unstable because there is no discounting or "end game" to look forward to. The errors from the Actor, the Critic, and the Dual Variable can pile up and cause the whole system to crash.

  • The Paper's Fix: They developed a new way to track these errors.
  • The Analogy: Think of it like a tightrope walker (the Actor) holding a long pole (the Critic) while a wind gusts (the Constraints). If the walker leans too far one way, the pole swings. The authors created a new mathematical "safety harness" that tracks how much the walker, the pole, and the wind are all wobbling. They proved that even with the wind, the walker won't fall off the rope, provided the pole isn't too heavy and the harness is tight enough.

The Big Result

The paper proves that this new algorithm:

  1. Converges Globally: It doesn't just get "okay" at driving; it is mathematically guaranteed to find the best possible driving strategy over time.
  2. Respects Constraints: It guarantees that the robot won't break the rules too often (specifically, the violations drop as the robot learns more).
  3. Needs No Crystal Ball: It works without knowing the "mixing time" of the environment.
  4. Works with Deep Learning: It is the first time this level of safety and efficiency has been proven for robots using complex, multi-layered neural networks.

In summary: The authors built a new training method for AI agents that allows them to learn complex tasks (like driving or managing a power grid) while strictly obeying safety rules. They did this by using a clever sampling trick to avoid waiting for "perfect" data and by keeping the AI's brain in a "safe zone" where the math works, all while proving that the robot will eventually learn to be both efficient and safe.