Robustness to Model Approximation, Model Learning From Data, and Sample Complexity in Wasserstein Regular MDPs

This paper establishes robustness bounds for discrete-time stochastic optimal control under Wasserstein model approximation. It shows that the performance loss of policies derived from approximate models is controlled by the Wasserstein-1 distance between transition kernels, which in turn enables rigorous sample complexity analysis for learning models and noise distributions from data, even in settings where stronger convergence criteria (such as total variation) may fail.

Yichen Zhou, Yanglei Song, Serdar Yüksel

Published Tue, 10 Ma

Imagine you are trying to teach a robot to navigate a maze. To do this perfectly, the robot needs a map (the model) that tells it exactly where every wall is and how the floor feels under its wheels.

In the real world, we rarely have a perfect map. We usually have to learn the map by watching the robot move around, or by using a slightly blurry version of the map. This paper is about answering a very practical question: "If I use a slightly wrong map to teach my robot, how badly will it perform in the real maze?"

Here is the breakdown of the paper's ideas using simple analogies.

1. The Core Problem: The "Blurred Map"

Imagine you are driving a car. You have a GPS (your model) that tells you where to turn.

  • The Perfect World: Your GPS is 100% accurate. You drive the perfect route.
  • The Real World: Your GPS is slightly off. Maybe it thinks a road is 10 meters to the left, or it doesn't know about a pothole.
  • The Question: If you follow the instructions from this "bad" GPS, how much extra gas (cost) will you waste compared to someone with a perfect GPS?

The authors call this difference the "Robustness Error." They want to prove that if your GPS isn't too wrong, you won't crash, and you won't waste too much fuel.

2. The Secret Weapon: The "Wasserstein Distance"

Usually, when scientists compare two maps, they look for exact matches. If Map A says "Road here" and Map B says "Road there," they might say the maps are totally different.

But this paper uses a special tool called the Wasserstein-1 distance (think of it as the "Moving Dirt" distance; it is also known as the earth mover's distance).

  • The Analogy: Imagine you have a pile of sand (the real world) and a pile of sand on a slightly different spot (your model).
    • Old way (Total Variation): If the piles aren't in the exact same spot, the distance is huge. It's like saying, "These maps are useless!"
    • Wasserstein way: It asks, "How much effort does it take to move the sand from the model pile to the real pile?" If the piles are close, it takes very little effort. The distance is small.

Why does this matter? In real life (learning from data), your model will almost never be in the exact same spot as reality. It will just be close. The Wasserstein distance is perfect for this because it says, "Hey, they are close enough, so the robot will still do a good job."
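To make the two rulers concrete, here is a tiny sketch (my own toy example, not from the paper) comparing total variation and Wasserstein-1 for two "sand piles" sitting 0.1 apart: TV calls them maximally different, while W1 reports only the small shift.

```python
# Toy illustration: two point masses, one at 0.0 (reality) and one at 0.1
# (the model). Total variation says "totally different"; Wasserstein-1
# says "off by 0.1".

def tv_distance(p, q, support):
    """Total variation between two discrete distributions on a shared support."""
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def w1_distance_1d(p, q):
    """Wasserstein-1 between discrete 1-D distributions via their CDFs:
    W1 = integral over the line of |F_p(x) - F_q(x)|."""
    points = sorted(set(p) | set(q))
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(points, points[1:]):
        cdf_p += p.get(a, 0.0)
        cdf_q += q.get(a, 0.0)
        total += abs(cdf_p - cdf_q) * (b - a)
    return total

real = {0.0: 1.0}    # all the sand at 0.0
model = {0.1: 1.0}   # all the sand at 0.1 -- "slightly off"
support = set(real) | set(model)

print(tv_distance(real, model, support))  # 1.0 -- the piles never overlap
print(w1_distance_1d(real, model))        # 0.1 -- moving the sand is cheap
```

Shift the model pile to 0.01 and W1 shrinks tenfold while TV stays stuck at 1.0, which is exactly why TV is the wrong ruler for learned models.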

3. The Two Scenarios: "The Discounted Trip" vs. "The Long Commute"

The paper looks at two ways of measuring "cost" (how bad the performance is):

  • Scenario A: The Discounted Trip (Short-term focus)
    Imagine you are on a road trip where you care a lot about the next few miles, but less and less about what happens 100 miles from now. The math here behaves like a contraction: each step further into the future is shrunk by a discount factor, so errors cannot pile up forever. The authors show that if your map is close (in the "Moving Dirt" sense), your extra fuel cost stays low.

  • Scenario B: The Long Commute (Average focus)
    Imagine you are driving to work every day for the rest of your life. You care about the average fuel efficiency over years, not just today. This is harder to analyze. The authors use a clever trick: they pretend you are on a "discounted trip" that gets longer and longer until it becomes a "long commute." They prove that even for this long-term view, a slightly blurry map won't ruin your daily commute.
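The "trip that gets longer until it becomes a commute" trick can be seen in a toy computation (a hedged sketch of the vanishing-discount idea, not the paper's proof): for a constant per-step cost, rescaling the discounted value by $(1 - \beta)$ and letting the discount factor $\beta$ approach 1 recovers the long-run average cost.

```python
# Hedged sketch: constant cost c per step, discount factor beta.
# Discounted value is c / (1 - beta), so (1 - beta) * V_beta tends to the
# long-run average cost c as beta -> 1 (the "vanishing discount" limit).

def discounted_value(per_step_cost, beta, horizon=100_000):
    """Truncated discounted sum: c * (1 + beta + beta^2 + ...)."""
    return sum(per_step_cost * beta**t for t in range(horizon))

c = 2.0
for beta in (0.9, 0.99, 0.999):
    print(beta, (1 - beta) * discounted_value(c, beta))  # approaches c = 2.0
```

The same rescaling is what lets a bound proved for "discounted trips" be pushed to the "long commute" (average-cost) setting.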

4. Learning from Data: The "Sample Complexity"

This is the most practical part. The paper asks: "How many times do I need to watch the robot drive before I can trust the map I built?"

  • The Single Path: Imagine you only have one video of the robot driving through the maze. You have to learn the map from just that one path. The paper gives you a formula: "If you watch for $N$ steps, your error will be roughly $1/\sqrt{N}$."
  • The Simulator: Imagine you have a video game where you can reset the robot to any spot and try any move as many times as you want. This is much easier! The paper shows that with this "reset" ability, you learn the map much faster.

The Takeaway: The more data you have, the closer your "Moving Dirt" distance gets to zero, and the better your robot performs.
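You can watch the $1/\sqrt{N}$ rate in a toy experiment (illustrative only; the paper's bounds cover general transition kernels, not a coin): estimate a fair coin from $N$ flips and measure the Wasserstein-1 error between the empirical and true distributions.

```python
# Hedged toy experiment: for outcomes in {0, 1}, the Wasserstein-1 distance
# between the empirical law of n flips and the true fair coin is simply
# |empirical P(0) - 0.5|. Quadrupling n should roughly halve the error.

import random

random.seed(0)

def w1_error(n, trials=200):
    """Average W1 error of the empirical distribution over many repetitions."""
    total = 0.0
    for _ in range(trials):
        zeros = sum(random.random() < 0.5 for _ in range(n))
        total += abs(zeros / n - 0.5)
    return total / trials

for n in (100, 400, 1600):
    print(n, w1_error(n))  # error shrinks roughly like 1 / sqrt(n)
```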

5. The "Noise" Factor: When the World is Unpredictable

Sometimes, the robot doesn't just follow a map; it gets pushed by the wind (noise).

  • The Problem: You know the rules of the game (the physics), but you don't know the exact pattern of the wind. You have to guess the wind's behavior based on past gusts.
  • The Solution: The paper shows that if you estimate the "wind distribution" correctly (using the "Moving Dirt" distance), your robot will still navigate the storm safely. Even if you guess the wind is slightly wrong, the robot won't crash.
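Here is a minimal sketch of that idea (the dynamics, controller, and numbers are invented for illustration; the paper treats general systems): the rule $x' = x + u + w$ is known, only the "wind" $w$ must be learned from past gusts, and a controller tuned to the estimated wind performs almost as well as one tuned to the true wind.

```python
# Hedged sketch: known linear dynamics x' = x + u + w, unknown noise w.
# We estimate the average wind from past samples, then steer against it.

import random

random.seed(1)

true_mean = 0.3  # the real average wind (unknown to the controller)
past_gusts = [random.gauss(true_mean, 1.0) for _ in range(500)]
est_mean = sum(past_gusts) / len(past_gusts)

def avg_cost(wind_guess, steps=10_000):
    """Run x' = x + u + w with u = -x - wind_guess; cost is x^2 per step."""
    x, total = 0.0, 0.0
    for _ in range(steps):
        w = random.gauss(true_mean, 1.0)
        x = x + (-x - wind_guess) + w  # cancel the state and the expected wind
        total += x * x
    return total / steps

print(avg_cost(true_mean))  # controller that knows the true wind
print(avg_cost(est_mean))   # controller with the learned wind: nearly as good
```

The leftover cost gap is driven by how far `est_mean` sits from `true_mean`, i.e. by the distance between the estimated and true noise distributions, which is the paper's point.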

Summary: What Did They Actually Prove?

  1. Stability: If your model is "close enough" to reality (measured by how much effort it takes to move the probability sand), your robot's performance won't collapse. It will just be slightly less efficient.
  2. The Metric: They proved that Wasserstein distance is the right ruler to use for this job, especially when learning from messy, real-world data.
  3. The Cost: They calculated exactly how much "efficiency" you lose based on how "wrong" your map is.
  4. The Data: They told you exactly how much data you need to collect to get a map that is "good enough" for the job.

In a nutshell: You don't need a perfect map to drive a car. You just need a map that is "close enough" in the right way. This paper gives you the math to prove that "close enough" is actually good enough, and tells you how much data you need to get there.