Imagine you are trying to teach a robot to play a complex video game (like a strategy game or a racing simulator). You can't let the robot play the game millions of times to learn by trial and error because it's too expensive or dangerous. Instead, you give the robot a notebook filled with recordings of how a human played the game in the past. This is called Offline Reinforcement Learning.
However, there are two big problems with this notebook:
- The "Noise" Problem: The notebook is huge, but most of it is useless. Imagine a notebook with 10,000 pages, but only 10 pages actually contain the secret strategy to win. The other 9,990 pages are just random scribbles. This is the Sparse problem: the data is high-dimensional (huge) but only a tiny, sparse part of it matters.
- The "Sabotage" Problem: Imagine a mischievous prankster got hold of your notebook before you gave it to the robot. They took 10% of the pages, ripped them up, and replaced them with fake instructions designed to make the robot crash. This is Corruption.
The paper you shared tackles a very difficult question: Can we teach the robot to play near-perfectly using this messy, huge, and sabotaged notebook, even if we don't have enough clean pages to cover every possible situation?
The Old Way: The "Over-Paranoid" Coach (LSVI)
For a long time, the standard way to teach robots from notebooks was a method called LSVI (Least-Squares Value Iteration). Think of this as a coach who is extremely paranoid.
- How it works: The coach looks at every single move in the notebook. If there is any doubt about whether a move is safe, the coach assumes the worst possible outcome and penalizes that move heavily.
- The Flaw: In a normal, small notebook, this paranoia works. But in our "Sparse" notebook (10,000 pages, only 10 useful), the coach gets confused. Because the coach doesn't know which 10 pages are the real ones, they try to be paranoid about every single page.
- The Result: The coach becomes so scared of the fake pages that they start punishing the real good moves too. They end up telling the robot to do nothing, or to play terribly, because the uncertainty penalty (a "safety margin" subtracted from every move's score) grows with the dimension of the data — all 10,000 pages, not just the 10 useful ones — until it swamps the actual rewards. It's like a coach telling a player, "Don't run, don't jump, don't breathe, because you might get hurt," resulting in a player who never moves.
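The blow-up can be seen in a few lines of NumPy. This is a toy illustration of classical pessimistic LSVI, not the paper's setup: the sizes, the unit-norm features, and the sqrt(d) scaling for `beta` are stand-ins for the worst-case constants in the standard analyses.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 200, 20                                  # dimension far larger than dataset size
X = rng.standard_normal((n, d)) / np.sqrt(d)    # n logged feature vectors
Lam = np.eye(d) + X.T @ X                       # regularized data covariance

phi = rng.standard_normal(d)
phi /= np.linalg.norm(phi)                      # a new move to evaluate

# Pessimistic LSVI subtracts beta * sqrt(phi^T Lam^-1 phi) from the move's
# score; in worst-case analyses beta grows like sqrt(d).
beta = np.sqrt(d)
bonus = float(np.sqrt(phi @ np.linalg.solve(Lam, phi)))
penalty = beta * bonus

# Rewards live in [0, 1], so a penalty this large wipes out any real signal.
```

With only 20 data points in 200 dimensions, almost every direction is "uncertain", so the penalty dwarfs the reward scale — exactly the "don't breathe" coach.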
The New Way: The "Smart Scout" (Actor-Critic)
The authors of this paper propose a new method called Sparse Robust Actor-Critic. Instead of one paranoid coach, they use a team with two roles:
- The Actor (The Player): This is the robot trying to learn the strategy.
- The Critic (The Scout): This is the coach who evaluates the player.
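These two roles can be sketched as a simple alternating loop. The toy below (a three-armed bandit learned from logged data) is my illustration of the generic actor-critic pattern, not the paper's Sparse Robust Actor-Critic — here the Critic scores moves with a plain average, whereas the paper's critic uses a robust, sparse estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged "notebook": 300 (move, outcome) pairs from a past player.
# Move 2 is secretly the best (average outcome 1.0 vs 0.2 and 0.5).
true_means = [0.2, 0.5, 1.0]
actions = rng.integers(0, 3, size=300)
rewards = np.array([rng.normal(true_means[a], 0.1) for a in actions])

prefs = np.zeros(3)  # the Actor's preference for each move

for _ in range(50):
    policy = np.exp(prefs) / np.exp(prefs).sum()
    # Critic: score each move using only the logged data.
    q = np.array([rewards[actions == a].mean() for a in range(3)])
    # Actor: shift preferences toward moves scored above average.
    prefs += 0.5 * (q - policy @ q)

policy = np.exp(prefs) / np.exp(prefs).sum()
best_arm = int(np.argmax(policy))
```

After the loop, the Actor's policy concentrates on the genuinely best move, even though the robot never played the game itself.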
Here is the magic trick:
Instead of the Scout being paranoid about every single move in the entire universe, the Scout only looks at the moves the current Player is actually trying to make.
- The Analogy: Imagine the Player is trying to find a path through a dense forest. The Scout doesn't need to know if the entire forest is safe. The Scout only needs to check if the specific path the Player is walking on is safe.
- Why it works with Sparse Data: Because the Player only cares about a few specific paths (the "sparse" part), the Scout can ignore the 9,990 useless pages and focus only on the 10 pages that matter.
- Why it works with Corruption: The Scout uses a special "Truth Detector" (a robust estimator). Even if the prankster swapped some pages, the Truth Detector can look at the pattern of the remaining pages and figure out the real strategy, ignoring the fake ones.
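The simplest version of such a "Truth Detector" is swapping an average for a median. This toy example is mine, not the paper's estimator, but it shows why a robust statistic shrugs off the prankster's pages:

```python
import numpy as np

# 100 "pages": the honest ones all report a reward of 5.0.
rewards = np.full(100, 5.0)
# A prankster rewrites 10% of them with a wild fake value.
rewards[:10] = 1000.0

naive = rewards.mean()        # dragged far from the truth: 104.5
robust = np.median(rewards)   # ignores the outliers: 5.0
```

Ten fake pages out of a hundred pull the naive average twenty-fold off target, while the median doesn't budge. The paper's critic applies the same principle to a much harder problem: robust regression over sparse, high-dimensional features.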
The Big Breakthrough
The paper proves two amazing things:
- It works when data is scarce: Even if the notebook has fewer pages than there are dimensions describing the game (a situation where old methods fail completely), this new method can still find the winning strategy, because only a few of those dimensions actually matter.
- It survives sabotage: Even if a significant chunk of the notebook is fake, the method can still learn a near-perfect strategy.
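The "fewer pages than dimensions" claim is the classic sparse-recovery setting: more unknowns than equations, yet the sparse answer is still findable. Here is a minimal sketch using ISTA (iterative soft-thresholding, a standard Lasso solver) — the sizes, indices, and penalty here are made-up illustrations, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, k = 50, 25, 3            # 50 "pages", only 25 observations, 3 truly matter
w_true = np.zeros(d)
w_true[[4, 17, 33]] = [2.0, -1.5, 3.0]

X = rng.standard_normal((n, d))
y = X @ w_true                 # fewer equations than unknowns

# ISTA: gradient step on the squared error, then soft-threshold (Lasso).
lam, w = 0.05, np.zeros(d)
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
for _ in range(3000):
    w = w - step * (X.T @ (X @ w - y))
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

support = {int(i) for i in np.argsort(np.abs(w))[-k:]}
```

With 25 equations and 50 unknowns an ordinary least-squares fit is hopelessly underdetermined, but the sparsity-seeking solver still recovers which three coordinates carry the signal — the "10 useful pages".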
Summary in a Nutshell
- The Problem: We have a huge, messy, and partially fake notebook of past experiences, and we need to learn a winning strategy from it.
- The Old Solution: A paranoid coach who tries to be safe about everything gets overwhelmed and fails.
- The New Solution: A smart team where the coach only checks the safety of the specific moves the player is actually making. This allows them to ignore the noise, filter out the fakes, and find the hidden "sparse" truth.
This paper is a big deal because it's the first time we've mathematically proven that you can learn effectively from a corrupted, high-dimensional dataset without needing a perfect, massive amount of clean data. It's like teaching a robot to win a game using a notebook that was dropped in a mud puddle and then shredded by a dog, and still coming out with a near-perfect strategy.