Imagine you are trying to teach a robot how to play a video game. Usually, you'd let the robot play thousands of times, learn from its mistakes, and get better. This is Reinforcement Learning (RL).
But what if you can't let the robot play? What if you only have a video recording of a human playing the game once, and you have to teach the robot just by watching that one video? This is called Offline Reinforcement Learning.
The problem is: The human in the video might only have played the "easy" levels or taken specific paths. If the robot tries to play a level the human never visited, it might get lost or make a terrible mistake because it has no data on what happens there.
This paper is about a new, smarter way to teach the robot using that single video recording, specifically when we want the robot to be creative (explore new moves) but also safe (stick close to what the human did).
The Two Main Characters: The "Strict Teacher" and the "Creative Coach"
The paper looks at two different ways to teach the robot, using a concept called Regularization. Think of this as a rule we add to the robot's learning process to keep it in check.
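In code, the general shape of such a regularized objective can be sketched as follows (a minimal illustration, not the paper's exact formulation; the per-action reverse-KL penalty, the reward values, and `beta` are assumptions for the sketch):

```python
import math

def regularized_objective(reward, policy_prob, human_prob, beta=1.0):
    """Reward minus a penalty for deviating from the human's behavior.

    The penalty here is a per-action reverse-KL term, log(policy / human),
    weighted by beta; a larger beta keeps the learner closer to the data.
    """
    penalty = math.log(policy_prob / human_prob)
    return reward - beta * penalty

# Matching the human's action probability incurs no penalty:
print(regularized_objective(1.0, 0.5, 0.5))  # 1.0, reward unchanged
# Overweighting an action the human rarely took is penalized:
print(regularized_objective(1.0, 0.9, 0.1))  # well below 1.0
```

Everything else in the paper is about how to choose and analyze that penalty term.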
1. The "Strict Teacher" (Reverse KL Divergence)
This is the most common method used today. Imagine a strict teacher who says: "You can try new things, but you must stay very close to the path the human took. If you wander too far, you get a huge penalty."
- The Old Problem: Previous research said, "To teach the robot well with this strict teacher, the human video must show every single possible move in the game." If the human skipped even one corner of the map, the robot would fail. This is a very high bar that is hard to meet in real life.
- The New Discovery: This paper proves you don't need the human to show everything. You only need the human to show the best path (the optimal path).
- The Secret Sauce: The authors invented a new teaching method called "Pessimism."
- Analogy: Imagine the robot is a nervous hiker. Instead of assuming the path is safe, the robot assumes the worst: "If I haven't seen this trail in the video, it's probably a cliff."
- By being overly cautious about unknown areas, the robot naturally avoids wandering off into the dark. This allows it to learn perfectly well even if the human video only covered the "best" route, not every single nook and cranny.
- Result: They proved this approach is optimal: they show a matching lower bound, meaning no algorithm can do better under this specific "Strict Teacher."
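The "Pessimism" idea is commonly implemented as a lower confidence bound: shrink each state's value estimate by an uncertainty bonus that grows as data gets scarcer. A minimal sketch (the `1/sqrt(count)` bonus and the worst-case floor `v_min` are standard choices from the pessimism literature, assumed here rather than taken from this paper):

```python
import math

def pessimistic_value(mean_value, visit_count, v_min=0.0, c=1.0):
    """Lower-confidence-bound value estimate.

    The fewer times a state appears in the data, the more its value is
    shrunk toward the worst case. A state never seen in the video gets
    the worst-case value v_min outright: "assume it's a cliff."
    """
    if visit_count == 0:
        return v_min
    bonus = c / math.sqrt(visit_count)  # uncertainty shrinks with more data
    return max(mean_value - bonus, v_min)

print(pessimistic_value(5.0, 100))  # well-covered state: barely discounted
print(pessimistic_value(5.0, 1))    # rarely seen: heavily discounted
print(pessimistic_value(5.0, 0))    # never seen: worst case, 0.0
```

Because unknown states look bad, the learned policy drifts toward the well-covered (i.e., demonstrated) parts of the game on its own.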
2. The "Creative Coach" (Strongly Convex f-Divergence)
This is a newer, more advanced method. Imagine a coach who says: "You can try new things, but the penalty for straying from the human's path grows steeper and steeper the farther you go (at least quadratically)."
- The Magic: Unlike the "Strict Teacher," this coach uses a mathematical trick (strong convexity) that makes the penalty for wandering off so steep that the robot physically cannot go far from the human's path, even if it wanted to.
- The Big Breakthrough: The authors found that with this "Creative Coach," you don't need to worry about the video coverage at all!
- Analogy: It's like the robot is wearing a super-strong elastic leash. No matter how much it tries to run, the leash snaps it back to the human's path instantly. Because the leash is so strong, it doesn't matter if the human walked in a straight line or a zigzag; the robot learns the right moves perfectly without needing a "map" of every possible location.
- Result: They proved that with this method, the robot learns just as fast as theoretically possible, regardless of how "sparse" or limited the human's video is.
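One way to see the difference between the two regularizers is to compare how fast each penalty grows as the robot's policy drifts from the human's. In the sketch below, the reverse-KL-style penalty grows only logarithmically in the policy/data density ratio, while a chi-squared penalty grows quadratically, acting like the elastic leash. (Chi-squared is one common example of a strongly convex f-divergence; that the paper uses this particular one is an assumption of the sketch.)

```python
import math

def reverse_kl_penalty(ratio):
    """Reverse-KL-style penalty: logarithmic in the density ratio,
    so large deviations from the data are comparatively cheap."""
    return math.log(ratio)

def chi_squared_penalty(ratio):
    """Chi-squared-style penalty (a strongly convex f-divergence):
    quadratic in the density ratio -- the 'elastic leash'."""
    return (ratio - 1.0) ** 2

# The gap widens dramatically as the robot strays farther:
for ratio in (2.0, 10.0, 100.0):
    print(ratio, reverse_kl_penalty(ratio), chi_squared_penalty(ratio))
```

At a ratio of 100, the logarithmic penalty is still under 5 while the quadratic one is nearly 10,000, which is why the strongly convex coach never lets the robot get far from the data in the first place.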
The "Speedometer" of Learning
In the world of AI, we measure how good an algorithm is by how many "samples" (video frames) it needs to learn.
- Old Way: Needed roughly 1/ε² samples (Slow), where ε is the target error. If you wanted to be twice as accurate, you needed four times the data.
- This Paper's Way: Needs roughly 1/ε samples (Fast). If you want to be twice as accurate, you only need twice the data.
The authors showed that:
- With the Strict Teacher, you can achieve this "Fast" speed, but only if you use their new "Pessimistic" method and the human video covers the best path.
- With the Creative Coach, you can achieve this "Fast" speed without needing any specific coverage conditions. It just works.
The Real-World Test
The authors didn't just do math; they tested it.
- They simulated a robot playing a simple game and a complex game (using images of handwritten digits).
- They compared the "Strict Teacher" (which needed a lot of data when the human's demonstrations covered only a narrow slice of the game) vs. the "Creative Coach" (which learned fast even from those narrow demonstrations).
- The Result: The math held up. The "Creative Coach" was incredibly robust, and the "Strict Teacher" worked perfectly when they used their new pessimistic strategy.
Why Should You Care?
This is a huge step forward for AI Safety and Efficiency.
- Efficiency: We can train powerful AI models (like the ones that write code or chat with us) using much less data. We don't need millions of perfect examples; we just need good examples of the "best" behavior.
- Safety: By understanding exactly how much data we need, we can build AI that is less likely to hallucinate or make dangerous mistakes when it encounters something it hasn't seen before.
In a nutshell: This paper figured out the exact rules for teaching a robot from a single video. It showed that if you make the robot a little bit "scared" of the unknown (Pessimism), or if you use a really strong leash (Strong Convexity), you can teach it to be a master with very little data.