Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

Imagine you are a detective trying to figure out how a master chef decides what to cook, but you have a major problem: you can't taste the food.

You only get to watch the chef's hands. You see them pick up a tomato, then a basil leaf, then a knife. You see the final dish, but you never get to know if it was delicious, salty, or burnt. The chef is learning as they go: at first, they are guessing wildly (exploration), but eventually, they figure out the perfect recipe and stick to it (exploitation).

This paper is about solving a puzzle called the Inverse Contextual Bandit. Here is the breakdown in simple terms:

The Problem: The "Noisy" Apprentice

Usually, when we try to learn from an expert (like in AI), we assume the expert is perfect from the start. But in real life, learners (like the chef or a recommendation algorithm) start as novices.

Early Days: The learner is confused. They try random things. If you copy their early moves, you learn bad habits.
Later Days: The learner becomes an expert. Their moves are perfect.

The challenge for the "Observer" (you, the detective) is that you only see the actions, not the rewards (did the customer like it?). If you try to learn from the entire history of the chef, you will get confused because the early, messy experiments drown out the later, perfect moves.

The Solution: "Two-Phase Suffix Imitation"

The authors propose a clever, slightly counter-intuitive strategy: Throw away the beginning.

Think of it like watching a movie but skipping the first 20 minutes.

Phase 1 (The Burn-in): The observer ignores the first chunk of data. They pretend the learner was just "babbling" or "exploring." They throw this data in the trash.
Phase 2 (The Imitation): The observer only watches the end of the movie (the "suffix"). By this time, the learner has figured out the pattern. The observer copies only these later, high-quality moves.

The Analogy: Imagine a student taking a practice test.

Week 1: They guess randomly.
Week 10: They know the answers perfectly.
The Mistake: If you try to teach a new student by showing them the Week 1 answers, the new student will fail.
The Fix: You only show them the Week 10 answers. Even though you threw away 90% of the data, the new student learns faster and better because the data they did get was clean and correct.

The Big Surprise: "Less Data is Better Data"

The most shocking part of this paper is the math.

The Learner (the chef) has all the information: they taste the food and know if it's good.
The Observer (you) has zero information about taste. You only see the ingredients.

Usually, having less information means you perform worse. But the authors prove that by ignoring the early, noisy data, the Observer can actually catch up to the Learner.

Even though the Observer never tasted a single dish, by only copying the chef's final, confident decisions, they can reconstruct the "perfect recipe" just as well as the chef who tasted everything.

Why This Matters

This is huge for the real world because:

Privacy: Often, we can't see the "rewards" (like user satisfaction scores or medical outcomes) because they are private. We only see the actions (what they clicked or what drug they took).
Efficiency: We don't need to wait for perfect data. We just need to wait until the learner has "calmed down" and stop copying their early mistakes.

The Bottom Line

The paper teaches us that when trying to learn from someone who is still learning, patience is a strategy. Don't try to learn from their whole journey; just wait until they get good, and then copy their final moves. By discarding the "noise" of their early struggles, you can uncover the truth even without seeing the results.

Here is a detailed technical summary of the paper "Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation."

1. Problem Definition: Inverse Contextual Bandits (ICB)

The paper addresses the Inverse Contextual Bandit (ICB) problem, a setting where an Observer attempts to recover the underlying optimal policy (or environment parameters) of a Learner without access to the Learner's reward signals.

The Setting:
- The Learner: An adaptive agent interacting with a stochastic linear contextual bandit environment. The Learner observes contexts, selects actions, and receives rewards, using this feedback to update its policy (e.g., via LinUCB or LinTS).
- The Observer: A passive external entity that only observes the sequence of context-action pairs $(A_t, X_t, \hat{a}_t)$ generated by the Learner. The Observer never sees the rewards $r_t$.
The Core Challenge:
- Non-Stationarity: The Learner's behavior evolves over time. Early rounds are dominated by exploration (high noise, suboptimal actions), while later rounds converge toward exploitation (low noise, near-optimal actions).
- Information Deficit: The Observer has no reward signal, so it cannot fit the reward model directly as the Learner does. It must infer the policy from actions alone, as in Inverse Reinforcement Learning (IRL) or Behavior Cloning (BC).
- The Failure of Naive Approaches: Standard IRL/BC methods assume data is generated by a stationary, optimal expert. Applying these directly to bandit logs fails because they treat early, noisy exploratory data as optimal demonstrations, leading to poor policy recovery.
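To make the setting concrete, here is a minimal sketch of how such a reward-free log could be generated. It assumes Gaussian contexts, Gaussian reward noise, and a fixed exploration weight; the function name `run_linucb` and all of its parameters are hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def run_linucb(theta_star, n_rounds, n_arms, alpha_ucb=1.0, noise_sd=0.1, seed=0):
    """Simulate a LinUCB Learner and return the log an Observer would see.

    All settings here (Gaussian contexts, noise level, exploration weight)
    are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = np.eye(d)        # ridge-regularized Gram matrix
    b = np.zeros(d)      # running sum of reward-weighted chosen contexts
    contexts = np.empty((n_rounds, n_arms, d))
    actions = np.empty(n_rounds, dtype=int)
    optimal = np.empty(n_rounds, dtype=int)
    for t in range(n_rounds):
        X = rng.normal(size=(n_arms, d))              # one feature vector per arm
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                         # ridge estimate of theta_star
        bonus = np.sqrt(np.einsum("kd,de,ke->k", X, V_inv, X))
        a = int(np.argmax(X @ theta_hat + alpha_ucb * bonus))
        r = X[a] @ theta_star + noise_sd * rng.normal()  # seen by the Learner only
        V += np.outer(X[a], X[a])
        b += r * X[a]
        contexts[t] = X
        actions[t] = a
        optimal[t] = int(np.argmax(X @ theta_star))   # diagnostic, hidden from the Observer
    return contexts, actions, optimal
```

The Observer would receive only `contexts` and `actions`. The `optimal` array is returned purely so that one can check how often the Learner's choices agree with the best arm at different stages of the run.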

2. Methodology: Two-Phase Suffix Imitation

To overcome the non-stationarity and lack of rewards, the authors propose a framework called Two-Phase Suffix Imitation.

Core Insight: "Less data can be better data." By discarding the initial exploratory phase, the Observer can significantly improve the signal-to-noise ratio.
The Strategy:
1. Phase I (Burn-In): The Observer ignores the first $T(N)$ rounds of interaction. During this time, the Learner is exploring, and the data is considered unreliable.
2. Phase II (Imitation): The Observer collects data from rounds $T(N)+1$ to $N$. By this stage, the Learner has converged sufficiently, and its actions serve as a noisy but bounded proxy for the optimal policy.
Algorithm (Suffix Imitation via ERM):
- The Observer treats the Learner's chosen actions in Phase II as "labels" for the contexts.
- It performs Empirical Risk Minimization (ERM) to find a parameter vector $\tilde{\theta}$ that minimizes the 0-1 imitation loss:
  $\tilde{\theta} \in \arg \min_{\theta} \frac{1}{L(N)} \sum_{t=T(N)+1}^{N} \mathbb{I}[\pi_\theta(A_t, X_t) \neq \hat{a}_t]$
  where $L(N) = N - T(N)$ is the effective sample size.
- The resulting policy is $\tilde{\pi}(\cdot) = \arg \max_a \langle x_a, \tilde{\theta} \rangle$.
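The two phases can be sketched in a few lines of Python. Exact minimization of the 0-1 loss is generally intractable, so this sketch swaps in a softmax (cross-entropy) surrogate fitted by gradient descent; that substitution, the function names, and the step-size settings are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def zero_one_imitation_loss(theta, contexts, actions):
    """Fraction of rounds where the linear policy disagrees with the Learner."""
    preds = np.argmax(contexts @ theta, axis=1)
    return float(np.mean(preds != actions))

def suffix_imitation(contexts, actions, burn_in, n_steps=300, lr=0.5):
    """Phase I: drop the first `burn_in` rounds. Phase II: fit theta on the suffix.

    Uses a softmax surrogate instead of exact 0-1 ERM (an assumption of this sketch).
    contexts: (N, K, d) per-arm features; actions: (N,) Learner's chosen arms.
    """
    X = contexts[burn_in:]                     # suffix contexts, shape (L, K, d)
    a = actions[burn_in:]
    L, K, d = X.shape
    chosen = X[np.arange(L), a]                # features of the Learner's arms, (L, d)
    theta = np.zeros(d)
    for _ in range(n_steps):
        scores = X @ theta                     # (L, K)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)
        expected = np.einsum("lk,lkd->ld", p, X)      # softmax-weighted features
        grad = (expected - chosen).mean(axis=0)       # cross-entropy gradient
        theta -= lr * grad
    return theta

# Toy log: random "exploratory" actions early, clean actions afterwards.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])        # hypothetical environment parameter
demo_contexts = rng.normal(size=(600, 4, 3))
demo_actions = np.argmax(demo_contexts @ true_theta, axis=1)
demo_actions[:100] = rng.integers(0, 4, size=100)
theta_tilde = suffix_imitation(demo_contexts, demo_actions, burn_in=100)
print(zero_one_imitation_loss(theta_tilde, demo_contexts[100:], demo_actions[100:]))
```

Because the policy is an arg max, only the direction of `theta_tilde` matters; its scale is not identified from actions alone.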

3. Theoretical Analysis & Key Assumptions

The paper provides finite-sample guarantees for the Observer's performance, measured by Predictive Regret (the expected difference between the optimal reward and the reward of the recovered policy).

Dynamic Massart Noise Condition:
- The authors assume the Learner's error probability decreases over time. Specifically, there exists a non-increasing function $\eta(T)$ such that for $t > T$, the probability of the Learner making a mistake is bounded: $P(\hat{a}_t \neq a^*_t) \le \eta(T)$.
- Crucially, they require $\eta(T) < 1/2$ after the burn-in period, ensuring the Learner is "more right than wrong."
Key Theoretical Results:
- Lemma 2 (Noise Transfer): Under the Massart condition, the "clean" error rate (deviation from the true optimal arm) is bounded by the "noisy" imitation error (deviation from the Learner's action), scaled by a factor of $(1 - 2\eta(T))^{-1}$ .
- Theorem 5 (Regret Bound): The Observer achieves a predictive regret bound of:
  $\rho(\tilde{\pi}) \le \frac{C}{1 - 2\eta(T)} \sqrt{\frac{d \log K \cdot \log L(N)}{L(N)}}$
- Corollary 1 (Asymptotic Efficiency): If the Learner achieves a standard sublinear regret (e.g., $\tilde{O}(\sqrt{T})$) and the burn-in period is set conservatively (e.g., $T(N) = \Theta(N^\alpha)$ with $\alpha \in (0,1)$), the Observer achieves a convergence rate of $\tilde{O}(1/\sqrt{N})$.
- Significance: This rate matches the asymptotic efficiency of a fully reward-aware Learner, despite the Observer having zero reward feedback.
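The factor $(1 - 2\eta(T))^{-1}$ in Lemma 2 can be motivated by the standard Massart-noise argument. What follows is a sketch of that textbook argument, not the paper's proof. It assumes the noise bound holds conditionally on each context $x$ and introduces some notation for illustration: $\eta = \eta(T)$, $\pi^*$ for the optimal policy with arm $a^*(x)$, and $\mathrm{err}(\pi) = P(\pi(x) \neq \hat{a})$ for the imitation error.

```latex
% Fix a context x. If \pi(x) = a^*(x), both policies have the same
% imitation error at x. If \pi(x) \neq a^*(x), then \pi disagrees with
% \hat{a} whenever \hat{a} = a^*(x), so:
\begin{align*}
P(\pi(x) \neq \hat{a} \mid x) - P(\pi^*(x) \neq \hat{a} \mid x)
  &\ge P(\hat{a} = a^*(x) \mid x) - P(\hat{a} \neq a^*(x) \mid x) \\
  &\ge (1 - \eta) - \eta = 1 - 2\eta .
\end{align*}
% Averaging over contexts and rearranging:
\begin{align*}
\mathrm{err}(\pi) - \mathrm{err}(\pi^*)
  &\ge (1 - 2\eta)\, P\bigl(\pi(x) \neq a^*(x)\bigr), \\
P\bigl(\pi(x) \neq a^*(x)\bigr)
  &\le \frac{\mathrm{err}(\pi) - \mathrm{err}(\pi^*)}{1 - 2\eta} .
\end{align*}
```

The bound degrades as $\eta$ approaches $1/2$, which is why the burn-in must be long enough for the Learner to be "more right than wrong."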

4. Experimental Results

The authors validated the framework on linear contextual bandit environments using LinUCB and LinTS as Learners.

Burn-in Trade-off: Experiments showed a U-shaped curve for error vs. burn-in length (controlled by the exponent $\alpha$).
- Too small $\alpha$ (no burn-in): High error due to noise from exploration.
- Too large $\alpha$ (too much burn-in): High error due to insufficient sample size.
- Optimal $\alpha$: An intermediate value (e.g., $T = N^{0.9}$) balances label quality and sample quantity.
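The same trade-off is visible in the form of the Theorem 5 bound when it is evaluated over a grid of $\alpha$ values. The sketch below does this under strong assumptions: the error-rate model `eta_model` (decaying like $c/\sqrt{T}$ and capped just below $1/2$), the constants `c`, `C`, `d`, `K`, and the horizon `N` are all hypothetical. The location of the minimum depends entirely on these choices and is not a result from the paper.

```python
import numpy as np

def eta_model(T, c=5.0, cap=0.49):
    """Hypothetical Learner error rate: decays like c / sqrt(T), capped below 1/2."""
    return np.minimum(cap, c / np.sqrt(T))

def regret_bound(alpha, N, d=5, K=5, C=1.0):
    """Evaluate the Theorem 5 expression with burn-in T = N**alpha.

    C, d, K and the error-rate model are illustrative assumptions.
    """
    T = N ** alpha                  # burn-in length
    L = N - T                       # effective sample size L(N)
    noise_factor = 1.0 / (1.0 - 2.0 * eta_model(T))
    return C * noise_factor * np.sqrt(d * np.log(K) * np.log(L) / L)

alphas = np.linspace(0.1, 0.99, 90)
bounds = regret_bound(alphas, N=100_000)
best_alpha = float(alphas[np.argmin(bounds)])
print(f"alpha minimizing the bound under these assumptions: {best_alpha:.2f}")
```

With a short burn-in the noise factor dominates, and with a long one the shrinking $L(N)$ dominates, so the minimizer sits strictly inside the grid for this choice of constants.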
Comparison with Learner:
- Naive Imitation (No Burn-in): Performed significantly worse than the Learner.
- Suffix Imitation (Rule-based): With $T=N^{0.9}$, the Observer's parameter estimation error converged to a level comparable to the Learner itself, and in some cases (Oracle burn-in), even outperformed the Learner's online baseline.
Conclusion: The Observer successfully recovered the true decision boundaries and parameters solely from action logs, effectively "filtering out" the exploration noise.

5. Key Contributions

Formalization of ICB: Defined the Inverse Contextual Bandit problem where an observer learns from a non-stationary, reward-free setting.
Two-Phase Suffix Imitation: Proposed a simple, effective strategy that discards exploratory data to handle non-stationarity, transforming the problem into a bounded-noise imitation task.
Theoretical Guarantees: Proved that a reward-free observer can achieve the same $\tilde{O}(1/\sqrt{N})$ convergence rate as a reward-aware learner, provided the burn-in period is chosen correctly.
Empirical Validation: Demonstrated that passive observers can recover optimal policies with accuracy comparable to active learners, challenging the intuition that reward signals are strictly necessary for high-performance policy recovery.

6. Significance

This work fundamentally shifts the perspective on Learning from Demonstrations in sequential decision-making. It demonstrates that:

Reward signals are not always necessary: In adaptive learning environments, the actions of a converging agent encode sufficient information to recover the underlying utility function.
Non-stationarity is a feature, not just a bug: By recognizing that early data is "noisy" and late data is "clean," one can design simple filtering mechanisms (suffix imitation) that outperform complex IRL methods designed for stationary experts.
Practical Application: This has significant implications for auditing AI systems, reverse-engineering recommendation algorithms, and clinical trial analysis where reward data (e.g., patient outcomes) may be private, delayed, or unavailable, but interaction logs are accessible.
