Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

This paper proposes a Two-Phase Suffix Imitation framework that enables a reward-free observer to recover optimal policies from a non-stationary learner's actions by discarding initial exploration data, achieving a convergence rate of O~(1/N)\tilde O(1/\sqrt{N}) comparable to fully reward-aware learning.

Yuqi Kong, Xiao Zhang, Weiran Shen

Published 2026-03-05
📖 4 min read☕ Coffee break read

Imagine you are a detective trying to figure out how a master chef decides what to cook, but you have a major problem: you can't taste the food.

You only get to watch the chef's hands. You see them pick up a tomato, then a basil leaf, then a knife. You see the final dish, but you never get to know if it was delicious, salty, or burnt. The chef is learning as they go: at first, they are guessing wildly (exploration), but eventually, they figure out the perfect recipe and stick to it (exploitation).

This paper is about solving a puzzle called the Inverse Contextual Bandit. Here is the breakdown in simple terms:

The Problem: The "Noisy" Apprentice

Usually, when we try to learn from an expert (like in AI), we assume the expert is perfect from the start. But in real life, learners (like the chef or a recommendation algorithm) start as novices.

  • Early Days: The learner is confused. They try random things. If you copy their early moves, you learn bad habits.
  • Later Days: The learner becomes an expert. Their moves are perfect.

The challenge for the "Observer" (you, the detective) is that you only see the actions, not the rewards (did the customer like it?). If you try to learn from the entire history of the chef, you will get confused because the early, messy experiments drown out the later, perfect moves.

The Solution: "Two-Phase Suffix Imitation"

The authors propose a clever, slightly counter-intuitive strategy: Throw away the beginning.

Think of it like watching a movie but skipping the first 20 minutes.

  1. Phase 1 (The Burn-in): The observer ignores the first chunk of data. They pretend the learner was just "babbling" or "exploring." They throw this data in the trash.
  2. Phase 2 (The Imitation): The observer only watches the end of the movie (the "suffix"). By this time, the learner has figured out the pattern. The observer copies only these later, high-quality moves.

The Analogy: Imagine a student taking a practice test.

  • Week 1: They guess randomly.
  • Week 10: They know the answers perfectly.
  • The Mistake: If you try to teach a new student by showing them the Week 1 answers, the new student will fail.
  • The Fix: You only show them the Week 10 answers. Even though you threw away 90% of the data, the new student learns faster and better because the data they did get was clean and correct.

The Big Surprise: "Less Data is Better Data"

The most shocking part of this paper is the math.

  • The Learner (the chef) has all the information: they taste the food and know if it's good.
  • The Observer (you) has zero information about taste. You only see the ingredients.

Usually, having less information means you perform worse. But the authors prove that by ignoring the early, noisy data, the Observer can actually catch up to the Learner.

Even though the Observer never tasted a single dish, by only copying the chef's final, confident decisions, they can reconstruct the "perfect recipe" just as well as the chef who tasted everything.

Why This Matters

This is huge for the real world because:

  1. Privacy: Often, we can't see the "rewards" (like user satisfaction scores or medical outcomes) because they are private. We only see the actions (what they clicked or what drug they took).
  2. Efficiency: We don't need to wait for perfect data. We just need to wait until the learner has "calmed down" and stop copying their early mistakes.

The Bottom Line

The paper teaches us that when trying to learn from someone who is still learning, patience is a strategy. Don't try to learn from their whole journey; just wait until they get good, and then copy their final moves. By discarding the "noise" of their early struggles, you can uncover the truth even without seeing the results.