Imagine you are trying to learn how to solve a complex puzzle, like a math problem or a coding challenge. You have a teacher (the "demonstrator") who shows you the solution. But here's the catch: there isn't just one correct answer. There are millions of different ways to solve that math problem, all of them perfectly valid. Your teacher shows you one specific way, but your goal isn't to copy their exact handwriting or word choice; your goal is simply to produce any correct solution.
This paper tackles a fundamental problem in Artificial Intelligence: How do we teach an AI to find a good answer, rather than just copying the teacher's specific answer?
The Old Way: The "Parrot" Approach
Traditionally, when we train AI (like Large Language Models), we use a method called Supervised Fine-Tuning (SFT). Think of this as teaching a parrot.
- The Method: You show the parrot a question and the teacher's answer. The parrot tries to mimic the teacher's answer as closely as possible.
- The Flaw: This works great if there is only one right answer. But if there are millions of right answers, the parrot gets confused. It tries to memorize the teacher's specific style. If the teacher writes "The answer is 42," the parrot learns to write "The answer is 42." If the teacher wrote "42 is the answer," the parrot might fail to learn that "42" is the core truth.
- The Paper's Discovery: The authors prove that this "Parrot" approach (technically called Maximum Likelihood Estimation) often fails when there are many correct answers. It tries to clone the teacher's distribution, which is an impossible and unnecessary task.
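To make the "parrot" failure concrete, here is a minimal toy sketch (not the paper's notation) of the SFT / maximum-likelihood objective: the model is penalized by the negative log-probability of the teacher's exact answer string, so an equally correct answer phrased differently earns no credit. The answers and probabilities below are made up for illustration.

```python
import math

def sft_loss(model_probs: dict, teacher_answer: str) -> float:
    """Negative log-likelihood of the teacher's specific answer.

    Every other correct answer contributes nothing: the model is pushed
    toward the teacher's phrasing, not toward correctness itself.
    """
    return -math.log(model_probs[teacher_answer])

# Two answers that are equally correct, phrased differently:
probs = {"The answer is 42": 0.10, "42 is the answer": 0.85}

# The teacher happened to write the first phrasing, so the model is
# penalized heavily even though it puts most of its probability mass
# on a correct answer.
loss = sft_loss(probs, "The answer is 42")
```

Note that the loss here is large (about 2.3 nats) purely because of a phrasing mismatch, which is exactly the "impossible and unnecessary" cloning task the paper criticizes.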
The New Way: The "Hedge Fund" Approach
The authors propose a smarter strategy. Instead of trying to guess what the teacher will say next, the AI should try to figure out what makes an answer "good."
Imagine the AI is a Hedge Fund Manager.
- The Goal: The manager doesn't care about copying a specific investor's portfolio. They care about profit (the reward).
- The Strategy: The manager has a list of possible "market theories" (Reward Models). Some theories say "Buy Tech stocks," others say "Buy Gold."
- The Process:
  1. The teacher shows a correct answer (e.g., "Buy Gold").
  2. The AI checks its list of theories: "Does the 'Buy Gold' theory agree with this answer? Yes. Does the 'Buy Tech' theory agree? No."
  3. The AI punishes the theories that disagree with the teacher's correct answer by lowering their "weight" (its trust in them).
  4. Crucially, the AI also rewards the theories that would have predicted the teacher's answer, even if the AI itself guessed wrong.
- Over time, the AI keeps the theories that consistently predict "good" answers and discards the bad ones.
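The process above can be sketched as a simple multiplicative-weights update over candidate reward models. This is an illustrative toy, not the paper's exact algorithm: the two reward models, the learning rate, and the update rule are all assumptions made for the sketch.

```python
import math

# Two made-up "market theories" (candidate reward models): each one
# scores an answer, and we maintain a weight (trust) for each theory.
def reward_tech(answer: str) -> float:
    return 1.0 if answer == "Buy Tech" else 0.0

def reward_gold(answer: str) -> float:
    return 1.0 if answer == "Buy Gold" else 0.0

def hedge_update(weights, reward_models, teacher_answer, lr=1.0):
    """One multiplicative-weights step: trust grows for theories that
    rate the teacher's demonstrated answer highly, shrinks otherwise."""
    new = [w * math.exp(lr * r(teacher_answer))
           for w, r in zip(weights, reward_models)]
    total = sum(new)
    return [w / total for w in new]

models = [reward_tech, reward_gold]
weights = [0.5, 0.5]  # start with equal trust in both theories

# The teacher repeatedly demonstrates "Buy Gold"; after a few
# demonstrations the "Buy Gold" theory dominates.
for _ in range(3):
    weights = hedge_update(weights, models, "Buy Gold")
```

The key design point mirrors the text: the update never asks the AI to reproduce the teacher's answer, only to keep the theories under which that answer would have scored well.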
Why This is a Big Deal
The paper introduces a concept called the "Reward Class Assumption."
- Old Assumption: "The teacher is a genius who always picks from a small, specific set of strategies." (Hard to prove, often false).
- New Assumption: "The definition of a 'good answer' comes from a small, manageable set of rules." (Much easier to believe).
The Analogy of the "Perfect Essay":
Imagine a teacher grading essays.
- The Parrot (Old Way): Tries to copy the teacher's favorite student's essay word-for-word. If the student used a specific metaphor, the Parrot uses it. If the student made a typo, the Parrot makes it too.
- The Hedge Fund (New Way): Tries to understand the rubric. "The teacher likes essays that use metaphors and have no typos." The Hedge Fund doesn't care which metaphor is used, as long as it fits the rubric. It learns to write any essay that gets an A.
The "Optimistic" Speed Boost
The authors also found that their new method is incredibly fast when the teacher is perfect.
- Standard Learning: Usually, getting very good requires making mistakes and learning from them slowly; the error typically shrinks on the order of 1/√N in the number of examples N.
- Their Method: If the teacher is always right, the AI learns far faster, with error shrinking nearly like 1/N (up to logarithmic factors). It's like having a "super learner" that only needs to see a few examples to figure out the rules of the game, rather than memorizing every single play.
Summary for the Everyday Person
This paper argues that when teaching AI to solve problems with many correct solutions (like coding, math, or creative writing), we should stop trying to make the AI copy the teacher's style. Instead, we should teach the AI to understand the rules of what makes a solution correct.
By focusing on the "rules of the game" (the reward) rather than "mimicking the player" (the policy), we can build AI that is more robust, learns faster, and doesn't get stuck trying to be a perfect clone of a human who might just be one of many possible experts.
In short: Don't teach the AI to be a photocopier; teach it to be a detective that figures out what "correct" looks like.