Imagine you have a very smart, well-read student (the Base Model) who has studied a massive library of books. This student is great at answering questions based on what they've already read. However, they sometimes struggle with brand-new types of questions or complex problems that require a bit of creative reasoning beyond their existing knowledge.
To make this student even better, you decide to hire a Tutor (the Reward Model) to give them feedback. The paper you're asking about investigates how this tutoring process works, specifically looking at two different ways the tutor can give feedback: Outcome Rewards and Process Rewards.
Here is the breakdown of the paper's findings using simple analogies.
1. The Two Ways to Tutor
Outcome Rewards: The "Final Grade" Approach
Imagine the student writes an entire essay (a sequence of words). The tutor only reads the final essay and gives a simple grade: "Pass" or "Fail."
- The Problem: If the student gets a "Fail," they have no idea where they went wrong. Did they mess up the first sentence? The middle? The conclusion?
- The Paper's Finding: If the student's initial draft was already decent (they had a "non-trivial likelihood" of being right), this method works well. The student can tweak their writing to get a "Pass."
- The Barrier: However, if the student's initial draft was completely wrong (like writing gibberish), the tutor's "Fail" grade gives them almost no useful information. To fix a completely wrong essay from scratch, the student would have to guess and check exponentially many times (like trying every possible string of letters until one happens to be the essay). This is the "Base Model Barrier": if the student doesn't already roughly know the answer, this method can't teach them.
Process Rewards: The "Step-by-Step" Approach
Now, imagine the tutor is sitting next to the student as they write. After every single word (token) the student writes, the tutor says, "Good word!" or "Bad word!"
- The Advantage: If the student writes a wrong word, the tutor stops them immediately. The student can correct that specific word before moving on.
- The Paper's Finding: This method is a game-changer. Even if the student starts with a bad idea, the tutor can guide them step-by-step to the correct answer. The student doesn't need to guess the whole essay at once; they just need to get the next word right.
- The Result: This avoids the "Base Model Barrier." The student can learn to solve problems they couldn't solve before, and the number of attempts needed grows much more slowly (linearly) rather than exploding exponentially.
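The exponential-vs-linear gap above can be shown with a toy simulation (a sketch of the intuition, not the paper's actual setup): a "student" must produce a target string by random guessing. With only a pass/fail outcome signal, it must re-guess the entire string every time; with per-token feedback, it only has to get each position right before moving on.

```python
import random

ALPHABET = "ab"   # tiny alphabet so the outcome-only search finishes quickly
TARGET = "abba"   # the "correct essay" the student must produce
random.seed(0)

def outcome_reward_attempts(target):
    """Guess whole strings; the only feedback is pass/fail on the full answer."""
    attempts = 0
    while True:
        attempts += 1
        guess = "".join(random.choice(ALPHABET) for _ in target)
        if guess == target:          # the "final grade"
            return attempts

def process_reward_attempts(target):
    """Guess one token at a time; feedback arrives after every token."""
    attempts = 0
    for correct_token in target:
        while True:
            attempts += 1
            if random.choice(ALPHABET) == correct_token:
                break                # "Good word!" -- keep it and move on
            # otherwise "Bad word!" -- retry only this token
    return attempts

trials = 1000
outcome_avg = sum(outcome_reward_attempts(TARGET) for _ in range(trials)) / trials
process_avg = sum(process_reward_attempts(TARGET) for _ in range(trials)) / trials
print(f"outcome-only avg attempts: {outcome_avg:.1f}")  # ~ |alphabet|**len = 16
print(f"per-token    avg attempts: {process_avg:.1f}")  # ~ |alphabet| * len = 8
```

Even in this tiny example the outcome-only search needs about twice as many guesses, and the gap widens exponentially as the target gets longer, while the per-token cost grows only linearly.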
2. The "Likelihood Quantile" (The "Confidence Meter")
The paper introduces a concept called Likelihood Quantile (LQ). Think of this as a Confidence Meter for the student.
- The Scenario: You ask the student a question.
- High Confidence: The student thinks, "I'm 90% sure the answer is 'Paris'." (This is an "on-support" sample).
- Low Confidence: The student thinks, "I have no idea, maybe 'Paris', maybe 'Tokyo', maybe 'Mars'?" (This is an "off-support" sample).
- The Finding:
- With Outcome Rewards (Final Grade), if the student's confidence is low, the tutor can't help them much. The student is stuck in a loop of guessing.
- With Process Rewards (Step-by-Step), the tutor can boost the student's confidence one word at a time. Even if the student starts with low confidence, the tutor can guide them to a high-confidence correct answer.
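The "confidence meter" can be made concrete with a toy calculation (hypothetical numbers, not figures from the paper): treat the student's confidence in the full correct answer as the product of their per-token confidences. Even moderate confidence at every step multiplies out to a tiny whole-answer likelihood, which is exactly the "off-support" regime where outcome rewards stall but per-token guidance still works.

```python
import math

# Hypothetical per-token probabilities the student assigns to each correct
# token of a 20-token answer: moderate (50%) confidence at every step.
per_token_prob = [0.5] * 20

# Whole-answer likelihood: the chance of producing the entire correct answer
# in one shot -- the quantity an outcome reward needs to be non-trivial.
sequence_likelihood = math.prod(per_token_prob)
print(f"whole-answer likelihood: {sequence_likelihood:.2e}")

# Expected guesses before a "final grade" reward ever says "Pass":
print(f"expected outcome-reward attempts: {1 / sequence_likelihood:.0f}")

# With a process reward, each token only needs to clear its own 50% chance:
expected_process_attempts = sum(1 / p for p in per_token_prob)
print(f"expected process-reward attempts: {expected_process_attempts:.0f}")
```

With these numbers, the outcome-only student needs on the order of a million attempts (0.5 to the 20th power is about one in a million), while the step-by-step student needs about 40: low whole-answer confidence is fatal for final grades but harmless when feedback comes one word at a time.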
3. The "Curse of Dimensionality" (The "Needle in a Haystack")
The paper explains why the "Final Grade" method fails for hard problems using a metaphor of a Haystack.
- The Problem: Imagine the correct answer is a single needle in a giant haystack.
- If the student is already holding a piece of straw close to the needle (the Base Model is good), the "Final Grade" tutor can help them find the needle quickly.
- If the student is holding a random piece of straw far from the needle, the "Final Grade" tutor just says "Wrong." The student has to throw that straw away and pick a new one. Since there are exponentially many pieces of straw, they might never find the needle.
- The Solution: The "Step-by-Step" tutor acts like a metal detector. They don't wait until the end; they scan the hay as the student picks it up. They tell the student, "No, that's not it, try this one," immediately. This turns an impossible search into a manageable walk.
4. The "Base Model Barrier" Explained Simply
The paper's most important conclusion is this: You cannot teach a student to solve a problem they have zero intuition about using only "Final Grades."
- If the student's pre-training (their initial knowledge) is weak for a specific topic, simply giving them a "Pass/Fail" grade on the final answer won't help them learn. They will hit a wall.
- To break through this wall, you need Process Rewards (feedback on the steps). This allows the student to build the solution from the ground up, even if they started with nothing.
Summary: What Should We Do?
- If the AI is already pretty good at a task (like writing a poem or solving a simple math problem), using Outcome Rewards (checking the final answer) is efficient and works well.
- If the AI is struggling or needs to learn something completely new (like complex reasoning or coding a new algorithm), you must use Process Rewards (checking the steps). Without this step-by-step guidance, the AI will likely get stuck and never learn the new skill, no matter how much you try to train it.
In a nutshell: You can't just grade the final exam to teach a student a new subject; you need to grade their homework and quiz them on every step along the way. This paper proves mathematically that this "step-by-step" approach is the only way to break through the limits of what an AI already knows.