The Big Question: Did We Teach the Model, or Did We Just Wake It Up?

Imagine you have a very talented but slightly confused musician (the AI model) who has practiced for years on their own (pre-training). Now, you want to teach them a new song.

There is a big debate in the AI world about how we teach them.

Method A (SFT): You play them a recording of a perfect performance and say, "Copy this exactly."
Method B (RL): You let them play, and every time they hit a good note, you give them a treat. Every time they hit a bad note, you don't.

The common belief is: Method A just makes them imitate what they already know (Imitation), while Method B helps them discover new, amazing things they never knew they could do (Discovery).

The authors of this paper say: "Stop. That distinction is too simple."

They argue that the real question isn't how you teach (copying vs. rewards), but what you are actually teaching. Did you just help the musician play a song they were already capable of but kept messing up? Or did you actually give them the ability to play a song they physically couldn't play before?

They call these two things:

Capability Elicitation: Waking up a skill that was already there but sleeping.
Capability Creation: Giving the musician a brand new skill they didn't have.

The "Energy Landscape" Analogy

To explain this, the authors use a physics concept called Free Energy. Imagine the musician's mind is a hilly landscape.

The Valleys (Basins): These are the easy songs the musician plays naturally. They are deep, comfortable, and easy to fall into.
The Hills (Tails): These are songs the musician could play, but they are very high up. It takes a lot of effort (or a lot of tries) to get there.
The Walls (Barriers): These are songs separated by a massive, unclimbable wall. The musician cannot reach them just by walking around; they need a ladder or a bridge.
The Other Side of the World (Unsupported): These are songs that simply don't exist in the musician's universe yet.

How Training Works on This Map

Both "Copying" (SFT) and "Rewards" (RL) work by tilting the landscape.

If you give a reward for a song in a Valley, the valley gets deeper. The musician plays that song more often.
If you give a reward for a song on a Hill, the hill gets a ramp. The musician can now climb up to that song more easily.

The Crucial Point:
If the song was already in a Valley or on a Hill, you haven't created a new ability. You've just made an existing ability more reliable. This is Elicitation.

If the song was behind a Wall, and your training method somehow built a bridge or a ladder to get there, then you have created a new ability. This is Creation.

The Four Zones of Learning

The paper breaks down post-training into four specific scenarios based on this map:

1. The "Safe Zone" (Demonstration-Covered Elicitation)

The Scenario: The musician already knows the song perfectly but sometimes forgets the lyrics. You show them the sheet music (demonstrations).
The Result: They stop forgetting. They didn't learn a new song; they just stabilized an old one.
The Takeaway: Whether you use copying or rewards, if the answer was already easy to find, you are just polishing a rough gem, not creating a new one.

2. The "Hidden Gem" (Tail Reweighting)

The Scenario: The musician knows a complex jazz solo, but they only play it once in a million tries. It's hidden in the "Hills."
The Result: You use a reward system to say, "Wow, that jazz solo was great!" Suddenly, they start playing it all the time.
The Takeaway: It looks like magic because the performance jumped up. But the musician could have played it all along; they just needed a nudge to find it. This is still Elicitation, not creation.
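
To put rough numbers on "once in a million tries" (a back-of-the-envelope sketch that assumes each try is independent, which the paper does not state): if the chance of the solo on any single try is $p = 10^{-6}$, the chance of hearing it at least once in $N$ tries is

$$P(\text{at least once in } N \text{ tries}) = 1 - (1 - p)^N$$

For a single try that is one in a million. For a million tries it is $1 - (1 - 10^{-6})^{10^6} \approx 1 - e^{-1} \approx 0.63$. The solo is rare, but a large enough budget of attempts reaches it, which is what makes it a "Hill" rather than a "Wall."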

3. The "Bridge Builder" (Barrier-Crossing Discovery)

The Scenario: The musician needs to play a song that requires a sequence of steps they've never taken together. It's behind a wall.
The Result: You don't just give a reward at the end. You give rewards for steps along the way, or you let them use a tool (like a ladder) to cross the gap.
The Takeaway: This is Capability Creation. The training didn't just tilt the hill; it changed the terrain so the musician could reach a place they were previously blocked from.

4. The "Impossible Zone" (Unsupported Regimes)

The Scenario: You ask the musician to play a song that requires a violin, but they only have a guitar.
The Result: No amount of copying or rewarding will help. The "energy" required to play that song is infinite.
The Takeaway: You cannot "create" a capability here with just training. You need new information, a new instrument, or a different model entirely.

Why This Matters

The paper argues that we are often confused because we look at the method (SFT vs. RL) instead of the mechanism.

Myth: "RL is magic because it creates new skills."
Reality: RL only creates new skills if it is paired with tools, search, or interaction that helps the model cross "walls." If RL is just rewarding the model for things it could already do, it's just Elicitation.
Myth: "SFT is weak because it just copies."
Reality: If the "copying" data comes from a super-smart source (like a search engine or a stronger AI), SFT can teach the model things it never knew, effectively acting as Creation.

The Bottom Line

When we see an AI get better, we shouldn't just ask, "Did they use Reinforcement Learning?"

We should ask: "Did they just make the AI better at things it could already do, or did they actually give the AI the ability to do something it couldn't do before?"

The paper suggests that most of the time, we are just waking up skills that were already there (Elicitation), and we need to be very careful before claiming we have truly invented new capabilities (Creation).

Technical Summary: Distinguishing Capability Elicitation from Capability Creation in Post-Training

1. Problem Statement

The prevailing discourse in large language model (LLM) post-training often frames the distinction between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) as a dichotomy between imitation (SFT) and discovery (RL). This paper argues that this distinction is too coarse and obscures the fundamental mechanism of how post-training alters model behavior.

The core problem is determining whether a post-training procedure:

Elicits capabilities: Increases the probability of behaviors the pre-trained base model could already produce but did so unreliably.
Creates capabilities: Expands the set of behaviors the model can practically reach, enabling outcomes that were previously inaccessible.

The authors contend that labeling a method as "SFT" or "RL" does not determine its capability mechanism. Instead, the mechanism depends on the source of training signals (demonstrations vs. rewards), the generation of candidate behaviors, and whether the process expands the model's accessible support.

2. Methodology and Theoretical Framework

2.1 The Free-Energy Perspective

The authors formalize post-training using a free-energy framework, drawing an analogy to statistical physics ($F = E - TS$). They interpret post-training objectives as minimizing an effective free energy:
$F_x(q) = \mathbb{E}_{y \sim q(y|x)}[E(x, y)] + \beta \text{KL}[q(y|x) \parallel p_0(y|x)]$
Where:

$p_0(y|x)$ is the pre-trained reference distribution.
$q(y|x)$ is the post-trained distribution.
$E(x, y)$ is the effective energy derived from external signals.
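$\beta$ acts as a temperature (it multiplies the KL term just as $T$ multiplies $S$ in $F = E - TS$), controlling the trade-off between exploiting low-energy, preferred behaviors and staying close to the reference distribution (the KL constraint).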
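
The minimizer of this objective can be written in closed form. The short derivation below is the standard argument for KL-regularized objectives, sketched for a discrete output space; the normalizer $Z_x$ is notation assumed for this sketch rather than taken from the paper.

$$Z_x = \sum_y p_0(y|x)\, e^{-E(x, y)/\beta}, \qquad q^*(y|x) = \frac{p_0(y|x)\, e^{-E(x, y)/\beta}}{Z_x}$$

$$F_x(q) = \beta \sum_y q(y|x) \log \frac{q(y|x)}{p_0(y|x)\, e^{-E(x, y)/\beta}} = \beta\, \text{KL}[q(y|x) \parallel q^*(y|x)] - \beta \log Z_x$$

Since the KL term is non-negative and vanishes only at $q = q^*$, the free energy is minimized by $q^*$, with minimum value $-\beta \log Z_x$. Any output with $p_0(y|x) = 0$ also has $q^*(y|x) = 0$ whenever $E(x, y)$ is finite, which is why reweighting alone cannot place mass outside the reference support.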

Key Theoretical Insights:

SFT as Energy: SFT minimizes negative log-likelihood on demonstrations. In the free-energy picture this corresponds to an effective energy $E_{SFT}(x, y) = -\beta \log \frac{p_{demo}(y|x)}{p_0(y|x)}$, for which the minimizing distribution is the demonstration distribution itself. If a behavior appears in the demonstration distribution but has vanishing probability under the base model ($p_0(y|x) \to 0$), the energy becomes singular, breaking the local reweighting interpretation.
RL as Energy: RL maximizes rewards subject to a KL constraint. This corresponds to $E_{RL}(x, y) = -R(x, y)$. The optimal distribution is a Boltzmann reweighting of the reference: $q^*(y|x) \propto p_0(y|x) \exp(R(x, y)/\beta)$.
Local Reweighting: When updates remain close to the reference model (strong KL constraint), the primary effect is local reweighting of the existing distribution, not the creation of new behaviors.
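
The three insights above can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: the four-outcome distribution, the reward values, and the function names `boltzmann_reweight` and `sft_energy` are all hypothetical, chosen only to show the reweighting formula at work.

```python
import math

def boltzmann_reweight(p0, energy, beta):
    """Return q*(y) proportional to p0(y) * exp(-E(y) / beta) on a finite output set."""
    # Work in log space. Outputs with p0 == 0 keep zero weight whatever their energy.
    log_w = [(math.log(p) - e / beta) if p > 0 else -math.inf
             for p, e in zip(p0, energy)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [x / z for x in w]

def sft_energy(p_demo, p0, beta):
    """Effective SFT energy -beta * log(p_demo / p0), with the edge cases made explicit."""
    if p_demo == 0:
        return math.inf       # never demonstrated: infinitely penalized
    if p0 == 0:
        return -math.inf      # demonstrated but unsupported: the singular case
    return -beta * math.log(p_demo / p0)

# Hypothetical toy outputs: [basin, alternative, rare tail, unsupported]
p0 = [0.90, 0.0999, 0.0001, 0.0]
reward = [0.0, 0.0, 1.0, 1.0]          # the tail and unsupported outputs are "correct"
rl_energy = [-r for r in reward]       # E_RL = -R

q_weak_kl = boltzmann_reweight(p0, rl_energy, beta=0.1)    # weak KL constraint
q_strong_kl = boltzmann_reweight(p0, rl_energy, beta=1.0)  # strong KL constraint

# SFT energy built from a demonstration distribution that stays inside the support
p_demo = [0.1, 0.1, 0.8, 0.0]
beta = 0.5
q_sft = boltzmann_reweight(p0, [sft_energy(d, p, beta) for d, p in zip(p_demo, p0)], beta)

print("RL, beta=0.1:", [round(v, 4) for v in q_weak_kl])
print("RL, beta=1.0:", [round(v, 6) for v in q_strong_kl])
print("SFT         :", [round(v, 4) for v in q_sft])
```

In this toy setting a small $\beta$ moves most of the mass onto the rare tail output, a large $\beta$ barely moves it, and the unsupported output stays at exactly zero in every case even though it is rewarded. The SFT energy reproduces `p_demo` because the demonstrations stay within the base model's support.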

2.2 Accessible Support

To operationalize the distinction between elicitation and creation, the paper introduces accessible support: the set of behaviors a model can practically produce under finite sampling, optimization, and divergence budgets. This concept moves beyond strict mathematical support (non-zero probability) to practical reachability.

The authors categorize the behavioral landscape into four regimes based on the relationship between the target behavior and the base model's accessible support:

Demonstration-Covered Elicitation: The target behavior lies in a high-probability "basin" of the base model and is covered by demonstrations. Post-training stabilizes this existing behavior.
Tail Reweighting: The target behavior lies in the "tail" of the base model's distribution (unlikely to appear under greedy decoding or a handful of samples, but reachable under larger sampling budgets such as best-of-N). Post-training amplifies these rare but reachable behaviors.
Barrier-Crossing Discovery: The target behavior is separated from the base model's typical outputs by "barriers" (sequences of low-probability intermediate steps). Reaching these requires changing the trajectory-generation process (e.g., via search, tool use, or process supervision), not just reweighting.
Unsupported Regimes: The target behavior lies outside the base model's support ($p_0(y|x) = 0$). The effective energy becomes divergent. Post-training cannot create these capabilities without new information, tools, or architectural changes.
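
One crude way to make the four regimes concrete is to compare the base model's per-sample success probability with a sampling budget. The sketch below is an illustration under stated assumptions, not the paper's procedure: the thresholds, the function names `reach_probability` and `classify_regime`, and the assumption of independent samples are all hypothetical, and the paper defines barriers through low-probability intermediate steps rather than through a single number.

```python
import math

def reach_probability(p_success: float, budget: int) -> float:
    """Chance of at least one success in `budget` independent samples."""
    if p_success <= 0.0:
        return 0.0
    if p_success >= 1.0:
        return 1.0
    # 1 - (1 - p)^N, computed stably for very small p.
    return -math.expm1(budget * math.log1p(-p_success))

def classify_regime(p_success: float, budget: int,
                    basin_threshold: float = 0.5,
                    reach_threshold: float = 0.5) -> str:
    """Map a success probability and a sampling budget to a regime label.

    Both thresholds are arbitrary illustrative choices, not values from the paper.
    """
    if p_success <= 0.0:
        return "unsupported"
    if p_success >= basin_threshold:
        return "basin"
    if reach_probability(p_success, budget) >= reach_threshold:
        return "tail"
    return "barrier"

# A trajectory needing five intermediate steps, each taken with probability 0.01,
# has overall probability 0.01 ** 5: far too small for a million samples to find.
p_chain = 0.01 ** 5

for label, p, n in [("common answer", 0.9, 1),
                    ("rare answer", 1e-4, 100_000),
                    ("five unlikely steps", p_chain, 1_000_000),
                    ("zero probability", 0.0, 1_000_000_000)]:
    print(f"{label:>20}: {classify_regime(p, n)}")
```

The labels depend on the budget as much as on the model, which mirrors the point that accessible support is defined relative to finite sampling, optimization, and divergence budgets.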

3. Key Contributions

Reframing the SFT vs. RL Debate: The paper shifts the focus from algorithmic labels (SFT/RL) to the mechanism of capability change (elicitation vs. creation). It argues that SFT can surface behaviors the base model rarely produces when the demonstrations cover the tail, and that RL can amount to mere reweighting when constrained by a strong KL penalty.
Diagnostic Framework: By applying the free-energy perspective, the authors provide a mathematical tool to diagnose whether performance gains stem from local reweighting (within accessible support) or support expansion (crossing barriers).
The Four Regimes: The paper establishes a taxonomy for post-training outcomes, clarifying that "capability creation" is not a binary property of a method but a property of the interaction between the training signal, the candidate generation process, and the base model's reachability.
Clarification of "Creation": The authors argue that true capability creation (Barrier-Crossing Discovery) requires mechanisms that alter the trajectory generation process (e.g., search, interaction, tool use), rather than isolated reward maximization.

4. Results and Claims

The paper does not present new empirical benchmarks but offers a diagnostic analysis of existing post-training phenomena:

SFT is not inherently weak: If demonstrations contain trajectories generated by search or stronger models, SFT can elicit behaviors the base model rarely produces. The limitation of SFT is the coverage of the demonstration distribution, not the supervised objective itself.
RL is not inherently creative: If RL is applied with strong KL constraints and without search mechanisms, it merely reweights the base model's tail behaviors. Large benchmark gains in this regime reflect tail reweighting, not the creation of new capabilities.
The Singularity Boundary: The transition from elicitation to creation is marked by a singularity in the free-energy formulation. When $p_0(y|x) \to 0$ for a required behavior, the local reweighting view breaks down, indicating that the behavior is outside the accessible support.
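
A practical check suggested by the tail-reweighting claim is to compare the post-trained model's single-sample accuracy with the base model's success rate under a large sampling budget. The sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; applying it here is an assumption of this example, not a procedure described in the paper, and the sample counts are made up.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical counts for one problem: 1000 base-model samples, 3 of them correct.
n_samples, n_correct = 1000, 3

for k in (1, 16, 256):
    print(f"base model pass@{k}: {pass_at_k(n_samples, n_correct, k):.3f}")
```

If the base model's pass@k at a large k already matches what the post-trained model achieves with one sample, the gain is consistent with tail reweighting. A problem that the base model never solves at any feasible k, but the post-trained model does, is the kind of evidence a creation claim would need.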

5. Significance and Scope

The paper claims that distinguishing between capability elicitation and capability creation is essential for rigorous post-training research.

Modest Claims: The authors explicitly state they do not claim that SFT and RL are identical, nor that optimization dynamics are irrelevant. Instead, they argue that optimization dynamics must be interpreted relative to the regime (e.g., in barrier-crossing regimes, optimization must be coupled with trajectory-generation changes).
Scope: The framework is diagnostic. It clarifies that performance improvements alone are insufficient evidence of capability creation. To claim creation, one must demonstrate that the method expanded the model's reachable behavioral space, often through search, interaction, or new information, rather than simply reweighting existing probabilities.
Future Direction: The paper calls for future work to explicitly distinguish between these regimes. Researchers should report not just performance gains, but whether those gains reflect the stabilization of basins, the amplification of tails, or the crossing of barriers.

In summary, the paper posits that the central question in post-training is not "SFT or RL?" but "Does this method reweight what is already reachable, or does it expand what is reachable?"

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective