Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Imagine you are teaching a brilliant but slightly stubborn student how to solve a complex puzzle.

In the past, if the student got it wrong, you might have to stop the class, rewrite their textbook, and re-teach them the whole subject (this is like retraining an AI). Or, you might just say, "Try again, but this time write a long essay about why you failed" (this is like Self-Refine or Reflexion, where the AI talks to itself).

This paper introduces a new, surprisingly simple way to teach the AI: Just give it a score.

The Core Idea: "The Scoreboard Effect"

The authors call this In-Context Reinforcement Learning (ICRL). Here is how it works in plain English:

The Game: You give the AI a task (like solving a math problem or writing a story).
The Attempt: The AI tries to solve it.
The Score: Instead of writing a long paragraph of feedback, you simply give it a number.
- Did it get the math right? Score: 10.
- Did it get it wrong? Score: 0.
- Did it write a coherent story? Score: 8.
The Loop: You show the AI its previous attempts along with the scores it got. Then you ask it to try again.
The Magic: The AI looks at the history: "Oh, I got a 0 when I did it that way, but I got a 10 when I did it this way. I'll try to do more of the 'this' way."

The AI isn't "learning" in the traditional sense of changing its brain (its internal parameters stay the same). Instead, it is learning in the moment by looking at the history of its own mistakes and successes, just like a human learning from a scoreboard.

Creative Analogies

1. The Video Game Player
Think of the AI as a gamer playing a new level.

Old Way (Self-Refine): The gamer dies, pauses the game, and writes a 5-page diary entry about why they died, then reads it before trying again.
This Paper's Way (ICRL): The gamer dies, sees the "Game Over" screen with a score of "0," sees the replay of their last 10 tries with their scores, and immediately tries a different path because they realize, "Ah, jumping there gets me a 10, but running there gets me a 0." They get better purely by looking at the scoreboard.

2. The Chef and the Critic
Imagine a chef trying to invent a new recipe.

Old Way: The chef tastes the soup, writes a long critique in a notebook ("Too salty, needs more basil"), reads the notebook, and tries again.
This Paper's Way: The chef tastes the soup, gets a simple score from a critic (1 to 10), looks at the list of the last 5 soups they made and their scores, and adjusts the next one. They don't need the critic to write an essay; the number is enough to guide them.

Why Is This a Big Deal?

The paper tested this on very hard tasks:

Math Competitions: Solving Olympiad-level math problems.
Creative Writing: Writing stories that make sense.
Science Experiments: Figuring out how to change the state of water in a virtual lab.

The Results:
The AI using this "Scoreboard" method (ICRL) got significantly better at these tasks than methods where the AI talks to itself or tries random variations.

In the "Game of 24" (a math puzzle), the AI got it right 90% of the time just by looking at its previous scores, compared with 47% for Self-Refine.
It worked even when the "critic" giving the score was the AI itself!

The "Duck Test"

The authors use a famous saying: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."

They argue that even though we didn't program the AI to "do Reinforcement Learning," it acts exactly like a Reinforcement Learning agent. It explores, it exploits good ideas, it learns from failure, and it improves over time just by seeing a number.

The Bottom Line

This paper suggests that we don't need to build complex new systems or retrain massive models to make AI smarter. We just need to let them play the game, see the score, and try again. The ability to learn from a simple number is already built inside these models; we just needed to give them the right way to look at it.

It's a shift from "teaching the AI" to "letting the AI teach itself by watching its own scoreboard."

1. Problem Statement

Large Language Models (LLMs) currently rely on test-time scaling strategies to improve performance on novel tasks. Existing approaches generally fall into two categories:

Search-based methods: Techniques like Best-of-N, Tree of Thoughts (ToT), and Monte Carlo Tree Search (MCTS) explore multiple trajectories but rely on external heuristics or memory management rather than intrinsic learning.
Supervised/Verbal feedback methods: Approaches like Self-Refine and Reflexion use natural language self-correction or verbal feedback. However, these often suffer from hallucinated feedback, accumulation of errors, and a lack of true learning from failure, as they rely on the model's parametric knowledge to generate "better" instructions rather than optimizing a signal.

The Gap: There is a lack of evidence that LLMs can perform Reinforcement Learning (RL) purely during inference (without parameter updates) using only scalar rewards. The paper asks: Can LLMs perform in-context reinforcement learning (ICRL), optimizing scalar reward signals during inference to self-improve on diverse, open-ended tasks?

2. Methodology: ICRL Prompting

The authors introduce ICRL Prompting, a minimal framework designed to elicit RL behavior in LLMs during the inference phase.

Core Mechanism: The LLM acts as a policy $\pi_\theta$ that is conditioned on a context $C_t$. This context is constructed by concatenating:
1. The task description ( $s_{task}$ ).
2. A meta-instruction ( $s_{ICRL}$ ) guiding the model to explore or exploit.
3. A history of previous attempts (state-action pairs) and their associated numerical scalar rewards.
The Loop:
1. Generation: The LLM generates a response (action sequence).
2. Evaluation: A reward function $r$ (which can be rule-based, a separate model, or the same LLM acting as a judge) provides a scalar reward for the response.
3. Context Update: The response and its reward are appended to the context buffer.
4. Iteration: The LLM is prompted again with the expanded context to generate a new response, aiming to maximize the cumulative reward.
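As a compact restatement of the mechanism and loop above, one episode can be written as follows. The concatenation operator $\oplus$ and the symbols $a_i$ (the $i$-th response) and $r_i$ (its reward) are notation introduced here for illustration and may differ from the paper's own:

$$C_t = s_{task} \oplus s_{ICRL} \oplus (a_1, r_1) \oplus \dots \oplus (a_{t-1}, r_{t-1}), \qquad a_t \sim \pi_\theta(\cdot \mid C_t), \qquad r_t = r(a_t)$$

The parameters $\theta$ never change between episodes; only the context $C_t$ grows.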
Design Principles:
- Minimality: The framework excludes textual gradients, prioritized experience replay, or engineered modules. The only supervision is the scalar reward.
- Instruction Types: The meta-instruction can be set to Exploration (generate a different response), Exploitation (generate the best response based on past high rewards), or Autonomous (let the LLM decide).
- Reward Source: Rewards can be external (rule-based) or internal (self-evaluation by the same LLM), testing the hypothesis that "evaluation is easier than generation."
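The loop and design principles above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `reward_fn` are hypothetical callables standing in for the LLM and the reward source, the instruction strings are placeholders rather than the paper's wording, and the prompt layout is invented for the example.

```python
from typing import Callable, List, Tuple

Episode = Tuple[str, float]  # (response, scalar reward)

# Placeholder meta-instructions (assumed wording, not taken from the paper).
INSTRUCTIONS = {
    "explore": "Produce a response that differs from all previous ones.",
    "exploit": "Produce the best response you can, guided by the highest rewards.",
    "autonomous": "Decide for yourself whether to explore or exploit.",
}


def build_context(task: str, instruction: str, history: List[Episode]) -> str:
    """Concatenate task description, meta-instruction, and scored attempts."""
    parts = [task, instruction]
    for i, (response, reward) in enumerate(history, start=1):
        parts.append(f"Attempt {i}: {response}\nReward: {reward}")
    return "\n\n".join(parts)


def icrl_loop(
    task: str,
    generate: Callable[[str], str],
    reward_fn: Callable[[str], float],
    episodes: int,
    mode: str = "autonomous",
) -> List[Episode]:
    """Run generate -> evaluate -> append for a fixed number of episodes."""
    history: List[Episode] = []
    for _ in range(episodes):
        context = build_context(task, INSTRUCTIONS[mode], history)
        response = generate(context)           # Generation
        reward = reward_fn(response)           # Evaluation
        history.append((response, reward))     # Context update
    return history


# Toy stand-ins for the demo only. The "model" answers with the number of
# scored attempts it sees and ignores the rewards, so this exercises the
# plumbing of the loop rather than any learning behaviour.
def toy_generate(context: str) -> str:
    return str(context.count("Attempt "))


def toy_reward(response: str) -> float:
    return float(-abs(int(response) - 3))


history = icrl_loop("Guess the hidden number.", toy_generate, toy_reward, episodes=5)
best_response, best_reward = max(history, key=lambda episode: episode[1])
```

In a real setup `generate` would call an LLM and `reward_fn` would be a rule-based checker or an LLM judge; nothing else in the loop would need to change, which is the sense in which the framework is minimal.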

3. Key Contributions

Framework Introduction: Proposed ICRL Prompting, a minimal design that isolates the LLM's intrinsic capacity for in-context RL using only scalar rewards and state-action-reward tuples.
Evidence of Emergent RL: Provided strong empirical evidence that LLMs exhibit RL behaviors during inference, including:
- Maximization of scalar reward signals over time.
- Exploration-exploitation trade-offs.
- Performance degradation when context is truncated or rewards are zeroed out.
- "Duck test" validation: The behavior matches theoretical expectations of RL algorithms.
Superior Performance: Demonstrated that ICRL significantly outperforms state-of-the-art self-revision methods (Self-Refine, Reflexion) and search baselines (Best-of-N) across diverse benchmarks.

4. Experimental Results

The framework was evaluated on four distinct benchmarks:

Game of 24 (Math Logic):
- Setup: Solve math puzzles using 4 numbers.
- Reward: Generated by the same LLM (GPT-4.1) estimating the likelihood of reaching 24.
- Result: ICRL Preset achieved a 90% success rate, significantly outperforming Best-of-N (49%), Self-Refine (47%), and Reflexion (44%).
Creative Writing:
- Setup: Generate coherent 4-paragraph stories.
- Reward: Coherence score (1-10) from an LLM judge.
- Result: ICRL achieved a 93.81% win rate against Best-of-N and 86.32% against Self-Refine in length-controlled evaluations. ICRL showed continuous improvement, whereas Self-Refine plateaued and then declined.
ScienceWorld (Interactive Environment):
- Setup: Text-based scientific experiments with sparse rewards.
- Result: ICRL achieved a mean return of 88 (vs. 83 for Self-Refine and 74 for Reflexion), demonstrating robustness in sparse-reward environments.
Olympiad Mathematics (AIME & HMMT):
- Setup: Solve high-level math competition problems.
- Result: ICRL improved performance by 10–20 points over base models and outperformed Self-Refine and Reflexion across various open-source models (Qwen3, Llama-4, Phi-4).
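For the Game of 24 benchmark above, success is easy to verify mechanically even though the summary describes the in-loop reward as an LLM estimate. The sketch below is a hypothetical rule-based checker, assuming a solution is given as an arithmetic expression string. The function name `solves_24` and the input format are assumptions for illustration, not the paper's evaluation code.

```python
import ast
from collections import Counter
from fractions import Fraction
from typing import List

# Only the four basic arithmetic operators are allowed.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate the parsed expression exactly, recording each integer literal."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def solves_24(expression: str, numbers: List[int]) -> bool:
    """True if the expression uses each given number exactly once and equals 24."""
    try:
        tree = ast.parse(expression, mode="eval")
        used: List[int] = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == 24
```

Exact `Fraction` arithmetic avoids floating-point false negatives on puzzles such as 3, 3, 8, 8, where the intermediate value is 1/3.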

Ablation Studies:

Zero Rewards: Performance dropped to baseline levels, indicating that the reward signal is essential.
Short Context: Limiting the history to 3 episodes caused performance drops, indicating that the model benefits from a longer history of scored attempts.
Exploration Only: Without rewards, performance was poor, indicating that ICRL is not just "Best-of-N" sampling but genuine learning from failure.
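The first two ablations amount to simple transformations of the history before it is placed in the context. A minimal sketch, assuming the history is stored as a list of (response, reward) pairs; the function names are hypothetical:

```python
from typing import List, Tuple

Episode = Tuple[str, float]  # (response, scalar reward)


def zero_rewards(history: List[Episode]) -> List[Episode]:
    """Zero Rewards ablation: keep the responses but erase the reward signal."""
    return [(response, 0.0) for response, _ in history]


def truncate_history(history: List[Episode], max_episodes: int = 3) -> List[Episode]:
    """Short Context ablation: keep only the most recent episodes."""
    if max_episodes <= 0:
        return []
    return history[-max_episodes:]


full_history = [("a", 1.0), ("b", 4.0), ("c", 2.0), ("d", 9.0), ("e", 7.0)]
short = truncate_history(full_history)   # last three episodes only
flat = zero_rewards(full_history)        # same responses, every reward 0.0
```

Either transformed history can be passed to the prompt builder in place of the full one, so the rest of the loop stays unchanged and any performance difference can be attributed to the missing information.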

Mechanistic Analysis:
Attention head analysis on Qwen3-32B revealed that ~29% of attention heads showed statistically significant correlation with reward signals. Some heads attended to high-reward (successful) examples, while others attended to low-reward (failure) examples, mirroring classical RL learning from both success and failure.
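The summary does not say how the correlation or its significance was computed. One plausible way to run such an analysis, shown purely as an assumption, is to correlate each head's attention mass on every in-context episode with that episode's reward and to test significance with a permutation test. All names, the choice of Pearson correlation, and the 0.05 threshold below are illustrative.

```python
import random
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(
    x: Sequence[float], y: Sequence[float], trials: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value: how often shuffled rewards correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)


def fraction_significant(
    head_attention: List[Sequence[float]], rewards: Sequence[float], alpha: float = 0.05
) -> float:
    """Share of heads whose attention correlates significantly with reward."""
    significant = sum(
        1 for attention in head_attention
        if permutation_p_value(attention, rewards) < alpha
    )
    return significant / len(head_attention)


# Synthetic data (not from the paper): one head tracks reward, one is flat.
rewards = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
tracking_head = [r / 10 for r in rewards]
flat_head = [0.1] * len(rewards)
share = fraction_significant([tracking_head, flat_head], rewards)
```

Under this setup the sign of the correlation would separate heads that attend to high-reward examples (positive) from those that attend to low-reward examples (negative).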

5. Significance and Implications

Paradigm Shift: The paper challenges the notion that LLMs require retraining or complex external search mechanisms to improve. It suggests that scalar rewards alone are sufficient to trigger self-improvement in inference.
Test-Time Scaling: ICRL offers a new, efficient path for test-time scaling. Unlike search methods that scale linearly with compute, ICRL leverages the model's internal capacity to learn from its own trajectory, showing better scaling properties in terms of compute budget.
Autonomous Agents: This capability points toward a future where AI agents can autonomously explore, adapt, and self-improve in open-ended, real-world environments without human intervention or costly retraining cycles.
Validation of "Reward is Enough": The results support the hypothesis that intelligence can be understood as the maximization of expected cumulative reward, even within the static parameters of a pre-trained LLM during inference.