Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

This paper introduces In-Context RLVR, a method that uses a model's own in-context learning ability to measure "Demonstration Utility" via Evidence Gain. This implicitly reweights rewards during Reinforcement Learning with Verifiable Rewards (RLVR) training, prioritizing high-quality reasoning traces over solutions that are merely correct but flawed.

Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang

Published Wed, 11 Ma

Here is an explanation of the paper "Good Reasoning Makes Good Demonstrations" using simple language and creative analogies.

The Big Problem: "Lucky Guesses" vs. "Real Genius"

Imagine you are teaching a student (an AI) how to solve math problems. You give them a test, and they get the right answer. You say, "Great job!" and give them a gold star.

But here's the catch: The student might have gotten the answer right by accident.

  • Student A solved it step-by-step, explained their logic, and showed their work.
  • Student B guessed the number, wrote down a bunch of nonsense, but somehow the final number matched the answer key.

In standard AI training (called RLVR), the computer treats both students exactly the same because the result is correct. It gives both a gold star. The problem? If the AI keeps getting gold stars for "Student B's" messy, lucky guesses, it learns that messy logic is fine as long as the answer is right. Eventually, the AI gets worse at actually thinking, even if it gets lucky on simple tests.

The Solution: "The Best Teacher is a Good Example"

The authors of this paper realized something brilliant: Not all correct answers are equally good teachers.

  • If you show a student a messy, confusing solution that happened to be right, it's a bad example. It confuses them.
  • If you show them a clear, logical, step-by-step solution, it's a great example. It teaches them how to think.

They call this "Demonstration Utility." It's basically asking: "If I use this solution as a teaching example for a future problem, will it help the student learn, or will it confuse them?"

The Magic Trick: "The Evidence Gain"

Usually, to figure out which solution is better, you need a human expert or a super-smart judge to grade the steps. That takes forever and costs a lot of money.

The authors found a clever shortcut. They realized the AI student already knows how to learn from examples. This is called In-Context Learning (ICL).

Here is their trick:

  1. Take a messy solution (Student B) and a clean solution (Student A).
  2. Ask the AI: "If I show you Student B's messy work as an example, how much easier is it for you to solve a NEW problem?"
  3. Then ask: "If I show you Student A's clean work, how much easier is it?"

How much the AI's confidence improves when a given solution is used as context is called "Evidence Gain." Comparing the gains tells you which solution is the better teacher.

  • High Evidence Gain: The solution was so clear and logical that it made the AI smarter immediately. (This is a "Good Teacher").
  • Low Evidence Gain: The solution was messy or lucky, so it didn't help the AI learn anything new. (This is a "Bad Teacher").
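As a rough sketch of this idea: Evidence Gain can be thought of as the difference in the model's confidence in (log-probability of) a correct solution with and without the demonstration in its context. The snippet below fakes the language model with a toy word-overlap scorer, purely for illustration; a real implementation would query an actual model's token log-probabilities. All names here are hypothetical, not from the paper.

```python
def _words(text: str) -> list[str]:
    """Lowercase tokens with trailing punctuation stripped."""
    return [w.strip(".,:;?!") for w in text.lower().split()]

def toy_logprob(context: str, target: str) -> float:
    """Stand-in for a language model's summed log-probability of
    `target` given `context`. Here we fake it: the score rises when
    context words overlap with the target, mimicking a model that
    benefits from a relevant demonstration."""
    ctx_words = set(_words(context))
    tgt_words = _words(target)
    hits = sum(1 for w in tgt_words if w in ctx_words)
    # Base cost per token, plus a bonus per overlapping word.
    return -1.0 * len(tgt_words) + 0.5 * hits

def evidence_gain(demo: str, problem: str, solution: str) -> float:
    """Evidence Gain: how much prepending `demo` raises the model's
    (fake) log-probability of producing `solution` for `problem`."""
    with_demo = toy_logprob(demo + "\n" + problem, solution)
    without_demo = toy_logprob(problem, solution)
    return with_demo - without_demo

clean_demo = "To add fractions, rewrite them over a common denominator, then add the numerators."
messy_demo = "Uh the answer is 7 because reasons."
problem = "What is 1/2 + 1/3?"
solution = "Rewrite over a common denominator 6: 3/6 + 2/6 = 5/6, so the answer is 5/6."

print(evidence_gain(clean_demo, problem, solution))  # the clear demo scores higher
print(evidence_gain(messy_demo, problem, solution))  # the messy demo scores lower
```

The key point is that no human grader appears anywhere: the only "judge" is how much the demonstration changes the model's own confidence.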

The Method: "In-Context RLVR"

So, how do they use this without hiring a human judge?

They changed the training process slightly. Instead of just asking the AI to solve a problem and checking the answer, they do this:

  1. Before the AI tries to solve a new math problem, they prepend (stick at the front) a random "example solution" from their database.
  2. The AI tries to solve the problem while looking at that example.
  3. If the example was a "Good Teacher" (high quality), the AI learns faster and gets the answer right more often.
  4. If the example was a "Bad Teacher" (low quality), the AI struggles more.
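The steps above can be sketched as a single rollout function. Everything here, including the function names and the toy "model" that only succeeds when its context contains an instructive demonstration, is an illustrative assumption, not the paper's implementation:

```python
import random

def in_context_rollout(problem, demos, solve, check_answer):
    """One hypothetical In-Context RLVR rollout: prepend a sampled
    demonstration to the problem, generate a solution, and score it
    with the usual verifiable (answer-matching) reward."""
    demo = random.choice(demos)
    prompt = demo + "\n\n" + problem["question"]
    solution = solve(prompt)
    reward = 1.0 if check_answer(solution, problem["answer"]) else 0.0
    return solution, reward

# Toy stand-ins: a "model" that solves the problem only when its
# context contains a genuinely instructive demonstration.
problem = {"question": "What is 1/2 + 1/3?", "answer": "5/6"}
good_demos = ["To add fractions, rewrite them over a common denominator."]
bad_demos = ["Uh, the answer is just 7, trust me."]

def toy_solve(prompt):
    return "5/6" if "common denominator" in prompt else "7"

def exact_match(solution, answer):
    return solution == answer

_, reward_good = in_context_rollout(problem, good_demos, toy_solve, exact_match)
_, reward_bad = in_context_rollout(problem, bad_demos, toy_solve, exact_match)
print(reward_good, reward_bad)  # good demos earn reward, bad demos don't
```

Because rewards flow more readily when the prepended example actually helps, demonstration quality gets folded into the reward signal for free, with no extra judge model.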

The Secret Sauce:
Because the AI learns better when shown good examples, the training process automatically gives more credit (rewards) to the AI when it generates solutions that look like those good examples.

It's like a gym where the machine automatically adjusts the weight. If you lift a weight while wearing "Good Teacher" glasses, the machine thinks you are stronger and gives you a bigger reward. If you wear "Bad Teacher" glasses, the reward is smaller.

The Result: Smarter, Not Just Luckier

By using this method, the AI stops trying to "hack" the system with lucky guesses. It starts focusing on clear, logical reasoning because that's what helps it learn from the examples it sees during training.

In short:

  • Old Way: "You got the answer right? Here's a cookie." (Even if you cheated).
  • New Way: "You got the answer right, AND your explanation helped me learn? Here's a HUGE cookie. If you guessed, here's a tiny cookie."

Why This Matters

This is a huge deal because it doesn't require expensive human judges or complex new software. It just uses the AI's own ability to learn from examples to grade its own work. It makes the AI smarter, more reliable, and better at solving hard problems, all while saving time and money.

The Takeaway: Good reasoning isn't just about getting the right answer; it's about teaching yourself (and others) how to get there. This paper teaches AI to value the journey of reasoning, not just the destination.