Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

This paper introduces In-Context RLVR, a method that uses a model's own in-context learning ability to measure "Demonstration Utility" via Evidence Gain. This implicitly reweights rewards during Reinforcement Learning with Verifiable Rewards (RLVR) training, prioritizing high-quality reasoning traces over solutions that are merely correct but flawed.

Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang

Published Wed, 11 Ma

Here is an explanation of the paper "Good Reasoning Makes Good Demonstrations" using simple language and creative analogies.

The Big Problem: "Lucky Guesses" vs. "Real Genius"

Imagine you are teaching a student (an AI) how to solve math problems. You give them a test, and they get the right answer. You say, "Great job!" and give them a gold star.

But here's the catch: The student might have gotten the answer right by accident.

  • Student A solved it step-by-step, explained their logic, and showed their work.
  • Student B guessed the number, wrote down a bunch of nonsense, but somehow the final number matched the answer key.

In standard AI training (called RLVR), the computer treats both students exactly the same because the result is correct. It gives both a gold star. The problem? If the AI keeps getting gold stars for "Student B's" messy, lucky guesses, it learns that messy logic is fine as long as the answer is right. Eventually, the AI gets worse at actually thinking, even if it gets lucky on simple tests.

The Solution: "The Best Teacher is a Good Example"

The authors of this paper realized something brilliant: Not all correct answers are equally good teachers.

  • If you show a student a messy, confusing solution that happened to be right, it's a bad example. It confuses them.
  • If you show them a clear, logical, step-by-step solution, it's a great example. It teaches them how to think.

They call this "Demonstration Utility." It's basically asking: "If I use this solution as a teaching example for a future problem, will it help the student learn, or will it confuse them?"

The Magic Trick: "The Evidence Gain"

Usually, to figure out which solution is better, you need a human expert or a super-smart judge to grade the steps. That takes forever and costs a lot of money.

The authors found a clever shortcut. They realized the AI student already knows how to learn from examples. This is called In-Context Learning (ICL).

Here is their trick:

  1. Take a messy solution (Student B) and a clean solution (Student A).
  2. Ask the AI: "If I show you Student B's messy work as an example, how much easier is it for you to solve a NEW problem?"
  3. Then ask: "If I show you Student A's clean work, how much easier is it?"

How much the AI's confidence improves when a given solution is used as context is called "Evidence Gain." Comparing the gains tells you which solution is the better teacher.

  • High Evidence Gain: The solution was so clear and logical that it made the AI smarter immediately. (This is a "Good Teacher").
  • Low Evidence Gain: The solution was messy or lucky, so it didn't help the AI learn anything new. (This is a "Bad Teacher").
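As a rough sketch of this idea: Evidence Gain can be thought of as the difference in the model's confidence in (log-probability of) a correct solution with and without the demonstration in its context. The snippet below fakes the language model with a toy word-overlap scorer, purely for illustration; a real implementation would query an actual model's token log-probabilities. All names here are hypothetical, not from the paper.

```python
def _words(text: str) -> list[str]:
    """Lowercase tokens with trailing punctuation stripped."""
    return [w.strip(".,:;?!") for w in text.lower().split()]

def toy_logprob(context: str, target: str) -> float:
    """Stand-in for a language model's summed log-probability of
    `target` given `context`. Here we fake it: the score rises when
    context words overlap with the target, mimicking a model that
    benefits from a relevant demonstration."""
    ctx_words = set(_words(context))
    tgt_words = _words(target)
    hits = sum(1 for w in tgt_words if w in ctx_words)
    # Base cost per token, plus a bonus per overlapping word.
    return -1.0 * len(tgt_words) + 0.5 * hits

def evidence_gain(demo: str, problem: str, solution: str) -> float:
    """Evidence Gain: how much prepending `demo` raises the model's
    (fake) log-probability of producing `solution` for `problem`."""
    with_demo = toy_logprob(demo + "\n" + problem, solution)
    without_demo = toy_logprob(problem, solution)
    return with_demo - without_demo

clean_demo = "To add fractions, rewrite them over a common denominator, then add the numerators."
messy_demo = "Uh the answer is 7 because reasons."
problem = "What is 1/2 + 1/3?"
solution = "Rewrite over a common denominator 6: 3/6 + 2/6 = 5/6, so the answer is 5/6."

print(evidence_gain(clean_demo, problem, solution))  # the clear demo scores higher
print(evidence_gain(messy_demo, problem, solution))  # the messy demo scores lower
```

The key point is that no human grader appears anywhere: the only "judge" is how much the demonstration changes the model's own confidence.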

The Method: "In-Context RLVR"

So, how do they use this without hiring a human judge?

They changed the training process slightly. Instead of just asking the AI to solve a problem and checking the answer, they do this:

  1. Before the AI tries to solve a new math problem, they prepend (stick at the front) a random "example solution" from their database.
  2. The AI tries to solve the problem while looking at that example.
  3. If the example was a "Good Teacher" (high quality), the AI learns faster and gets the answer right more often.
  4. If the example was a "Bad Teacher" (low quality), the AI struggles more.
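The steps above can be sketched as a single rollout function. Everything here, including the function names and the toy "model" that only succeeds when its context contains an instructive demonstration, is an illustrative assumption, not the paper's implementation:

```python
import random

def in_context_rollout(problem, demos, solve, check_answer):
    """One hypothetical In-Context RLVR rollout: prepend a sampled
    demonstration to the problem, generate a solution, and score it
    with the usual verifiable (answer-matching) reward."""
    demo = random.choice(demos)
    prompt = demo + "\n\n" + problem["question"]
    solution = solve(prompt)
    reward = 1.0 if check_answer(solution, problem["answer"]) else 0.0
    return solution, reward

# Toy stand-ins: a "model" that solves the problem only when its
# context contains a genuinely instructive demonstration.
problem = {"question": "What is 1/2 + 1/3?", "answer": "5/6"}
good_demos = ["To add fractions, rewrite them over a common denominator."]
bad_demos = ["Uh, the answer is just 7, trust me."]

def toy_solve(prompt):
    return "5/6" if "common denominator" in prompt else "7"

def exact_match(solution, answer):
    return solution == answer

_, reward_good = in_context_rollout(problem, good_demos, toy_solve, exact_match)
_, reward_bad = in_context_rollout(problem, bad_demos, toy_solve, exact_match)
print(reward_good, reward_bad)  # good demos earn reward, bad demos don't
```

Because rewards flow more readily when the prepended example actually helps, demonstration quality gets folded into the reward signal for free, with no extra judge model.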

The Secret Sauce:
Because the AI learns better when shown good examples, the training process automatically gives more credit (rewards) to the AI when it generates solutions that look like those good examples.

It's like a gym where the machine automatically adjusts the weight. If you lift a weight while wearing "Good Teacher" glasses, the machine thinks you are stronger and gives you a bigger reward. If you wear "Bad Teacher" glasses, the reward is smaller.

The Result: Smarter, Not Just Luckier

By using this method, the AI stops trying to "hack" the system with lucky guesses. It starts focusing on clear, logical reasoning because that's what helps it learn from the examples it sees during training.

In short:

  • Old Way: "You got the answer right? Here's a cookie." (Even if you cheated).
  • New Way: "You got the answer right, AND your explanation helped me learn? Here's a HUGE cookie. If you guessed, here's a tiny cookie."

Why This Matters

This is a huge deal because it doesn't require expensive human judges or complex new software. It just uses the AI's own ability to learn from examples to grade its own work. It makes the AI smarter, more reliable, and better at solving hard problems, all while saving time and money.

The Takeaway: Good reasoning isn't just about getting the right answer; it's about teaching yourself (and others) how to get there. This paper teaches AI to value the journey of reasoning, not just the destination.