Reinforcement Learning with Conditional Expectation Reward

This paper proposes Conditional Expectation Reward (CER), a novel reinforcement learning method that utilizes the large language model itself as an implicit verifier to provide soft, graded reward signals, thereby overcoming the limitations of rule-based verification and enabling effective reasoning training across both mathematical and general free-form answer domains.

Changyi Xiao, Caijun Xu, Yixin Cao

Published Thu, 12 Ma

Imagine you are teaching a brilliant but very literal student (a Large Language Model) how to solve complex problems, from math equations to explaining why the sky is blue.

The Old Way: The "Strict Grader"

Traditionally, we used a method called RLVR (Reinforcement Learning with Verifiable Rewards). Think of this as a Strict Grader who only accepts answers that are exactly right.

  • How it works: If the question is "What is 2+2?", the grader only gives a gold star if the student writes "4".
  • The Problem: This works great for math or coding, where there is one clear right answer. But for general questions like "Is quantum physics deterministic?", the answer could be "No," "Not really," or "It's probabilistic."
  • The Failure: The Strict Grader is rigid. If the student writes "Not really," the grader gives a zero, treating it the same as a completely wrong answer like "Yes, it's a toaster." The student gets no help on how to improve, only a harsh "Wrong." This makes learning difficult for open-ended questions.

The New Way: The "Self-Reflective Mentor" (CER)

The authors of this paper propose a new method called Conditional Expectation Reward (CER). Instead of hiring an external grader, they teach the student to grade themselves using a clever trick.

Think of CER as a Self-Reflective Mentor. Here is how it works:

  1. The Scenario: The student generates an answer (let's say, "Quantum physics is not deterministic").
  2. The Question: The Mentor asks the student: "Given the question and the answer you just wrote, how likely are you to go on and generate the reference answer?" In other words, the reward is the model's own probability of producing the correct answer, conditioned on its response.
  3. The Logic:
    • If the student's answer was perfectly aligned with the truth, the model thinks, "Oh, I'm very confident in this. If I tried again, I'd definitely get the right answer." -> High Reward.
    • If the student's answer was close but slightly off, the model thinks, "Hmm, I'm pretty sure, but maybe I'd tweak a word." -> Medium Reward.
    • If the answer was wildly wrong, the model thinks, "No way. If I tried again, I wouldn't get the right answer from this starting point." -> Low Reward.
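The three-step loop above can be sketched in code. This is a minimal illustration, not the paper's exact formulation: the `toy_logprobs` stand-in (which just favors tokens already seen in context) replaces a real language model, and the length normalization is an assumption I've made so longer reference answers aren't penalized.

```python
import math

def sequence_logprob(model_logprobs, context, target_tokens):
    # Sum per-token log-probs of the target sequence under the model,
    # where model_logprobs(context, token) returns log p(token | context).
    total = 0.0
    ctx = list(context)
    for tok in target_tokens:
        total += model_logprobs(ctx, tok)
        ctx.append(tok)
    return total

def cer_reward(model_logprobs, question, generated, reference):
    # CER-style reward: likelihood of the reference answer conditioned on
    # the question AND the model's own answer, length-normalized.
    context = question + generated
    lp = sequence_logprob(model_logprobs, context, reference)
    return math.exp(lp / len(reference))

# Toy stand-in for a real LM: tokens already present in the context are
# judged likely (0.9), unseen tokens unlikely (0.1).
def toy_logprobs(context, token):
    return math.log(0.9) if token in context else math.log(0.1)

q = ["is", "quantum", "physics", "deterministic", "?"]
ref = ["no", "it", "is", "probabilistic"]

close = cer_reward(toy_logprobs, q, ["not", "really", "probabilistic"], ref)
wrong = cer_reward(toy_logprobs, q, ["yes", "toaster"], ref)
```

Even with this crude stand-in, the "close but slightly off" answer earns a noticeably higher reward than the wildly wrong one, because it puts the model in a state from which the reference answer is easier to reproduce.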

Why is this a game-changer?

1. The "Shades of Gray" vs. "Black and White"
The old method was Black and White (Right or Wrong). CER is a Gradient. It gives partial credit.

  • Analogy: Imagine a dartboard. The old method only gives points if you hit the bullseye. If you miss by an inch, you get zero. CER gives you points for being close to the bullseye, encouraging you to aim better next time.
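The dartboard contrast can be made concrete with a toy scoring rule (the linear falloff and 20 cm radius are illustrative choices, not anything from the paper):

```python
def strict_reward(distance_cm):
    # RLVR-style grader: bullseye or nothing.
    return 1.0 if distance_cm == 0 else 0.0

def graded_reward(distance_cm, radius_cm=20.0):
    # CER-style grader: partial credit that shrinks with distance.
    return max(0.0, 1.0 - distance_cm / radius_cm)

throws = [0.0, 2.5, 10.0, 25.0]
strict = [strict_reward(d) for d in throws]   # [1.0, 0.0, 0.0, 0.0]
graded = [graded_reward(d) for d in throws]   # [1.0, 0.875, 0.5, 0.0]
```

Under the strict rule, a near-miss and a throw that hits the wall look identical to the learner; under the graded rule, every improvement in aim shows up as a higher score.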

2. No External Tools Needed
Usually, to check if a general answer is good, you need a human or a special AI tool (a "Verifier") to read it. That's expensive and slow.

  • Analogy: CER is like a musician who can hear a note and instantly know if it's in tune without needing a tuner app. The model uses its own internal "ear" to judge its own work.

3. It Handles Variety
In the real world, there are many ways to say the same thing.

  • Analogy: If the answer is "The sky is blue," the old grader might reject "The sky is azure" or "It's blue." CER understands that these are all "close enough" to the truth and rewards the student for being semantically correct, even if the words are different.

The Results

The researchers tested this on both math problems and general knowledge (like physics and finance).

  • On Math: It performed just as well as the strict, rule-based methods.
  • On General Topics: It crushed the competition. It learned faster and better because it wasn't discouraged by "almost right" answers.

Summary

CER is a smarter way to train AI. Instead of a harsh teacher who only accepts perfect answers, it uses a self-reflective mentor that gives graded feedback. It tells the AI, "You're getting warmer," rather than just "You're wrong." This allows AI to learn complex, open-ended reasoning tasks much more effectively, without needing expensive external tools to check its work.