Imagine you have a very smart student (a Large Language Model, or LLM) who is great at memorizing facts but sometimes struggles to solve complex, step-by-step puzzles. You want to teach them how to solve a specific type of puzzle: Causal Reasoning.
Think of causal reasoning like figuring out cause-and-effect in a Rube Goldberg machine. If you push the first domino (cause), which one falls last? If you remove a gear (intervention), does the machine stop? If the machine had been set up differently yesterday (counterfactual), what would have happened?
This paper investigates a new teaching method called RLVR (Reinforcement Learning with Verifiable Rewards) and compares it to the old standard, SFT (Supervised Fine-Tuning).
Here is the breakdown using simple analogies:
1. The Two Teaching Methods
- SFT (The "Answer Key" Teacher): The teacher shows the student a problem and immediately gives the correct answer. The student tries to memorize the pattern: "When I see X, I write Y." It's like studying for a test by only looking at the back of the book.
- RLVR (The "Coach" with a Scoreboard): The teacher gives the student a problem and lets them try to solve it step-by-step. If the final answer is right, the student gets a "point" (reward). If it's wrong, they get zero. The student learns by trying, failing, and realizing, "Oh, I messed up the math in step 3," and then trying again. The reward is "verifiable" because the answer can be checked by a computer.
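In spirit, the "scoreboard" is just an automatic answer checker. Here is a minimal sketch of a verifiable reward, assuming a simple exact-match rule (the function name and matching logic are illustrative assumptions, not the paper's actual verifier, which may parse numbers or expressions more carefully):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer
    matches the checkable ground truth, else 0.0.

    Hypothetical sketch -- a real verifier would likely normalize
    formatting, parse numeric values, etc.
    """
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# The model gets no partial credit for a nice-looking derivation:
# only the checkable final answer earns the point.
reward = verifiable_reward(" 0.25 ", "0.25")   # correct -> 1.0
penalty = verifiable_reward("0.3", "0.25")     # wrong   -> 0.0
```

The key design point is that the reward comes from a program, not a human grader, so the "coach" can score millions of attempts cheaply and consistently.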
2. The Experiment: The "Causal Ladder"
The researchers built a giant playground of puzzles called RLCausal. They created three levels of difficulty, like rungs on a ladder:
- Level 1: Association (The "What's happening?" rung): "If I see smoke, is there fire?" (Observation).
- Level 2: Intervention (The "What if I change it?" rung): "If I smother the fire, does the smoke stop?" (Action).
- Level 3: Counterfactual (The "What if things were different?" rung): "If I hadn't lit the match yesterday, would the room be dark today?" (Hypothetical).
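The three rungs can be made concrete with a toy structural causal model of the smoke/fire analogy. This is an illustrative sketch only; the variable names and mechanisms are assumptions, not the paper's actual RLCausal tasks:

```python
# Toy structural causal model (SCM): match -> fire -> smoke.
# Purely illustrative -- not taken from the RLCausal benchmark.

def mechanisms(match_lit, fire_override=None):
    """Run the causal mechanisms. fire_override simulates do(fire=...)."""
    fire = match_lit if fire_override is None else fire_override
    smoke = fire  # in this toy world, smoke follows fire deterministically
    return fire, smoke

# Level 1 -- Association: we observe smoke; in this model that implies fire.
fire, smoke = mechanisms(match_lit=True)
assert smoke and fire

# Level 2 -- Intervention: smother the fire, do(fire=False).
# Smoke stops even though the match was lit.
fire, smoke = mechanisms(match_lit=True, fire_override=False)
assert not smoke

# Level 3 -- Counterfactual: given that we lit the match yesterday,
# rerun the same mechanisms in a world where we hadn't.
fire_cf, smoke_cf = mechanisms(match_lit=False)
assert not smoke_cf
```

Each rung asks a strictly harder question of the same machinery: Level 1 only reads off correlations, Level 2 overrides a mechanism, and Level 3 reruns history with one input changed.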
They tested models of three sizes: 3B, 7B, and 32B parameters (think: a small, eager but confused puppy; a smart teenager; and a brilliant professor).
3. The Big Discovery: The "Cold Start" Problem
The most important finding is that RLVR only works if the student already knows how to think.
- The Small Model (3B): Imagine trying to teach a puppy to play chess using the "Coach" method. The puppy doesn't know the rules. It tries to move pieces randomly, gets zero points, and eventually gives up, just guessing the answer.
  - Result: The 3B model failed. It couldn't learn the reasoning steps because it didn't have the basic "reasoning muscles" to start with.
- The Big Models (7B & 32B): These models already had some logic skills. When the "Coach" (RLVR) started giving them feedback, they didn't just memorize answers; they learned better strategies.
  - Result: They got much better at solving complex puzzles than the "Answer Key" students (SFT).
4. What Did RLVR Actually Fix?
When the big models used RLVR, they didn't just get lucky. They changed how they thought:
- From "Brute Force" to "Step-by-Step": Before, they tried to write out a massive, complicated formula all at once (like trying to eat a whole pizza in one bite). This often led to dropping ingredients (math errors). After RLVR, they learned to eat slice by slice (incremental marginalization), which is much safer.
- Fewer "Hallucinations": They stopped making up facts or forgetting to check the rules of the game.
- Better Generalization: If you trained them on Level 1 puzzles, they could actually solve Level 2 and Level 3 puzzles better than the SFT students. They learned the skill of reasoning, not just the specific answers.
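The "slice by slice" idea (incremental marginalization) has a precise meaning for probability questions: instead of enumerating every combination of variables in one giant sum, you sum out one variable at a time. A sketch on a made-up three-variable chain (the numbers and variable names are invented for illustration):

```python
# Chain A -> B -> C with made-up probability tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

# "Whole pizza" approach: enumerate the full joint distribution at once.
# Error-prone for an LLM writing it out by hand -- many terms to track.
p_c1_joint = sum(
    p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][1]
    for a in (0, 1)
    for b in (0, 1)
)

# "Slice by slice" approach: marginalize out A first to get P(B),
# then use P(B) to get P(C). Each step is small and checkable.
p_b = {b: sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)}
p_c1_incr = sum(p_b[b] * p_c_given_b[b][1] for b in (0, 1))

assert abs(p_c1_joint - p_c1_incr) < 1e-12  # same answer, safer route
```

Both routes give the same probability; the incremental route just breaks the arithmetic into small, verifiable steps, which is exactly the behavior the reward signal appears to encourage.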
5. The "Counterfactual" Wall
There was one level where everyone struggled: Counterfactuals (the "What if" questions). Even the big models had a hard time.
- Why? These questions require building a "twin world" in your head where you change the past but keep the present. It's a very heavy mental lift.
- The Twist: Even when the researchers gave the models a hint on how to build this twin world, they still struggled. This suggests that for the hardest types of logic, current AI models just aren't there yet, no matter how much they are trained.
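The standard recipe for building that "twin world" is a three-step procedure often called abduction, action, prediction: infer the hidden background facts from what you observed, change the past action, then rerun the world with the background facts held fixed. A minimal sketch, where the mechanism `y = x XOR u` is a made-up example, not the paper's:

```python
# Twin-world counterfactual: abduction -> action -> prediction.
# The mechanism below is an invented illustration.

def mechanism(x: int, u: int) -> int:
    """Outcome depends on the chosen action x and hidden noise u."""
    return x ^ u

# Factual world: we took action x=1 and observed outcome y=1.
x_obs, y_obs = 1, 1

# 1) Abduction: infer the hidden noise consistent with what we saw.
u = next(u for u in (0, 1) if mechanism(x_obs, u) == y_obs)

# 2) Action: in the twin world, change the past (take x=0 instead)...
# 3) Prediction: ...but keep the SAME noise u carried over from reality.
y_counterfactual = mechanism(0, u)

assert u == 0 and y_counterfactual == 0
```

The hard part for a model is step 1: it must hold the real world's hidden details fixed while simultaneously imagining a different past, which is precisely the "heavy mental lift" where even the 32B model hit a wall.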
The Takeaway
RLVR is a powerful tool, but it's not magic.
- If your AI model is too small or too dumb to begin with, throwing it into a "trial and error" training camp won't help; it will just get confused.
- But, if you have a smart model that already has a spark of reasoning ability, RLVR is like a personal trainer that helps it build muscle, refine its technique, and solve problems it never could have solved before.
In short: You can't teach a toddler to do calculus with a reward system. But if you have a smart teenager, a reward system can turn them into a math genius.