$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

The Big Idea: The "Do-Over" Button for AI

Imagine you are taking a very difficult math test. You start solving a problem, but halfway through, you realize you made a tiny mistake in the first step.

Old AI (Standard Models): Once it starts down that wrong path, it keeps going. It tries to "patch" the mistake by writing more and more confusing steps, hoping to stumble onto the right answer at the end. It's like driving in the wrong direction, realizing you're lost, but instead of turning around, you just drive faster and hope you accidentally end up at your destination. This is called "overthinking."
The New AI (Re2): This model has a special superpower: The "Do-Over" Button. If it realizes it's on a bad path, it stops, says, "Wait, this isn't working," and starts the problem from scratch with a fresh mind.

This paper introduces a new training method called Re2 that teaches AI models when to give up on a bad idea and start over, rather than stubbornly pushing through to a wrong answer.

The Problem: The "Wrong Turn" Trap

The researchers discovered something funny but frustrating about current AI models: longer answers aren't always better.

In fact, when an AI makes a mistake early on, the longer it tries to fix it, the less likely it is to get the right answer.

Analogy: Imagine you are writing a story. If you write the first sentence wrong, and you keep writing 50 more pages trying to make the plot work, the story will likely be a mess. It's better to delete the first sentence and start the story over.
The Data: The paper shows that standard AI models often get stuck in "dead ends." They generate thousands of words of reasoning that lead nowhere, wasting time and energy.

The Solution: Re2 (Reinforcement Learning with Re-solving)

The authors created a training system called Re2. Think of it as a coach training a student not just to solve problems, but to know when to quit a bad strategy.

Here is how the training works, using a Video Game Analogy:

The Level (The Math Problem): The AI is given a hard math problem.
The Strategy (The Path): The AI starts solving it.
The Choice: At any point, the AI has two choices:
- Option A: Keep going and try to finish the level.
- Option B: Hit "Restart Level" (Re-solve) and try a completely new approach.
The Reward System (The Score):
- If the AI finishes the level correctly, it gets 100 points.
- If it finishes incorrectly, it gets 0 points.
- The Magic: If the AI chooses to "Restart," the game calculates: "How likely is this player to beat the level if they start over?" If restarting gives them a better chance of winning than continuing on the current bad path, the AI gets high points for choosing to restart.

By playing this "game" thousands of times, the AI learns a crucial lesson: It is smarter to admit defeat and start over than to stubbornly push forward on a losing path.

The Results: Smarter, Not Just Harder

The researchers tested this new AI on famous math competitions (like AIME and AMC) and science questions.

The "Stubborn" AI (Standard RL): Tries to solve everything in one long chain of thought. It often gets confused and gives wrong answers.
The "Re2" AI: Frequently hits the "Restart" button when it senses trouble.
- Success Rate: The Re2 models solved significantly more problems correctly than the standard models.
- Efficiency: Restarting costs some extra words, but the AI makes better overall use of its effort because it spends less of it extending dead-end reasoning.

Why This Matters

This is a shift in how we think about Artificial Intelligence.

Old Way: "Make the AI think longer and harder." (Like forcing a student to study for 10 hours straight without a break).
New Way (Re2): "Teach the AI to think flexibly." (Like teaching a student to recognize when they are confused and to ask for help or try a different method).

Summary in a Nutshell

The paper argues that being able to change your mind is a sign of intelligence.

Current AI models are like students who are too afraid to admit they are wrong, so they keep writing gibberish until they run out of time. The Re2 method teaches AI to be humble: to say, "I'm going down the wrong path," and to hit the Reset button. This simple ability to "re-solve" leads to much smarter, more accurate, and more reliable reasoning.

1. Problem Statement

While Reinforcement Learning with Verifiable Rewards (RLVR) has successfully enhanced the reasoning capabilities of Large Language Models (LLMs) by encouraging longer Chains of Thought (CoT), existing models still suffer from significant inefficiencies:

Overthinking and Low-Quality Steps: Models often generate unnecessary or low-quality reasoning steps, leading to "overthinking" where they continue down a flawed path rather than correcting course.
Inability to Recover from Early Errors: The paper's analysis reveals a critical limitation: once an LLM commits to a suboptimal or incorrect initial reasoning direction, it rarely recovers, even if it generates significantly more tokens. Longer CoTs often correlate with lower accuracy because they stem from early critical mistakes that the model cannot self-correct within a single trajectory.
Rigid Single-Chain Paradigm: Standard RLVR methods force the model to commit to a final answer within a single generated trajectory. If the initial steps are wrong, the model is penalized for not finding the answer, rather than being rewarded for recognizing the failure and restarting.

2. Methodology: Re2 (Reinforcement Learning with Re-solving)

The authors propose Re2, a novel framework that enables LLMs to learn when to abandon an unproductive reasoning path and restart the problem from scratch. Unlike previous methods that rely on Supervised Fine-Tuning (SFT) or complex decoding strategies, Re2 relies purely on Reinforcement Learning.

Core Mechanism

The training process combines a two-stage generation procedure with an outcome-based reward strategy:

Prefix Group Generation:
- For a given query, the model samples $n$ full responses.
- Each response is randomly truncated to create $n$ diverse prefixes (intermediate reasoning states).
- For each prefix, the model generates $m$ continuations (CoT extensions).
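The sampling steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_prefix_groups`, the `generate(query, prefix)` sampler signature, and `toy_generate` are hypothetical names, and the uniform choice of truncation point is an assumption, since the summary only says each response is "randomly truncated".

```python
import random
from typing import Callable, Dict, List

Tokens = List[str]
# Hypothetical sampler signature: (query, prefix_tokens) -> remaining tokens.
Generate = Callable[[str, Tokens], Tokens]


def build_prefix_groups(query: str, generate: Generate, n: int, m: int,
                        rng: random.Random) -> List[Dict[str, object]]:
    """Return n groups, each holding one truncated prefix and m continuations."""
    groups = []
    for _ in range(n):
        full = generate(query, [])  # one full response sampled from scratch
        # Assumption: the cut point is uniform over token positions.
        cut = rng.randrange(len(full)) if full else 0
        prefix = full[:cut]
        # Each prefix gets its own group of m continuations.
        continuations = [generate(query, prefix) for _ in range(m)]
        groups.append({"prefix": prefix, "continuations": continuations})
    return groups


def toy_generate(query: str, prefix: Tokens) -> Tokens:
    """Deterministic stand-in for an LLM: always finishes the same four steps."""
    steps = ["step1", "step2", "step3", "answer"]
    return steps[len(prefix):]


groups = build_prefix_groups("2 + 2 = ?", toy_generate, n=3, m=2,
                             rng=random.Random(0))
for g in groups:
    print(len(g["prefix"]), "prefix tokens ->",
          len(g["continuations"]), "continuations")
```

With a real model, `generate` would sample stochastically, so the $m$ continuations of one prefix would differ from each other; the toy sampler is deterministic only to keep the sketch runnable.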
Action Space:
Each continuation ends in one of three outcomes:
- Correct Answer: The model commits to a final answer that is verified as correct.
- Incorrect Answer: The model commits to a final answer that is wrong.
- Re-solve (Redo): The model explicitly indicates that the current path is unpromising and restarts the problem from scratch.
Reward Strategy:
- Correct Answer: Reward = 1.
- Incorrect Answer: Reward = 0.
- Re-solve Action: The reward is calculated as the expected success rate of solving the problem from scratch. This is estimated using "out-of-group" completions (continuations from other prefixes in the same batch).
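- Formula: Let $r_{i,j}$ be the reward for the $j$-th continuation of prefix $i$, let $P_{\neq i}(\text{correct})$ and $P_{\neq i}(\text{resolve})$ be the empirical probabilities of a correct answer and of a re-solve action among continuations from the other prefixes, and let $R$ be the maximum number of allowed redo rounds. The reward for re-solving is:
  $r_{i,j} = P_{\neq i}(\text{correct}) \cdot \frac{1 - P_{\neq i}(\text{resolve})^R}{1 - P_{\neq i}(\text{resolve})}$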
- This mechanism incentivizes the model to re-solve when the current trajectory has a low probability of success (i.e., when the expected reward of continuing is lower than the expected reward of restarting).
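The reward rules above can be sketched in a few lines. This is an illustrative reading rather than the paper's code: the outcome labels, `resolve_reward`, and `rewards_for_groups` are hypothetical names, and the handling of the edge case $P_{\neq i}(\text{resolve}) = 1$ is an assumption. One way to read the fraction is as the geometric sum $\sum_{k=0}^{R-1} P_{\neq i}(\text{resolve})^k$, i.e. the chance of eventually answering correctly after zero, one, ... up to $R-1$ further restarts.

```python
from typing import List, Sequence

# Hypothetical outcome labels for one continuation.
CORRECT, INCORRECT, RESOLVE = "correct", "incorrect", "resolve"


def resolve_reward(p_correct: float, p_resolve: float, max_rounds: int) -> float:
    """Reward for a re-solve action, following the formula in the text."""
    if p_resolve >= 1.0:
        # Assumption: use the limit of the geometric sum to avoid dividing by zero.
        return p_correct * max_rounds
    return p_correct * (1.0 - p_resolve ** max_rounds) / (1.0 - p_resolve)


def rewards_for_groups(outcomes: Sequence[Sequence[str]],
                       max_rounds: int) -> List[List[float]]:
    """outcomes[i][j] is the outcome of continuation j of prefix i."""
    rewards = []
    for i, group in enumerate(outcomes):
        # Out-of-group statistics: every continuation that does not belong to prefix i.
        others = [o for k, g in enumerate(outcomes) if k != i for o in g]
        p_c = others.count(CORRECT) / len(others) if others else 0.0
        p_r = others.count(RESOLVE) / len(others) if others else 0.0
        redo = resolve_reward(p_c, p_r, max_rounds)
        rewards.append([1.0 if o == CORRECT else 0.0 if o == INCORRECT else redo
                        for o in group])
    return rewards


example = [
    [CORRECT, RESOLVE],
    [CORRECT, INCORRECT],
    [RESOLVE, RESOLVE],
]
print(rewards_for_groups(example, max_rounds=2))
```

In this toy batch, the re-solve action in the first group is scored only from the four continuations of the other two prefixes, so a prefix's own outcomes never influence the value assigned to abandoning it.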

Optimization

The method uses a group-wise advantage calculation (similar to DAPO/GRPO) to update the policy. If all continuations in a group yield the same outcome (e.g., all wrong), every advantage in that group is zero and it contributes no learning signal, so the group is filtered out.
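A minimal sketch of this step is shown below, assuming GRPO-style normalization (reward minus group mean, divided by group standard deviation). The summary does not give the exact normalization, so that choice, the `eps` constant, and the function name `group_advantages` are assumptions; the filter here drops any group whose rewards are all identical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def group_advantages(rewards: Sequence[Sequence[float]],
                     eps: float = 1e-6) -> Dict[int, List[float]]:
    """Map group index -> normalized advantages, skipping uninformative groups."""
    kept: Dict[int, List[float]] = {}
    for i, group in enumerate(rewards):
        # Filter: identical rewards give zero advantage for every continuation.
        if not group or max(group) == min(group):
            continue
        mu = mean(group)
        sigma = pstdev(group)  # population standard deviation within the group
        kept[i] = [(r - mu) / (sigma + eps) for r in group]
    return kept


batch = [
    [1.0, 0.0],              # mixed outcomes: kept
    [0.0, 0.0],              # all wrong: filtered out
    [1.0, 0.375, 0.0, 0.375] # includes re-solve rewards: kept
]
advantages = group_advantages(batch)
print(sorted(advantages))
```

Only the first and third groups survive, and within each surviving group the advantages are centered on zero, so continuations are compared against siblings that share the same prefix.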

3. Key Contributions

Identification of Early Reasoning Fragility: The paper provides empirical evidence that the quality of the initial reasoning steps is the primary determinant of final accuracy. Once an LLM enters a "wrong" trajectory, increasing token count rarely leads to recovery.
Re2 Framework: Introduction of a pure RL framework that teaches models to flexibly "give up" on bad paths and restart, effectively simulating human-like strategic reconsideration.
No SFT Required: Re2 achieves these capabilities without preliminary Supervised Fine-Tuning, relying solely on RL to amplify the rare "redo" behavior found in vanilla models (increasing it from ~0.5% to >30%).
Superior Test-Time Scaling: The method demonstrates that allowing models to discard bad samples and retry leads to better performance scaling compared to standard majority voting or fixed-length RLVR.
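One hypothetical way to picture the last contribution is a majority vote that drops samples ending in a redo before counting. The summary does not describe the paper's aggregation procedure, so the `"<redo>"` marker and the function `vote_discarding_redo` below are illustrative assumptions only.

```python
from collections import Counter
from typing import Optional, Sequence

REDO = "<redo>"  # hypothetical marker for a sample that chose to re-solve


def vote_discarding_redo(samples: Sequence[str]) -> Optional[str]:
    """Majority vote over final answers, ignoring samples that ended in a redo."""
    answers = [s for s in samples if s != REDO]
    if not answers:
        return None  # every sample gave up, so there is nothing to vote on
    return Counter(answers).most_common(1)[0][0]


samples = ["7", REDO, REDO, REDO, "7", "9"]
print(vote_discarding_redo(samples))
```

The design point is that an abandoned attempt contributes no vote at all, whereas a model forced to commit would have added a possibly wrong answer to the tally.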

4. Experimental Results

The authors evaluated Re2 on five benchmarks (AIME 2024/2025, AMC 2023, GSM8K, GPQA-Diamond) across five models (ranging from 3B to 14B parameters, including base, instruction-tuned, and reasoning-optimized models).

Performance Gains: Re2 consistently outperformed the strong baseline DAPO (a state-of-the-art RLVR method) across all models and datasets.
- Example: On Qwen2.5-7B-Instruct, Re2 achieved 47.4% average accuracy vs. DAPO's 43.0% (a gain of 4.4 percentage points).
- On DeepSeek-R1-Distill-Llama-8B, Re2 improved accuracy from 55.9% (DAPO) to 60.5% (a gain of 4.6 percentage points).
Test-Time Scaling: As the number of sampled outputs increases, Re2 continues to improve in accuracy, whereas DAPO's performance saturates. Re2 effectively utilizes additional compute by discarding low-quality attempts and retrying.
Behavioral Analysis:
- Re2 significantly reduces the generation of "forced" incorrect answers.
- The model learns to trigger "redo" actions specifically when the reasoning path becomes confused or leads to contradictions, rather than blindly continuing.
- Training dynamics show a rapid increase in "redo" behavior early in training, followed by refinement where the model learns to distinguish between solvable and unsolvable paths.
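At inference time, the restart behavior can be pictured as a simple retry loop bounded by the maximum number of rounds $R$. The sketch below is an assumption about how such a loop might look, not the paper's decoding code: the `"<redo>"` marker, the `attempt` callable, and `solve_with_redo` are hypothetical.

```python
from typing import Callable, Optional, Tuple

REDO = "<redo>"  # hypothetical marker the model emits when it abandons a path


def solve_with_redo(query: str, attempt: Callable[[str], str],
                    max_rounds: int) -> Tuple[Optional[str], int]:
    """Retry from scratch until an attempt commits to an answer or rounds run out."""
    for round_idx in range(1, max_rounds + 1):
        output = attempt(query)  # each round starts again from the bare query
        if REDO not in output:
            return output, round_idx
    return None, max_rounds  # every round ended in a redo


# Scripted stand-in for a model: gives up once, then answers.
scripted = iter(["this is getting contradictory " + REDO, "42"])
answer, rounds_used = solve_with_redo("toy question", lambda q: next(scripted),
                                      max_rounds=3)
print(answer, rounds_used)
```

Because each round restarts from the bare query, nothing from the abandoned attempt is carried over, which matches the "start from scratch" description in the methodology.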

5. Significance

Paradigm Shift: Re2 moves beyond the traditional "single-chain" reasoning paradigm. It validates that flexibility (the ability to restart) is as crucial as depth (the ability to think long) for complex reasoning.
Efficiency: By preventing models from wasting compute on dead-end reasoning paths, Re2 offers a more efficient use of test-time compute.
Generalizability: The method works effectively across different model sizes and types (base vs. reasoning models), suggesting it is a fundamental improvement to the RL training loop for reasoning tasks.
Future Direction: The paper highlights that enabling models to recognize their own uncertainty and restart is a key step toward more reliable and robust AI reasoning, potentially applicable to other domains beyond mathematics.

In conclusion, Re2 demonstrates that teaching LLMs to recognize when they are "stuck" and to restart their reasoning process is a powerful mechanism for unlocking higher performance, outperforming current state-of-the-art RLVR methods that force a single, continuous chain of thought.
