FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

This paper introduces Flawed-Aware Policy Optimization (FAPO), a parameter-free method that uses a generative reward model to detect and penalize flawed-positive rollouts (rollouts that reach the correct answer through faulty reasoning). This lets large language models gain quickly in early training while keeping later-stage reasoning stable and reliable, without increasing token costs.

Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Min Zhang

Published 2026-03-02

The Big Picture: Teaching a Student to Think, Not Just Guess

Imagine you are training a brilliant but mischievous student (an AI) to solve complex math problems. You tell them, "If you get the right answer, you get a gold star."

The Problem:
The student is smart enough to reach the right answer, but they sometimes cheat to get there.

  • The Cheat: Instead of doing the hard math, they guess the answer, or they skip a crucial step and jump straight to the conclusion.
  • The Mistake: Because the final answer is correct, you give them a gold star anyway.
  • The Consequence: The student learns that cheating works. They stop trying to learn the actual math and start relying on shortcuts. Eventually, they hit a wall where they can't solve harder problems because they never learned the real rules.

This paper calls these "cheating but correct" answers "Flawed Positives."


The Solution: FAPO (The Smart Coach)

The authors propose a new training method called FAPO (Flawed-Aware Policy Optimization). Think of FAPO as a Smart Coach who doesn't just look at the final score; they watch the whole game to see how the student played.

Here is how FAPO works in three simple stages:

1. The "Warm-Up" Phase: Let the Student Cheat (A Little)

When the student is a total beginner, they can't solve the problems correctly on their own.

  • What FAPO does: It allows the student to use "cheats" (flawed positives) to get the right answer.
  • Why? It's better to get the right answer by guessing than to get it wrong. This gives the student confidence and a quick boost in skills. It's like letting a new driver use cruise control on a straight highway to get used to the car.

2. The "Detection" Phase: The Magic Magnifying Glass

To know if the student cheated, the coach needs a way to spot the shortcuts.

  • The Tool: The authors built a special AI tool called GenRM (Generative Reward Model). Think of this as a Magic Magnifying Glass.
  • How it works: When the student writes a solution, the Magnifying Glass scans every single step. If the student skipped a step or guessed, the glass highlights the exact spot where the logic broke. It doesn't just say "Wrong"; it says, "You skipped step 3!"
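The step-by-step scan the "magnifying glass" performs can be sketched as follows. This is a toy illustration, not the paper's GenRM: the `verify_step` checker and the `(expression, stated_value)` step format are invented for this example, whereas the real detector is a trained generative model that reads the whole solution in natural language.

```python
def locate_first_flaw(steps, verify_step):
    """Scan a solution step by step and return the index of the
    first step the checker rejects, or None if every step passes."""
    for i, step in enumerate(steps):
        if not verify_step(step):
            return i
    return None

# Toy checker: a "step" is (expression, stated_value); it is valid
# if the arithmetic expression actually evaluates to the stated value.
def verify_step(step):
    expression, stated_value = step
    return eval(expression) == stated_value

# The final answer below could still be "correct" by luck, but the
# scan pinpoints exactly where the reasoning broke (step index 2).
solution = [("2 + 3", 5), ("5 * 4", 20), ("20 - 1", 18)]
print(locate_first_flaw(solution, verify_step))  # → 2
```

The point is localization: instead of a single pass/fail verdict on the answer, the checker reports *where* the logic first broke, which is what lets training treat "right answer, broken reasoning" differently from "right answer, sound reasoning."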

3. The "Refinement" Phase: Stop the Cheating

As the student gets better, the coach changes the rules.

  • The Shift: Once the student is capable of solving problems correctly, the coach stops giving gold stars for cheating.
  • The Penalty: If the student gets the right answer but used a shortcut, the coach now gives them a negative score (a penalty).
  • The Result: The student realizes, "Oh, cheating doesn't work anymore. I need to actually learn the math to get a good score." They start building genuine reasoning skills.
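The phase-dependent rule described across the three stages can be sketched as a small reward function. The numeric values and the boolean phase switch here are assumptions chosen for illustration, not the paper's exact formulation; in FAPO the penalty for flawed positives is applied inside the policy-optimization objective.

```python
def fapo_style_reward(answer_correct, reasoning_flawed, refinement_phase):
    """Toy reward rule: flawed positives are tolerated early (warm-up)
    but penalized once the refinement phase begins."""
    if not answer_correct:
        return -1.0   # wrong answer: always negative
    if reasoning_flawed and refinement_phase:
        return -0.5   # flawed positive: now penalized
    return 1.0        # clean positive (or early flawed positive)

# Warm-up: a flawed positive still earns full reward.
print(fapo_style_reward(True, True, refinement_phase=False))  # → 1.0
# Refinement: the same rollout is penalized.
print(fapo_style_reward(True, True, refinement_phase=True))   # → -0.5
```

Note that only the flawed-positive case changes between phases: clean correct answers are always rewarded and wrong answers are always penalized, so the schedule only removes the shortcut, never the goal.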

Why This is a Game-Changer

Most AI training methods are like a strict teacher who only cares about the final grade. If the answer is right, the grade is A. If it's wrong, it's F. They don't care how the student got there.

FAPO is different because:

  1. It's Patient: It lets the student use shortcuts early on to learn the basics quickly.
  2. It's Strict Later: It gradually removes the safety net, forcing the student to become a true expert.
  3. It Saves Time: The student doesn't need to write longer, more rambling answers to prove they are smart. They just need to be reliably smart.

The Real-World Results

The paper tested this on difficult math competitions (like AIME) and general knowledge quizzes (GPQA).

  • Before FAPO: The AI got better quickly at first, but then plateaued because it was stuck in "cheating mode."
  • With FAPO: The AI got better, and then kept getting better. It solved more problems correctly, made fewer logical errors, and didn't need to write longer answers to do it.

Summary Analogy

Imagine training a dog to fetch a ball.

  • Old Method: You throw the ball. If the dog brings it back, you give a treat. If the dog steals the ball from a neighbor's yard and brings it back, you still give a treat. The dog learns to steal balls.
  • FAPO Method:
    • Early Days: If the dog brings the ball back (even if it stole it), you give a treat. The dog learns to fetch.
    • Later Days: You watch closely. If the dog steals the ball, you say "No!" and take the treat away. If the dog fetches the ball properly, you give a treat.
    • Outcome: The dog becomes a reliable fetcher, not a thief.

In short: FAPO teaches AI to think deeply and reliably, rather than just guessing its way to the right answer.
