Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Imagine you have a very talented, encyclopedic robot friend who loves to tell stories. This robot can write long, detailed biographies or explain complex topics better than almost anyone. But there's a catch: the robot is a terrible liar who doesn't know it's lying.

If the robot is 100% sure about a fact, it says it with a booming voice. If it's guessing, it still says it with that same booming voice. It might say, "David Bowie was born on Mars," with the exact same confidence as "David Bowie was born in London." This is called hallucination, and it makes the robot's long stories dangerous because you can't tell what's true and what's made up.

The paper introduces a new framework called CURE (Claim-level Uncertainty-aware Reasoning for Factual Generation). Think of CURE as a "truth coach" that teaches the robot a new way of thinking.

Here is how CURE works, broken down into simple analogies:

1. The "Atomic Claim" Breakdown (The Lego Analogy)

Before, when the robot wrote a biography, it wrote a giant, unbroken block of text. If one sentence was wrong, the whole block was suspicious.

CURE changes the game: It forces the robot to break its story down into tiny, individual Lego bricks (called "atomic claims").

Instead of writing a paragraph about David Bowie, the robot lists:
- Brick 1: Born in London.
- Brick 2: Changed his name in 1966.
- Brick 3: Died in 2016.
The Magic: For every single brick, the robot must attach a confidence sticker.
- "Born in London" gets a Gold Sticker (98% sure).
- "Died in 2016" might get a Yellow Sticker (30% sure) because the robot is actually a bit fuzzy on the exact date.

2. The "Three-Stage Training Camp"

The authors realized you can't just tell the robot "be more accurate." It needs a specific training routine to learn how to be unsure when it should be. They use a three-step camp:

Stage 1: The "Format Police" (Feasibility Induction)
First, they teach the robot the rules of the game. It must learn to break its answers into those Lego bricks and attach confidence stickers. If the robot tries to write a giant paragraph without stickers, the "police" (an automated system) send it back to the drawing board. This ensures the robot is actually thinking in small, verifiable pieces.
Stage 2: The "Honesty Coach" (Calibration Optimization)
This is the most important part. The robot is shown examples where it was wrong but acted confident, or right but acted shy.
- Scenario: The robot says, "I am 100% sure the moon is made of cheese."
- Coach: "No! That's wrong. You should have said, 'I am 0% sure.'"
- The robot learns to match its confidence sticker to the truth. If it's guessing, the sticker must be low. If it's sure, the sticker must be high. This stops the robot from being a "confident liar."
Stage 3: The "Fact-Checker" (Factuality Optimization)
Now that the robot knows how to be honest about its uncertainty, they teach it to actually get the facts right. They reward it for getting the Lego bricks correct, but they make sure this training doesn't mess up the "Honesty Coach" lessons from Stage 2.

3. The "Selective Filter" (The Bouncer at the Club)

Once the robot is trained, it becomes incredibly useful in the real world. Imagine you ask the robot for a biography.

Old Robot: Writes a long story. You read it and realize half of it is nonsense, but you didn't know until the end.
CURE Robot: Writes the story, but it acts like a bouncer at a club.
- It looks at its "Gold Sticker" facts (Born in London) and lets them into the final story.
- It looks at its "Yellow Sticker" facts (The exact date of death) and says, "I'm not sure enough about this one," and leaves it out.
- Result: The final story is shorter, but everything in it is highly reliable. The robot admits, "I don't know the rest," rather than making things up.

Why This Matters

The paper shows that by teaching the robot to think about what it doesn't know, it actually becomes smarter and more truthful.

Accuracy: It gets more facts right because it stops guessing confidently.
Trust: You can look at the confidence stickers and know exactly which parts of the story to trust and which parts to double-check.
Safety: In long stories (like medical advice or news), this prevents the robot from confidently spreading dangerous misinformation.

In short: CURE turns a robot that confidently lies into a robot that carefully checks its own work, admits when it's unsure, and only tells you what it knows for a fact. It's the difference between a confident salesperson selling you a fake watch and a cautious librarian who only hands you books they've verified.

):** The model generates a chain of thought where it explicitly identifies candidate claims and expresses uncertainty in natural language (e.g., "I am not certain about this date"). 2. **Decomposition (`):** The model outputs a structured list of atomic claims, each paired with an explicit numerical confidence estimate ( $p_i \in [0, 1]$ ).

Atomic Claims: Each claim is a semantically coherent, independently verifiable unit (e.g., "David Bowie was born in 1947").
Explicit Confidence: The model assigns a confidence score to each claim, reflecting its internal belief in the claim's correctness.

B. Multi-Stage Training Pipeline

To prevent the conflict between optimizing for correctness and optimizing for calibration (which often leads to degenerate solutions like uniform high confidence), CURE employs a decoupled, multi-stage training pipeline:

Stage 1: Feasibility Induction (SFT + GRPO):
- Supervised Fine-Tuning (SFT): Initializes the model to follow the structured reasoning protocol. Confidence labels are corrected using an external verifier (VeriScore) and an LLM to ensure high-quality priors.
- Group Relative Policy Optimization (GRPO): Enforces feasibility constraints (relevance, verifiability, and faithfulness) to ensure the model generates well-formed, verifiable claims that strictly adhere to the reasoning trace.
Stage 2: Calibration Optimization (DPO):
- Goal: Align the model's predicted confidence ( $p_i$ ) with the empirical correctness ( $z_i$ ) of the claims.
- Method: Uses Direct Preference Optimization (DPO). The system constructs preference pairs where a "preferred" response ( $y_w$ ) has confidence scores corrected to match the ground truth (high confidence for correct claims, low for incorrect ones), while the "rejected" response ( $y_l$ ) has mismatched confidence.
- Why DPO? The authors found that using GRPO for calibration alone fails because reward signals are dominated by content variation (whether claims are correct) rather than calibration quality. DPO isolates the learning signal for confidence alignment.
Stage 3: Factuality Optimization (GRPO):
- Goal: Maximize factual accuracy without disrupting the calibrated confidence learned in Stage 2.
- Method: Uses GRPO with a token-masked reward. The reward is calculated based on claim correctness but is applied only to the tokens representing the claim content, masking the confidence reasoning tokens. This ensures the model improves factuality without re-learning or corrupting its uncertainty estimates.

C. Selective Prediction at Inference

During inference, the model generates claims with calibrated confidence scores. A threshold $\tau$ is applied to filter out claims with $p_i < \tau$ . The final response is constructed only from high-confidence claims, allowing the model to abstain from uncertain information, thereby increasing overall reliability.

3. Key Contributions

Claim-Level Calibration: Introduced a protocol that moves beyond global response confidence to fine-grained, atomic claim-level uncertainty estimation.
Decoupled Training Strategy: Demonstrated that jointly optimizing for factuality and calibration leads to overconfidence. Proposed a novel pipeline that structurally separates calibration (via DPO) from factuality optimization (via masked GRPO).
Selective Prediction: Enabled a mechanism for models to dynamically trade off recall for precision at inference time by filtering low-confidence claims, a capability absent in standard RL baselines.
New Benchmarking: Validated the approach across four diverse long-form factuality benchmarks (FactBench, LongFact, Biography, FactRBench).

4. Experimental Results

Experiments were conducted on Llama3.1-8B-Instruct and Qwen3-4B.

Factual Accuracy: CURE achieved state-of-the-art results across all benchmarks.
- Biography: Improved claim-level accuracy by 39.9% compared to the base model and significantly outperformed the strong RL baseline L2RF.
- FactBench: Achieved 84.4% accuracy (vs. 77.1% for L2RF).
- LongFact: Achieved 90.2% accuracy.
Calibration Quality:
- CURE significantly improved the AUROC (Area Under the ROC Curve), a metric measuring the ability to distinguish correct from incorrect claims. On FactBench, AUROC increased from 0.541 (Base) to 0.667 (CURE).
- The decoupled approach successfully prevented the "overconfidence collapse" seen in joint optimization baselines.
Ablation Studies:
- Joint Optimization Failure: Attempts to jointly optimize calibration and factuality via GRPO (even with weighted Brier scores) resulted in poor calibration (AUROC ~0.62) and severe overconfidence (95% of predictions > 0.9).
- Necessity of Decoupling: The staged approach (SFT → Calibration via DPO → Factuality via GRPO) was essential for achieving both high accuracy and high calibration.
- Feasibility RL: The initial feasibility stage was critical for establishing a reliable reasoning space; without it, factuality rewards became noisy due to unverifiable claims.

5. Significance

CURE represents a paradigm shift in improving LLM factuality. By explicitly modeling uncertainty at the claim level and decoupling calibration from correctness optimization, it addresses the root cause of confident hallucinations.

Trustworthiness: It provides end-users with transparent signals of uncertainty, allowing them to gauge the reliability of specific facts within a long response.
Controllability: The selective prediction mechanism allows users to tune the trade-off between coverage (recall) and accuracy (precision) without retraining the model.
Generalizability: The framework is effective across different model architectures (Transformer-based LLMs and reasoning-specialized models), suggesting it is a robust solution for the broader challenge of hallucination in generative AI.

Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

1. The "Atomic Claim" Breakdown (The Lego Analogy)

2. The "Three-Stage Training Camp"

3. The "Selective Filter" (The Bouncer at the Club)

Why This Matters

B. Multi-Stage Training Pipeline

C. Selective Prediction at Inference

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

LLMs Struggle with Abstract Meaning Comprehension More Than Expected

Benchmarking Deflection and Hallucination in Large Vision-Language Models

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG