Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

The paper introduces CURE, a framework that improves long-form generation factuality by training large language models to estimate confidence at the level of individual claims, using a structured reasoning protocol and multi-stage calibration. The result is more accurate selective prediction and significantly fewer hallucinations.

Xin Liu, Lu Wang

Published 2026-04-15

Imagine you have a very talented, encyclopedic robot friend who loves to tell stories. This robot can write long, detailed biographies or explain complex topics better than almost anyone. But there's a catch: the robot is a terrible liar who doesn't know it's lying.

If the robot is 100% sure about a fact, it says it with a booming voice. If it's guessing, it still says it with that same booming voice. It might say, "David Bowie was born on Mars," with the exact same confidence as "David Bowie was born in London." This is called hallucination, and it makes the robot's long stories dangerous because you can't tell what's true and what's made up.

The paper introduces a new framework called CURE (Claim-level Uncertainty-aware Reasoning for Factual Generation). Think of CURE as a "truth coach" that teaches the robot a new way of thinking.

Here is how CURE works, broken down into simple analogies:

1. The "Atomic Claim" Breakdown (The Lego Analogy)

Before CURE, when the robot wrote a biography, it produced one giant, unbroken block of text. If one sentence was wrong, the whole block was suspect.

CURE changes the game: It forces the robot to break its story down into tiny, individual Lego bricks (called "atomic claims").

  • Instead of writing a paragraph about David Bowie, the robot lists:
    • Brick 1: Born in London.
    • Brick 2: Changed his name in 1966.
    • Brick 3: Died in 2016.
  • The Magic: For every single brick, the robot must attach a confidence sticker (a minimal sketch of this structure follows the list).
    • "Born in London" gets a Gold Sticker (98% sure).
    • "Died in 2016" might get a Yellow Sticker (30% sure) because the robot is actually a bit fuzzy on the year.

2. The "Three-Stage Training Camp"

The authors realized you can't just tell the robot "be more accurate." It needs a specific training routine to learn how to be unsure when it should be. They use a three-stage camp:

  • Stage 1: The "Format Police" (Feasibility Induction)
    First, they teach the robot the rules of the game. It must learn to break its answers into those Lego bricks and attach confidence stickers. If the robot tries to write a giant paragraph without stickers, the "police" (an automated system) send it back to the drawing board. This ensures the robot is actually thinking in small, verifiable pieces.

  • Stage 2: The "Honesty Coach" (Calibration Optimization)
    This is the most important part. The robot is shown examples where it was wrong but acted confident, or right but acted shy.

    • Scenario: The robot says, "I am 100% sure the moon is made of cheese."
    • Coach: "No! That's wrong. You should have said, 'I am 0% sure.'"
    • The robot learns to match its confidence sticker to the truth. If it's guessing, the sticker must be low. If it's sure, the sticker must be high. This stops the robot from being a "confident liar." (A toy version of this reward is sketched after this list.)
  • Stage 3: The "Fact-Checker" (Factuality Optimization)
    Now that the robot knows how to be honest about its uncertainty, they teach it to actually get the facts right. They reward it for getting the Lego bricks correct, but they make sure this training doesn't mess up the "Honesty Coach" lessons from Stage 2.
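
To ground the training camp, here is a toy Python sketch of Stages 1 and 2. Everything in it is an assumption made for illustration: the `claim [confidence: X]` line format, the function names, and the Brier-style squared-error reward are stand-ins, not the paper's actual protocol or objectives.

```python
import re

def passes_format_check(output: str) -> bool:
    """Stage 1 "Format Police": every line must be an atomic claim followed by
    a confidence sticker, e.g. 'Born in London. [confidence: 0.98]'.
    (This line format is an assumption for illustration.)"""
    pattern = re.compile(r"^.+\s\[confidence:\s*(0(\.\d+)?|1(\.0+)?)\]$")
    return all(pattern.match(line) for line in output.strip().splitlines())

def calibration_reward(confidence: float, is_correct: bool) -> float:
    """Stage 2 "Honesty Coach": a Brier-style score that is highest when the
    sticker matches reality. A confident liar (high confidence, wrong) scores worst."""
    target = 1.0 if is_correct else 0.0
    return 1.0 - (confidence - target) ** 2

# Stage 1: reject free-form paragraphs, accept brick-plus-sticker lines.
print(passes_format_check("David Bowie was a great musician born somewhere."))  # False
print(passes_format_check("Born in London. [confidence: 0.98]"))                # True

# Stage 2: the "moon is made of cheese" lesson.
print(calibration_reward(1.00, False))  # 0.0  -- confident and wrong: worst case
print(calibration_reward(0.00, False))  # 1.0  -- admits "0% sure": honest
print(calibration_reward(0.98, True))   # ~1.0 -- confident and right

# Stage 3 would add a factuality reward on top while preserving the calibration
# behavior learned here; that combination is illustrative, not the paper's exact objective.
```

The key property of the Brier-style reward shows up in the printed examples: honesty about ignorance scores as well as confident truth, and confident falsehood scores worst of all.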

3. The "Selective Filter" (The Bouncer at the Club)

Once the robot is trained, it becomes incredibly useful in the real world. Imagine you ask the robot for a biography.

  • Old Robot: Writes a long story. You read it and realize half of it is nonsense, but you have no way of telling which half without checking every fact yourself.
  • CURE Robot: Writes the story, but it acts like a bouncer at a club.
    • It looks at its "Gold Sticker" facts (Born in London) and lets them into the final story.
    • It looks at its "Yellow Sticker" facts (the year Bowie died) and says, "I'm not sure enough about this one," and leaves it out.
    • Result: The final story is shorter, but everything in it is highly reliable. The robot admits, "I don't know the rest," rather than making things up. (A minimal version of this filter follows.)
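
The bouncer reduces to a one-line filter: keep only the claims whose sticker clears a threshold. This sketch reuses the hypothetical `AtomicClaim` list from the first example; the 0.9 cutoff is an arbitrary illustrative choice.

```python
def selective_filter(claims: list[AtomicClaim], threshold: float = 0.9) -> list[AtomicClaim]:
    """The bouncer: only claims with a high-enough confidence sticker get in."""
    return [c for c in claims if c.confidence >= threshold]

reliable = selective_filter(claims)
print(" ".join(c.text for c in reliable))
# -> "David Bowie was born in London."  (shorter, but every claim kept is trusted)
```

Raising the threshold makes the story shorter but more trustworthy; lowering it covers more facts at higher risk. That dial is exactly what selective prediction exposes.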

Why This Matters

The paper shows that by teaching the robot to think about what it doesn't know, it actually becomes smarter and more truthful.

  • Accuracy: It gets more facts right because it stops guessing confidently.
  • Trust: You can look at the confidence stickers and know exactly which parts of the story to trust and which parts to double-check.
  • Safety: In long stories (like medical advice or news), this prevents the robot from confidently spreading dangerous misinformation.

In short: CURE turns a robot that confidently lies into a robot that carefully checks its own work, admits when it's unsure, and only tells you what it knows for a fact. It's the difference between a confident salesperson selling you a fake watch and a cautious librarian who only hands you books they've verified.
