Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

This paper introduces ConStory-Bench, a benchmark of 2,000 prompts with a detailed error taxonomy, alongside the automated ConStory-Checker pipeline, to systematically evaluate the prevalence and patterns of consistency errors in long-form story generation by Large Language Models.

Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie

Published 2026-03-09

Imagine you hire a brilliant, hyper-fast storyteller to write a massive novel for you—say, 10,000 words long. You give them a prompt: "Write a story about a mother and child in a broken-down car in 1982."

The AI starts typing. It's amazing at first. But by the time it reaches page 50, something strange happens. The mother suddenly has a son who is 15 years old, even though the story said he was five just a few pages earlier. The car, which was a Pontiac, is now a Ford. The year flips from 1982 to 1995. The story is still fluent and exciting, but it has forgotten its own rules.

This is the problem ConStory-Bench is trying to solve.

Here is a simple breakdown of the paper, using some everyday analogies.

1. The Problem: The "Amnesiac" Storyteller

Large Language Models (LLMs) are like incredibly talented actors who can improvise a whole play on the spot. But if the play gets too long, they start to forget their lines, their character's backstory, or the rules of the world they are in.

  • The Old Way: Previous tests for AI writers mostly asked, "Is this story fun? Is the grammar good?" They didn't really check if the story made sense from start to finish.
  • The New Discovery: The researchers found that as stories get longer, the AI gets "lost in the sauce." It starts contradicting itself more often, especially regarding facts (what color is the car?) and time (did this happen before or after that?).

2. The Solution: The "Story Detective" (ConStory-Bench)

To fix this, the team built a new testing ground called ConStory-Bench. Think of this as a giant gym where they put AI models to the test.

  • The Workout: They gave 2,000 different prompts to various AIs, asking them to write stories between 8,000 and 10,000 words long.
  • The Scorecard: They didn't just count mistakes. They created a specific "Error Taxonomy" (a catalog of mistake types). Imagine a detective's checklist with 19 different types of clues:
    • Timeline: Did the character age backwards?
    • Character: Did the hero suddenly forget how to use a sword they learned in Chapter 1?
    • World Rules: Did gravity suddenly stop working in a fantasy world?
    • Details: Did the character's eye color change from blue to brown?
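To make the checklist concrete, here is a toy sketch of how one entry on it might be recorded. The field names and the four coarse buckets are illustrative assumptions on my part; the paper's actual taxonomy has 19 fine-grained error types.

```python
# A toy record for one consistency error: which bucket it falls into,
# plus the two sentences that contradict each other.
# (Field names and categories are illustrative, not the paper's schema.)
from dataclasses import dataclass

CATEGORIES = {"timeline", "character", "world_rules", "details"}

@dataclass
class ConsistencyError:
    category: str        # one of the coarse buckets above
    early_evidence: str  # what the story established first
    late_evidence: str   # the later sentence that contradicts it

    def __post_init__(self):
        # Reject anything outside the known checklist.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

err = ConsistencyError(
    category="details",
    early_evidence="Her eyes were a pale blue.",
    late_evidence="He stared into her brown eyes.",
)
print(err.category)  # details
```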

3. The Tool: The "Auto-Inspector" (ConStory-Checker)

Checking 2,000 stories for contradictions by hand would take a human years. So, the researchers built ConStory-Checker.

  • How it works: Imagine a super-attentive editor who reads the story twice.
    1. Scan: It looks for suspicious spots (like a mention of "snow" in July).
    2. Pair Up: It finds the earlier part of the story that said "it was summer" and pairs it with the "snow" comment.
    3. Evidence: It doesn't just say "Error!" It points to the exact sentences: "Look, on page 2 you said it was July, but on page 50 you said it was snowing."
  • The Result: This tool is so good that it actually found more mistakes than human experts did in their tests. Humans get tired and miss subtle contradictions; the AI inspector never blinks.
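The scan → pair up → evidence steps above can be sketched in a few lines of Python. Everything here is a simplification I'm assuming for illustration: the regex "fact extractor" is a toy stand-in (the real checker uses an LLM, not pattern matching), and the function names are mine, not the paper's.

```python
# Minimal sketch of a scan -> pair -> evidence contradiction check.
# The toy fact pattern and function names are illustrative assumptions,
# not the paper's actual implementation.
import re
from typing import NamedTuple

class Fact(NamedTuple):
    position: int   # character offset in the story
    attribute: str  # e.g. "car_color"
    value: str      # e.g. "red"
    sentence: str   # the exact evidence span

# Toy pattern: "the car was <color>" -- stand-in for a real fact extractor.
CAR_COLOR = re.compile(r"the car was (\w+)", re.IGNORECASE)

def scan(story: str) -> list[Fact]:
    """Step 1: find suspicious spots (candidate factual statements)."""
    return [
        Fact(m.start(), "car_color", m.group(1).lower(), m.group(0))
        for m in CAR_COLOR.finditer(story)
    ]

def pair_and_check(facts: list[Fact]) -> list[tuple[Fact, Fact]]:
    """Steps 2-3: pair each fact with the earliest fact about the same
    attribute, and report contradictory pairs with both evidence spans."""
    contradictions = []
    first_seen: dict[str, Fact] = {}
    for fact in facts:
        earlier = first_seen.get(fact.attribute)
        if earlier and earlier.value != fact.value:
            contradictions.append((earlier, fact))
        first_seen.setdefault(fact.attribute, fact)
    return contradictions

story = "At dawn the car was red. ... Fifty pages later, the car was blue."
for early, late in pair_and_check(scan(story)):
    print(f"Contradiction: '{early.sentence}' vs '{late.sentence}'")
```

The key design point the analogy highlights: the output is not just "Error!" but a pair of evidence spans, so a human can verify the contradiction directly.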

4. What They Found: The "Mid-Story Slump"

After testing dozens of AI models (from big companies like OpenAI and Google to open-source ones), they found some fascinating patterns:

  • The "Middle" is Dangerous: Errors don't happen evenly. They tend to cluster in the middle of the story (around the 40–60% mark). It's like a marathon runner who starts strong, hits a "wall" in the middle, and then tries to finish. The AI forgets the beginning while trying to write the middle.
  • The "Confusion" Signal: The researchers looked at the AI's internal uncertainty (measured as the entropy of its next-token probabilities). They found that right before the AI makes a mistake, it gets uncertain. It's like a person hesitating before speaking. If the AI is "guessing" too much between possible next words, it's about to mess up the facts.
  • The "Chain Reaction": If an AI gets the facts wrong (e.g., the car is red), it often messes up the character's memory too (e.g., the character remembers the car being blue). These errors tend to travel in packs.
  • The Best Performers: Currently, the model GPT-5-Reasoning is the best at keeping the story straight, followed closely by some Google and Anthropic models. But even the best ones still make mistakes.
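The "confusion signal" above has a simple mathematical core: Shannon entropy over the model's next-token probability distribution. The sketch below uses made-up probability values for illustration; a real setup would read these from the model's output layer.

```python
# Shannon entropy of a next-token distribution: higher entropy means the
# model is spreading probability across many options, i.e. "guessing".
# The probability values are invented for illustration.
import math

def entropy(probs: list[float]) -> float:
    """H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]  # one clear favorite next token
hesitant  = [0.25, 0.25, 0.25, 0.25]  # four equally likely options

print(f"confident: {entropy(confident):.2f} bits")  # ~0.24 bits
print(f"hesitant:  {entropy(hesitant):.2f} bits")   # 2.00 bits
```

In the paper's framing, a spike in this quantity mid-generation is the "hesitation" that tends to precede a consistency error.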

5. Why This Matters

Think of long-form storytelling like building a skyscraper.

  • Fluency is the paint and the windows.
  • Consistency is the steel beams holding it up.

If the beams are weak (the story contradicts itself), the whole building might look pretty, but it will collapse under its own weight. This paper gives us the tools to inspect the steel beams.

The Bottom Line:
AI is getting great at writing long stories, but it still suffers from "short-term memory loss" when the story gets too long. This research gives us a way to measure that memory loss, find exactly where it happens, and eventually teach the AI to be a more reliable storyteller who never forgets its own plot.