Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

This paper introduces ConStory-Bench, a benchmark of 2,000 prompts with a detailed error taxonomy, alongside the automated ConStory-Checker pipeline, to systematically evaluate the prevalence and patterns of consistency errors in long-form story generation by Large Language Models.

Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie

Published 2026-03-09

Imagine you hire a brilliant, hyper-fast storyteller to write a massive novel for you—say, 10,000 words long. You give them a prompt: "Write a story about a mother and child in a broken-down car in 1982."

The AI starts typing. It's amazing at first. But by the time it reaches page 50, something strange happens. The mother suddenly has a son who is 15 years old, even though the story said he was five just a few pages earlier. The car, which was a Pontiac, is now a Ford. The year flips from 1982 to 1995. The story is still fluent and exciting, but it has forgotten its own rules.

This is the problem ConStory-Bench is trying to solve.

Here is a simple breakdown of the paper, using some everyday analogies.

1. The Problem: The "Amnesiac" Storyteller

Large Language Models (LLMs) are like incredibly talented actors who can improvise a whole play on the spot. But if the play gets too long, they start to forget their lines, their character's backstory, or the rules of the world they are in.

  • The Old Way: Previous tests for AI writers mostly asked, "Is this story fun? Is the grammar good?" They didn't really check if the story made sense from start to finish.
  • The New Discovery: The researchers found that as stories get longer, the AI gets "lost in the sauce." It starts contradicting itself more often, especially regarding facts (what color is the car?) and time (did this happen before or after that?).

2. The Solution: The "Story Detective" (ConStory-Bench)

To fix this, the team built a new testing ground called ConStory-Bench. Think of this as a giant gym where they put AI models to the test.

  • The Workout: They gave 2,000 different prompts to various AIs, asking them to write stories between 8,000 and 10,000 words long.
  • The Scorecard: They didn't just count mistakes. They created a specific "Error Taxonomy" (a catalog of mistake types). Imagine a detective's checklist with 19 different types of clues:
    • Timeline: Did the character age backwards?
    • Character: Did the hero suddenly forget how to use a sword they learned in Chapter 1?
    • World Rules: Did gravity suddenly stop working in a fantasy world?
    • Details: Did the character's eye color change from blue to brown?
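To make the checklist concrete, here is a toy sketch of how one entry on it might be recorded. The field names and the four coarse buckets are illustrative assumptions on my part; the paper's actual taxonomy has 19 fine-grained error types.

```python
# A toy record for one consistency error: which bucket it falls into,
# plus the two sentences that contradict each other.
# (Field names and categories are illustrative, not the paper's schema.)
from dataclasses import dataclass

CATEGORIES = {"timeline", "character", "world_rules", "details"}

@dataclass
class ConsistencyError:
    category: str        # one of the coarse buckets above
    early_evidence: str  # what the story established first
    late_evidence: str   # the later sentence that contradicts it

    def __post_init__(self):
        # Reject anything outside the known checklist.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

err = ConsistencyError(
    category="details",
    early_evidence="Her eyes were a pale blue.",
    late_evidence="He stared into her brown eyes.",
)
print(err.category)  # details
```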

3. The Tool: The "Auto-Inspector" (ConStory-Checker)

Checking 2,000 stories for contradictions by hand would take a human years. So, the researchers built ConStory-Checker.

  • How it works: Imagine a super-attentive editor who reads the story twice.
    1. Scan: It looks for suspicious spots (like a mention of "snow" in July).
    2. Pair Up: It finds the earlier part of the story that said "it was summer" and pairs it with the "snow" comment.
    3. Evidence: It doesn't just say "Error!" It points to the exact sentences: "Look, on page 2 you said it was July, but on page 50 you said it was snowing."
  • The Result: This tool is so good that it actually found more mistakes than human experts did in their tests. Humans get tired and miss subtle contradictions; the AI inspector never blinks.
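The scan → pair up → evidence steps above can be sketched in a few lines of Python. Everything here is a simplification I'm assuming for illustration: the regex "fact extractor" is a toy stand-in (the real checker uses an LLM, not pattern matching), and the function names are mine, not the paper's.

```python
# Minimal sketch of a scan -> pair -> evidence contradiction check.
# The toy fact pattern and function names are illustrative assumptions,
# not the paper's actual implementation.
import re
from typing import NamedTuple

class Fact(NamedTuple):
    position: int   # character offset in the story
    attribute: str  # e.g. "car_color"
    value: str      # e.g. "red"
    sentence: str   # the exact evidence span

# Toy pattern: "the car was <color>" -- stand-in for a real fact extractor.
CAR_COLOR = re.compile(r"the car was (\w+)", re.IGNORECASE)

def scan(story: str) -> list[Fact]:
    """Step 1: find suspicious spots (candidate factual statements)."""
    return [
        Fact(m.start(), "car_color", m.group(1).lower(), m.group(0))
        for m in CAR_COLOR.finditer(story)
    ]

def pair_and_check(facts: list[Fact]) -> list[tuple[Fact, Fact]]:
    """Steps 2-3: pair each fact with the earliest fact about the same
    attribute, and report contradictory pairs with both evidence spans."""
    contradictions = []
    first_seen: dict[str, Fact] = {}
    for fact in facts:
        earlier = first_seen.get(fact.attribute)
        if earlier and earlier.value != fact.value:
            contradictions.append((earlier, fact))
        first_seen.setdefault(fact.attribute, fact)
    return contradictions

story = "At dawn the car was red. ... Fifty pages later, the car was blue."
for early, late in pair_and_check(scan(story)):
    print(f"Contradiction: '{early.sentence}' vs '{late.sentence}'")
```

The key design point the analogy highlights: the output is not just "Error!" but a pair of evidence spans, so a human can verify the contradiction directly.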

4. What They Found: The "Mid-Story Slump"

After testing dozens of AI models (from big companies like OpenAI and Google to open-source ones), they found some fascinating patterns:

  • The "Middle" is Dangerous: Errors don't happen evenly. They tend to cluster in the middle of the story (around the 40–60% mark). It's like a marathon runner who starts strong, hits a "wall" in the middle, and then tries to finish. The AI forgets the beginning while trying to write the middle.
  • The "Confusion" Signal: The researchers looked at the AI's internal uncertainty (measured as the entropy of its next-token probabilities). They found that right before the AI makes a mistake, it gets uncertain. It's like a person hesitating before speaking. If the AI is "guessing" too much between possible next words, it's about to mess up the facts.
  • The "Chain Reaction": If an AI gets the facts wrong (e.g., the car is red), it often messes up the character's memory too (e.g., the character remembers the car being blue). These errors tend to travel in packs.
  • The Best Performers: Currently, the model GPT-5-Reasoning is the best at keeping the story straight, followed closely by some Google and Anthropic models. But even the best ones still make mistakes.
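The "confusion signal" above has a simple mathematical core: Shannon entropy over the model's next-token probability distribution. The sketch below uses made-up probability values for illustration; a real setup would read these from the model's output layer.

```python
# Shannon entropy of a next-token distribution: higher entropy means the
# model is spreading probability across many options, i.e. "guessing".
# The probability values are invented for illustration.
import math

def entropy(probs: list[float]) -> float:
    """H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]  # one clear favorite next token
hesitant  = [0.25, 0.25, 0.25, 0.25]  # four equally likely options

print(f"confident: {entropy(confident):.2f} bits")  # ~0.24 bits
print(f"hesitant:  {entropy(hesitant):.2f} bits")   # 2.00 bits
```

In the paper's framing, a spike in this quantity mid-generation is the "hesitation" that tends to precede a consistency error.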

5. Why This Matters

Think of long-form storytelling like building a skyscraper.

  • Fluency is the paint and the windows.
  • Consistency is the steel beams holding it up.

If the beams are weak (the story contradicts itself), the whole building might look pretty, but it will collapse under its own weight. This paper gives us the tools to inspect the steel beams.

The Bottom Line:
AI is getting great at writing long stories, but it still suffers from "short-term memory loss" when the story gets too long. This research gives us a way to measure that memory loss, find exactly where it happens, and eventually teach the AI to be a more reliable storyteller who never forgets its own plot.