Critical Confabulation: Can LLMs Hallucinate for Social Good?

This paper proposes "critical confabulation," a framework where LLMs are guided to generate evidence-bound, speculative narratives to fill archival gaps regarding marginalized historical figures, demonstrating that controlled hallucinations can support ethical knowledge production without sacrificing historical fidelity.

Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So

Published 2026-03-09

Imagine you are trying to tell the story of a life, but someone has ripped out several pages from the middle of the biography. The pages are gone, and the history books are silent about what happened during those missing years. This is the reality for many "hidden figures" in history—people whose stories were erased by racism, poverty, or political violence.

This paper asks a bold question: Can we use Artificial Intelligence (AI) to fill in those missing pages in a way that is helpful, rather than just making things up?

Here is a breakdown of the authors' idea, using simple analogies.

1. The Problem: The "Silent Archive"

Think of history as a giant library. For centuries, the librarians (historians and governments) only kept books about powerful people. The stories of enslaved people, the poor, and marginalized communities were often thrown in the trash or never written down.

When historians try to tell these stories today, they hit "blank spots." They know a person existed, but they don't know what they did on a specific Tuesday in 1854. Traditional history says, "If there is no evidence, we can't say anything." But the authors argue that this silence hurts us. It leaves the "hidden figures" invisible.

2. The Solution: "Critical Confabulation"

Usually, when AI makes things up, we call it a "hallucination," and we treat it like a bug. If an AI says the moon is made of cheese, that's a failure.

But the authors propose a new concept called Critical Confabulation.

  • The Analogy: Imagine you are a detective trying to solve a cold case. You have a witness who remembers the suspect wore a red hat and ran down the street, but they forgot why the suspect ran.
  • The Old Way: The detective says, "I don't know, so I won't guess."
  • The Critical Confabulation Way: The detective uses all the known facts (red hat, running, the time of day, the neighborhood) to construct a plausible story of why the suspect ran. They aren't claiming it's 100% proven fact; they are offering a "best guess" narrative to help visualize the missing piece.

The authors want to use AI to do this for history. They want the AI to look at the "gaps" in the records and write a story that fits the context, helping us imagine what life might have been like for those erased people.

3. The Experiment: The "Fill-in-the-Blank" Test

To see if AI is good at this, the researchers created a test.

  • The Setup: They took real, unpublished historical documents about Black intellectuals and activists (people the AI likely hasn't seen before).
  • The Game: They took a timeline of a person's life and erased one event, replacing it with a black box: [MASKED].
  • The Task: They asked various AI models to guess what happened in that black box.
  • The Goal: Did the AI just make up nonsense? Or did it write a story that felt true to the character and the time period, even if it couldn't be 100% proven?
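The masking setup above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the authors' actual code: the helper names, the sample timeline, and the prompt wording are all invented here, and the optional `hint` parameter mirrors the "hints help" condition described in the results.

```python
# Illustrative sketch of the "fill-in-the-blank" test: take a timeline of
# dated life events, replace one with [MASKED], and build a prompt asking a
# model to reconstruct it, optionally with a hint about the event's type.

def mask_event(timeline, index):
    """Replace the event at `index` with [MASKED]; return the masked
    timeline and the held-out gold event."""
    gold = timeline[index]
    masked = timeline[:index] + ["[MASKED]"] + timeline[index + 1:]
    return masked, gold

def build_prompt(masked_timeline, hint=None):
    """Assemble a prompt asking the model to fill in the masked event."""
    lines = ["Timeline:"] + [f"- {event}" for event in masked_timeline]
    lines.append("One event has been replaced with [MASKED].")
    if hint:
        lines.append(f"Hint: the missing event concerns {hint}.")
    lines.append("Write a plausible, context-consistent account of the missing event.")
    return "\n".join(lines)

# A made-up example timeline (not from the paper's data):
timeline = [
    "1918: born in Chicago",
    "1936: enrolls at a teachers' college",
    "1941: begins writing for a local newspaper",
    "1947: moves to New York",
]
masked, gold = mask_event(timeline, 2)
prompt = build_prompt(masked, hint="a job change")
print(prompt)
```

The model's completion would then be judged not on exact-match recall but on whether it fits the surrounding events and the period, which is what the evaluation in the next section measures.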

4. The Results: AI Can Be a "Creative Historian"

The results were surprising and promising:

  • It's hard, but possible: The AI didn't get it right every time (only about 50-60% of the time), but it was often able to generate stories that felt "right" and matched the tone of the era.
  • Hints help: When the researchers told the AI, "This missing event was about a job change" or "This was about a family argument," the AI got much better at guessing.
  • Smaller models hold their own: Some smaller, open-source AI models performed just as well as the massive, expensive ones. This is great news because it means the tool could be accessible to more researchers.
  • No cheating: The researchers made sure the AI hadn't just memorized the answers from its training data. They used a "detective" method to ensure the AI was actually reasoning, not just reciting facts it already knew.

5. Why This Matters

The authors aren't saying AI should replace historians. Instead, they see AI as a co-pilot for storytelling.

  • For Historians: It's a tool to brainstorm ideas. If an AI suggests, "Maybe this person attended a secret meeting in 1920," a human historian can then go look for evidence to prove or disprove it. It turns a blank page into a hypothesis to investigate.
  • For Society: It helps "re-humanize" history. Instead of just seeing a list of dates and names, we get a narrative that helps us understand the lives of people who were systematically silenced.

The Bottom Line

Think of this paper as a new kind of archaeological brush. For a long time, we thought AI's tendency to "make things up" was a flaw. This paper argues that if we use that "making up" skill carefully—grounded in real facts and ethical boundaries—it can actually help us recover lost voices and fill in the gaps of our collective memory.

It's not about replacing the truth with fiction; it's about using imagination to find the truth that was hidden in the silence.