Imagine you have a super-smart, well-read robot (a Large Language Model, or LLM) that has read almost everything ever written on the internet. You ask it a question like, "How will Trump's new trade policies affect Japan's economy?"
The robot doesn't just give you one answer. Instead, it writes 100 different stories (documents) about this topic. Each story is slightly different, using different words to describe the same ideas. One story might say, "The US raises taxes on imports," while another says, "Tariffs on foreign goods go up," and a third says, "Protectionism gets stricter."
To a computer, these are three totally different things. But to a human, they are basically the same event.
This paper proposes a clever five-step recipe to turn those 100 messy stories into a clear, visual map of "what causes what" according to the robot's knowledge.
Here is the recipe, explained with some everyday analogies:
1. The "Story Generator" (Step i)
First, we ask the robot to write many short stories about a specific topic. Think of this like asking a room full of 100 different journalists to write a headline about the same news event. You get a lot of variety, but also a lot of repetition.
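The sampling loop above can be sketched in a few lines. This is a minimal sketch, not the paper's actual code: `llm_generate` is a hypothetical stand-in for whatever chat/completion API you use, and the prompt wording is an assumption.

```python
# Sketch of Step i: sample many short "stories" about one topic.
# `llm_generate` is a hypothetical placeholder for a real LLM API call.
def llm_generate(prompt: str, seed: int) -> str:
    # A real call would return a model-written paragraph; this stub
    # just returns a labeled dummy string so the sketch runs.
    return f"Story {seed}: a model-written paragraph about the prompt."

def generate_stories(topic: str, n: int = 100) -> list[str]:
    prompt = f"Write a short news-style paragraph about: {topic}"
    # Varying the seed (or sampling temperature) yields varied phrasings
    # of the same underlying beliefs -- the "100 journalists" effect.
    return [llm_generate(prompt, seed=i) for i in range(n)]

stories = generate_stories("new US trade policy and Japan's economy", n=3)
print(len(stories))  # 3
```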
2. The "Event Hunter" (Step ii)
Next, we go through each story and pull out the specific "events" mentioned.
- Story A: "The Fed raised rates."
- Story B: "Interest rates went up."
- Story C: "Monetary policy tightened."
We collect all these sentences into a giant pile. Right now, it's a messy pile of sticky notes with different handwriting.
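A toy version of the event hunter might look like this. Real pipelines typically prompt the LLM itself ("list the events mentioned in this text"); as a runnable stand-in, this sketch simply treats each sentence as one candidate event.

```python
# Sketch of Step ii: pull candidate event mentions out of each story.
# Toy assumption: one sentence = one candidate event. A real system
# would use an LLM or an event-extraction model instead.
def extract_events(story: str) -> list[str]:
    return [s.strip() for s in story.split(".") if s.strip()]

stories = [
    "The Fed raised rates. Markets fell.",
    "Interest rates went up.",
]
# The "messy pile of sticky notes": every mention from every story.
pile = [event for story in stories for event in extract_events(story)]
print(pile)
# ['The Fed raised rates', 'Markets fell', 'Interest rates went up']
```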
3. The "Translator & Sorter" (Step iii) — The Most Important Step
This is the magic trick. The robot is great at writing, but terrible at realizing that "raising rates" and "interest rates up" are the same thing. If we don't fix this, our map will be a tangled mess.
So, we use a two-part system:
- The Semantic Sorter: We use a tool that understands the meaning of words (like a translator who knows that "big" and "huge" mean the same thing). It groups similar sticky notes together.
- The Human-Like Editor: Once the notes are grouped, we ask the LLM to give each group a single, clean name.
- Group: "Rates up," "Fed hike," "Interest rates higher."
- New Name: "Interest Rate Hike."
Now, instead of 100 different phrases, we have a clean list of about 20 or 30 unique "Canonical Events." It's like turning a chaotic pile of ingredients into a neat, labeled spice rack.
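The sorter-plus-editor step can be sketched as follows. Real systems embed each phrase with a sentence encoder, cluster the vectors, and then ask the LLM to name each cluster; to keep this runnable without any model, the sketch substitutes word-overlap (Jaccard) similarity for embeddings and "longest member" for the LLM-chosen name. Both substitutions are toy assumptions.

```python
# Sketch of Step iii: group near-duplicate event phrases, then label groups.
def jaccard(a: str, b: str) -> float:
    # Word-overlap similarity: a crude stand-in for embedding similarity.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster(phrases: list[str], threshold: float = 0.3) -> list[list[str]]:
    groups: list[list[str]] = []
    for p in phrases:
        for g in groups:
            if jaccard(p, g[0]) >= threshold:
                g.append(p)  # close enough to the group's first member
                break
        else:
            groups.append([p])  # start a new group
    return groups

phrases = ["rates up", "interest rates up", "fed hike", "tariff increase"]
groups = cluster(phrases)
# A real pipeline would ask the LLM for a clean canonical name per group;
# taking the longest member is just a placeholder for that step.
canonical = [max(g, key=len) for g in groups]
print(canonical)  # ['interest rates up', 'fed hike', 'tariff increase']
```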
4. The "Scorecard" (Step iv)
Now we create a giant spreadsheet (a matrix).
- Rows: The 100 stories.
- Columns: The 30 clean event names (like "Interest Rate Hike," "Tariff Increase," "Oil Price Spike").
- The Cells: We put a "1" if the story mentions that event, and a "0" if it doesn't.
Suddenly, we have a clean, organized dataset. We've turned 100 paragraphs of text into a simple grid of numbers.
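Building that grid is the simplest step to sketch. Assuming each story has already been tagged with the canonical events it mentions (the tags below are made-up examples), the matrix is just a membership test:

```python
# Sketch of Step iv: one row per story, one column per canonical event,
# cell = 1 if the story mentions that event, 0 otherwise.
events = ["Interest Rate Hike", "Tariff Increase", "Oil Price Spike"]
story_tags = [
    {"Interest Rate Hike"},                     # story 1
    {"Tariff Increase", "Oil Price Spike"},     # story 2
    {"Interest Rate Hike", "Tariff Increase"},  # story 3
]
matrix = [[1 if e in tags else 0 for e in events] for tags in story_tags]
for row in matrix:
    print(row)
# [1, 0, 0]
# [0, 1, 1]
# [1, 1, 0]
```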
5. The "Detective" (Step v)
Finally, we hand this spreadsheet to a "Causal Detective" (a mathematical algorithm). The detective looks at the patterns:
- "Hey, every time the story mentions 'Tariff Increase,' it also mentions 'Supply Chain Delay'."
- "But 'Interest Rate Hike' usually happens before 'Stock Market Drop'."
The detective draws a map (a graph) showing arrows connecting these events.
- Arrow: Tariff Increase ➔ Supply Chain Delay
- Arrow: Interest Rate Hike ➔ Stock Market Drop
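A toy detective can be sketched from the 0/1 matrix alone. This is only an illustration: it proposes an edge A ➔ B when nearly every story mentioning A also mentions B, which captures co-occurrence but cannot truly orient cause and effect; that is exactly why the real pipeline hands the matrix to a proper causal-discovery algorithm instead. The thresholds below are arbitrary.

```python
# Sketch of Step v: propose an edge A -> B when stories that mention A
# almost always mention B too. A real system would run a causal-discovery
# algorithm on the 0/1 matrix; this is just a co-occurrence heuristic.
def propose_edges(matrix, events, min_support=2, min_conf=0.9):
    edges = []
    for a in range(len(events)):
        with_a = [row for row in matrix if row[a] == 1]
        if len(with_a) < min_support:
            continue  # too few stories mention A to say anything
        for b in range(len(events)):
            if a == b:
                continue
            conf = sum(row[b] for row in with_a) / len(with_a)
            if conf >= min_conf:
                edges.append((events[a], events[b]))
    return edges

events = ["Tariff Increase", "Supply Chain Delay"]
# Made-up data: every tariff story mentions delays, but not vice versa.
matrix = [[1, 1], [1, 1], [0, 1], [1, 1]]
print(propose_edges(matrix, events))
# [('Tariff Increase', 'Supply Chain Delay')]
```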
What is the Result?
The final output is a Hypothesis Map.
It is not a map of reality. It is a map of what the robot believes is true based on all the data it has read.
- The Catch: The robot might be wrong. Maybe in the real world, tariffs don't cause supply chain delays immediately. But the robot thinks they do because it read that in many books.
- The Value: This map gives human experts a starting point. Instead of guessing what the robot knows, we can look at the map and say, "Ah, the robot thinks A causes B. Let's check if that's actually true in the real world."
The Big Picture Analogy
Imagine you are trying to understand how a complex machine works, but you can't open the hood. Instead, you have 100 different mechanics (the LLM) who have all looked at the machine and written down what they think happens when you turn the key.
Their notes are messy and use different slang.
- You collect all the notes.
- You translate their slang into a standard technical language.
- You organize the notes into a checklist.
- You ask a logic machine to draw a diagram of how the mechanics think the machine works.
The result isn't the actual engine blueprint, but it's a very good guess at what the engine might look like, which helps the real engineers know where to start their investigation.
In short: This paper teaches us how to turn a robot's messy, wordy stories into a clean, visual diagram of "cause and effect," so humans can inspect the robot's logic and decide what to trust.