Imagine you are trying to understand a massive, chaotic storm. In the world of economics, this storm is inflation. But instead of just looking at the rain (the numbers), economists want to understand the story of the storm: Why did it start? Did a broken dam cause it? Did a sudden heatwave dry up the rivers?
This paper is about how to tell those stories clearly, accurately, and consistently, even when different people tell them slightly differently.
Here is the breakdown of the research, using some everyday analogies.
1. The Problem: Everyone Tells the Story Differently
When you ask five different people to summarize a news article about why prices are rising, you'll get five slightly different versions.
- Person A might focus on "supply chain issues" (like trucks getting stuck).
- Person B might focus on "government spending" (like printing too much money).
- Person C might draw a map connecting these ideas, while Person D just lists them.
In the world of computers (Natural Language Processing), this is a nightmare. If a computer is trying to learn from these stories, it gets confused: "Wait, is 'truck stuck' the same as 'supply chain'? And why did Person C draw a line between them but Person D didn't?"
This confusion is called Human Label Variation (HLV). It's not that anyone is "wrong"; it's just that humans interpret complex stories in different, valid ways.
2. The Solution: A "Detective's Notebook" (Qualitative Content Analysis)
The researchers realized that standard computer methods (which usually just slap a single label on a text) weren't good enough for complex stories. So, they borrowed a tool from social scientists called Qualitative Content Analysis (QCA).
Think of QCA as a detective's notebook rather than a multiple-choice quiz.
- The Old Way: "Is this about inflation? Yes/No."
- The QCA Way: The researchers created a detailed, evolving rulebook. They started with a list of suspects (categories like "Energy Prices," "Labor Shortages," "War"). As they read articles, they realized some clues didn't fit the old list. So, they held group meetings, argued, refined the rules, and added new categories (like "Climate Crisis" or "Education Costs").
This process ensured that everyone (the "detectives") was looking for the same clues in the same way, reducing mistakes before they even started.
3. The Map: Turning Stories into Graphs
Instead of just writing a summary, the researchers turned these stories into maps (called Directed Acyclic Graphs, or DAGs).
- Nodes (Dots): These are the events (e.g., "Oil Prices Went Up").
- Edges (Lines): These are the arrows showing cause and effect (e.g., "Oil Prices Went Up" → "Inflation Increased").
Imagine a "Choose Your Own Adventure" book where you draw lines connecting the choices. The goal was to see if different people would draw the same map when reading the same article.
4. The Experiment: How Strict Should We Be?
The researchers ran a big experiment to figure out how to measure whether two people drew the same map. They tested three different "rulers" (distance metrics), sketched in code after this list:
- The "Loose" Ruler (Lenient): "Did you mention any of the same dots?"
- Result: This gave high scores, but it was a lie. It was like saying two maps are identical just because they both have a dot for "New York," even if one map is of the US and the other is of Europe. It overestimated agreement.
- The "Strict" Ruler (Strict): "Did you draw the exact same map with the exact same lines?"
- Result: This was too harsh. Even if two people understood the story perfectly, if one person drew a tiny extra line, the score crashed. It punished valid differences in storytelling.
- The "Middle" Ruler (Moderate): "How much of the map overlaps?"
- Result: This was the sweet spot.
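Here is a rough code sketch of the three rulers, assuming (as a simplification) that "loose" means node overlap, "strict" means an exact match of dots and lines, and "moderate" means edge overlap measured Jaccard-style; the paper's actual metric definitions may be more involved.

```python
import networkx as nx

def lenient_score(g1: nx.DiGraph, g2: nx.DiGraph) -> float:
    """'Loose' ruler: share of dots (nodes) the two maps have in common."""
    n1, n2 = set(g1.nodes()), set(g2.nodes())
    return len(n1 & n2) / len(n1 | n2) if (n1 | n2) else 1.0

def strict_score(g1: nx.DiGraph, g2: nx.DiGraph) -> float:
    """'Strict' ruler: full credit only if both maps are exactly identical."""
    same = set(g1.nodes()) == set(g2.nodes()) and set(g1.edges()) == set(g2.edges())
    return 1.0 if same else 0.0

def moderate_score(g1: nx.DiGraph, g2: nx.DiGraph) -> float:
    """'Middle' ruler: how much the arrows (edges) overlap, Jaccard-style."""
    e1, e2 = set(g1.edges()), set(g2.edges())
    return len(e1 & e2) / len(e1 | e2) if (e1 | e2) else 1.0

# Two annotators who agree on the core arrow but bring in different background causes.
a = nx.DiGraph([("Oil prices up", "Inflation up"), ("War", "Oil prices up")])
b = nx.DiGraph([("Oil prices up", "Inflation up"), ("Gov. spending", "Inflation up")])

print(round(lenient_score(a, b), 2))   # 0.5  -> looks like decent agreement
print(strict_score(a, b))              # 0.0  -> one different arrow crashes it
print(round(moderate_score(a, b), 2))  # 0.33 -> partial overlap, the middle ground
```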
The Big Discovery:
They found that if you try to map the entire story (every single detail), people disagree a lot. But if you zoom in and only map the immediate neighbors (the events directly causing inflation), the maps look very similar; a short code sketch after the analogy below shows the idea.
- Analogy: If you ask people to draw the whole history of the universe, they will disagree on the details. But if you ask them to draw "What happened right before the cake burned?", they will all draw a very similar picture.
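A minimal sketch of that "zoom in" step, again with networkx and made-up event names: keep only the inflation node plus the events directly connected to it, then compare. The exact neighbourhood definition used in the paper may differ.

```python
import networkx as nx

def adjacent_story(g: nx.DiGraph, target: str) -> nx.DiGraph:
    """Keep only the target event and the events directly linked to it."""
    keep = {target} | set(g.predecessors(target)) | set(g.successors(target))
    return g.subgraph(keep).copy()

# Two annotators who disagree about the deep background...
a = nx.DiGraph([("War", "Oil prices up"), ("Oil prices up", "Inflation up")])
b = nx.DiGraph([("Sanctions", "Oil prices up"), ("Oil prices up", "Inflation up")])

# ...but agree once we only look at what directly touches inflation.
a_core = adjacent_story(a, "Inflation up")
b_core = adjacent_story(b, "Inflation up")
print(set(a_core.edges()) == set(b_core.edges()))  # True
```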
5. The Takeaway: "Good Enough" is Better Than "Perfect"
The paper concludes with a practical guide for anyone trying to analyze news stories with computers:
- Don't trust the "Loose" ruler: Just because two people mention the same words doesn't mean they agree on the story.
- Focus on the core: To get reliable results, focus on the immediate causes (the "Adjacent Story") rather than trying to capture every single background detail.
- Embrace the mess: It's okay that humans interpret stories differently. The goal isn't to force everyone to think exactly alike, but to understand where and why they differ.
In short: The researchers built a better way to turn messy human news stories into clean computer data. They learned that if you keep the map simple and focus on the direct causes, everyone agrees much more easily. This helps computers learn to understand economic stories without getting confused by the natural differences in human perspective.