Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

Imagine the United Nations Security Council as a massive, ancient library. Inside, there are thousands of documents (resolutions) written over the last 80 years. Some are crisp and modern, but many are old, typed on clunky typewriters, scanned into computers, and then read by robotic eyes (OCR) that often make mistakes. Some pages even have two columns of text side-by-side, making it look like a jumbled puzzle where English and French are mixed up on the same line.

The goal of this paper is to teach a team of AI robots (Large Language Models or LLMs) how to clean up this messy library and put color-coded sticky notes (semantic tags) on the important parts, like "Who," "Where," "When," and "What happened."

Here is a simple breakdown of how the author did it:

1. The Problem: The "Noisy" Documents

Think of the old UN documents as a broken radio. The signal is there, but it's full of static, missing words, and weird breaks. If you try to read them, it's hard to understand.

The Challenge: If you ask a single AI to fix this, it might get confused. Sometimes it might accidentally delete a word (under-generate) or invent a new sentence that never existed (hallucinate/over-generate). We need the AI to be a faithful librarian, not a creative writer.

2. The Solution: The "AI Ensemble" (The Panel of Judges)

Instead of trusting just one AI robot to do the job, the researcher set up a panel of judges.

They used different versions of AI (some big and powerful like GPT-4.1, some small and cheap like GPT-4.1-mini).
Each robot was asked to do the whole job (fixing the text and adding the tags) twice.
This created a "crowd" of different attempts. Some might be perfect, some might be slightly off.

3. The Scorecard: How to Pick the Winner

How do you know which robot did the best job without a human reading every single page? The author invented two simple scorecards:

Score 1: The "Copy-Paste" Score (Content Preservation Ratio)
- Analogy: Imagine you are photocopying a rare, fragile painting. You want the copy to look exactly like the original, just with a frame added.
- The Metric: The system checks: "Did the AI accidentally erase a word? Did it add a word that wasn't there?" It counts how many letter-pairs (like "th" or "in") appear as often in the output as in the original. If the score is high, the AI was faithful. If it's low, the AI changed the original text.
Score 2: The "Neatness" Score (Tag Well-Formedness)
- Analogy: Imagine wrapping a gift. You need to make sure every piece of tape has a matching end, and you don't leave the box half-open.
- The Metric: The system checks if the AI closed its tags correctly (e.g., <date> must have a </date> at the end). If the AI leaves a tag open or mixes them up, the score drops.

4. The Process: The "Tournament"

For every single document:

1. Run the race: All the AI models run the task (cleaning and tagging) multiple times.
2. Check the scorecards: The system calculates the "Copy-Paste" and "Neatness" scores for every attempt.
3. Pick the champion: The system automatically selects the single best output from the crowd.
4. Save it: That top-scoring version is saved, and the messy original is replaced.

5. The Big Discovery: You Don't Need the Most Expensive Robot

The most exciting part of the paper is the cost vs. performance finding.

Usually, people think you need the biggest, most expensive AI (the "Super Robot") to get the best results.
The Reality: The researcher found that a smaller, cheaper robot (GPT-4.1-mini) could do almost the exact same job as the big one, but for only 20% of the cost.
Analogy: It's like realizing you don't need a Formula 1 race car to drive to the grocery store; a small, efficient electric car gets you there just as well and costs a fraction as much to run.

6. Why This Matters

By using this "Ensemble" method (using a team of AIs and a strict scorecard), the researcher created a reliable, automated system that can:

Clean up 80 years of messy history.
Turn unstructured text into a structured Knowledge Graph (a giant digital map connecting people, places, and events).
Do it cheaply enough to be used for thousands of documents, not just a few.

In a nutshell: This paper teaches us how to build a self-checking team of AI robots that can clean up messy historical documents and organize them reliably, without wasting money on the most expensive tools or making up fake facts. It's about using math and teamwork to make AI trustworthy.

Here is a detailed technical summary of the paper "Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions" by Hussein Ghaly.

1. Problem Statement

The paper addresses the challenge of converting historical, unstructured, and noisy United Nations Security Council (UNSC) resolutions (dating from 1946 to 2025) into machine-readable, semantically tagged data suitable for Knowledge Graph construction.

Specific Challenges:

Data Noise: Early resolutions were typed, scanned, and processed via Optical Character Recognition (OCR), resulting in typographical errors, OCR artifacts, and formatting issues.
Complex Formatting: Documents prior to the 1980s often utilized a two-column layout where the second column contained French translations of the English text in the first column. Standard NLP tools struggle to parse these lines correctly, often merging unrelated text chunks.
LLM Variability: Large Language Models (LLMs) are stochastic; the same prompt can yield different outputs due to temperature settings. This makes it difficult to guarantee consistency and faithfulness to the original text, that is, output free of both hallucinated additions and omissions.
Measurability: There is a lack of robust metrics to evaluate which LLM output is "best" for transformation tasks where the goal is to preserve the original content while adding specific tags.

2. Methodology

The author proposes an Ensemble LLM Pipeline that leverages model variability to select the optimal output based on custom evaluation metrics. The process is divided into two sequential tasks:

A. Data Cleaning

Goal: Convert two-column OCR text into a single-column format, correct OCR errors, remove end-of-line hyphenation and stray line breaks, and separate the English and French texts.
Prompt Strategy: A fixed prompt instructs the model to "Include answer only" and perform specific cleaning operations without altering the semantic content.
Ensemble Approach: The system runs the cleaning prompt on a single document multiple times (2 runs per model) across 7 different models (ranging from GPT-4.1 to GPT-5.1, including mini and nano variants).
Selection: For each document, the output with the highest Content Preservation Ratio (CPR) is selected as the final cleaned text.
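
The run-and-select loop described above can be sketched as follows. This is a minimal illustration, not the author's code: `run_ensemble`, `generate`, and `score` are hypothetical names, the LLM call is injected as a plain function so that no particular API is assumed, and the scoring function (CPR in the paper) is passed in as well.

```python
from typing import Callable, Iterable


def run_ensemble(
    document: str,
    models: Iterable[str],
    generate: Callable[[str, str], str],   # hypothetical: (model, text) -> output
    score: Callable[[str, str], float],    # hypothetical: (input, output) -> score
    runs_per_model: int = 2,
) -> tuple[str, str, float]:
    """Return (model, output, score) for the best-scoring candidate."""
    candidates = []
    for model in models:
        # Repeated runs exploit the model's stochasticity: each call may
        # produce a different output for the same prompt.
        for _ in range(runs_per_model):
            output = generate(model, document)
            candidates.append((model, output, score(document, output)))
    # max() keeps the earliest candidate when scores tie.
    return max(candidates, key=lambda c: c[2])
```

With 7 models and 2 runs each, this would score 14 candidates per document and keep one. Passing the model call in as a parameter also makes the selection logic easy to test with canned outputs.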

B. Semantic Tagging

Goal: Annotate the cleaned text with XML tags for specific entities: <location>, <entity>, <event>, <organization>, and <date>.
Constraint: The model must preserve the input text exactly, adding only the tags.
Ensemble Approach: Similar to cleaning, multiple runs are performed across the 7 models.
Selection: Outputs are sorted by a hierarchy of metrics:
1. Content Preservation Ratio (CPR): Primary filter for faithfulness.
2. Tag Well-Formedness (TWF): Ensures valid XML structure (matching open/close tags).
3. Number of Tags (nT): Optimizes for recall (identifying as many entities as possible).
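
One way to read this hierarchy is as a lexicographic comparison, sketched below. The summary does not say whether the paper uses a strict ordering or thresholds, so treat the tuple key as an assumption; `Candidate` and `pick_best` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    text: str
    cpr: float    # Content Preservation Ratio, 0..1
    twf: float    # Tag Well-Formedness, 0..1
    n_tags: int   # number of tags found


def pick_best(candidates: list[Candidate]) -> Candidate:
    # Tuples compare left to right: CPR decides first, TWF breaks CPR ties,
    # and the tag count breaks any remaining ties (assumed ordering).
    return max(candidates, key=lambda c: (c.cpr, c.twf, c.n_tags))
```

Because exact ties between floating-point CPR values are rare, a practical variant might round CPR and TWF before comparing so that the later criteria actually come into play; that refinement is a design suggestion, not something stated in the paper.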

C. Evaluation Metrics

To overcome the lack of a "Gold Standard" dataset, the paper introduces two novel metrics:

Content Preservation Ratio (CPR): Measures the similarity between input and output character bigram frequencies.
- Formula: $CPR = \frac{\sum_b c_{in}(b) - \sum_b |c_{in}(b) - c_{out}(b)|}{\sum_b c_{in}(b)}$, where $c_{in}(b)$ and $c_{out}(b)$ are the counts of character bigram $b$ in the input and the output.
- It penalizes both omissions (lower frequency in output) and additions (higher frequency in output).
Tag Well-Formedness (TWF): The proportion of tags that form correctly closed pairs, as opposed to malformed ones (e.g., unclosed tags or mismatched nesting).
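
Both metrics are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation. For CPR it computes the total input bigram count minus the summed absolute count differences, divided by the total input bigram count; summing over the union of input and output bigrams and clamping at zero are assumptions. For TWF, the exact formula is not given in this summary, so the sketch assumes "share of tags that belong to a properly nested open/close pair". When scoring tagged output, the caller would presumably strip the tags before computing CPR.

```python
import re
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count overlapping character bigrams, e.g. 'the' -> {'th': 1, 'he': 1}."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cpr(source: str, output: str) -> float:
    """Content Preservation Ratio on character bigram counts (assumed form)."""
    c_in, c_out = bigram_counts(source), bigram_counts(output)
    total = sum(c_in.values())
    if total == 0:
        return 1.0 if not c_out else 0.0
    # Counter returns 0 for missing keys, so omissions and additions
    # both contribute to the difference.
    diff = sum(abs(c_in[b] - c_out[b]) for b in c_in.keys() | c_out.keys())
    return max(0.0, (total - diff) / total)  # clamp: an assumption


# Matches simple tags without attributes, such as <date> and </date>.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def twf(tagged: str) -> float:
    """Share of tags that belong to a properly nested pair (assumed form)."""
    stack, matched, total = [], 0, 0
    for m in TAG_RE.finditer(tagged):
        total += 1
        is_closing, name = m.group(1) == "/", m.group(2)
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
            matched += 2  # count the opener and its closer
        # Any other closer is malformed and adds nothing to `matched`.
    return matched / total if total else 1.0
```

For instance, `cpr("aaaa", "aaa")` loses one of three `aa` bigrams and scores 2/3, while `twf("<date>1990")` scores 0.0 because the only tag is never closed.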

3. Key Contributions

Novel Evaluation Metrics: Introduction of CPR and TWF to objectively measure LLM performance in text transformation tasks, specifically addressing hallucination and structural integrity.
Ensemble Selection Strategy: A methodology that uses model stochasticity (multiple runs) and metric-based selection to derive the most reliable output, rather than relying on a single model run.
Cost-Accuracy Trade-off Analysis: Empirical evidence demonstrating that smaller, cheaper models (e.g., GPT-4.1-mini) can achieve performance comparable to larger models at a fraction of the cost.
Dataset Creation: The generation of a semantically annotated corpus of UNSC resolutions, serving as a foundation for Knowledge Graph construction.

4. Results

The system was tested on a sample of 10 documents.

Cleaning Task:
- Best Performance: GPT-4.1 achieved the highest CPR (84.9%).
- Cost Efficiency: GPT-4.1-mini achieved a CPR of 83.5% at only 20% of the cost of the best model ($0.0028 vs $0.0139 per document).
- Observation: Smaller "nano" models were faster but performed significantly worse in cleaning noisy text.
Semantic Tagging Task:
- Best Performance: GPT-4.1 and GPT-5.1 showed the highest metrics.
  - GPT-4.1: CPR 99.99%, TWF 99.92%, ~92.6 tags found.
  - GPT-5.1: CPR 99.95%, TWF 99.91%, ~93.1 tags found.
- Cost Efficiency: GPT-4.1-mini achieved comparable CPR (99.92%) and TWF (99.64%) at ~19% of the cost of GPT-4.1, though it identified fewer tags (~80.1).
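
As a quick check of the cleaning-task cost figures reported above (dollars per document for GPT-4.1-mini and GPT-4.1), the ratio and the implied saving work out as follows. No dollar figures are given here for the tagging task, so only the cleaning costs are used.

```latex
\frac{0.0028}{0.0139} \approx 0.201 \approx 20\%,
\qquad
1 - \frac{0.0028}{0.0139} \approx 0.799 \approx 80\%
```

The price of that saving on the cleaning task is a drop in CPR from 84.9% to 83.5%, about 1.4 percentage points.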

5. Significance and Future Work

Scalability: The approach provides a scalable, automated solution for processing massive archives of historical legal documents that were previously too noisy for standard NLP.
Economic Impact: The findings suggest that for data-intensive tasks, organizations can reduce LLM inference costs by up to 80% by using smaller models within an ensemble framework, without a significant loss of accuracy.
Knowledge Graphs: The output (structured XML) is directly compatible with standards like Akoma Ntoso, enabling the construction of robust Knowledge Graphs representing UN mandates, entities, and events.
Future Directions: The author proposes evolving the ensemble from a "single-best-output" selector to a true integrator that merges outputs from multiple models to increase confidence in tag identification (consensus voting) and further aligning the tagging schema with strict Akoma Ntoso XML schemas.
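
The consensus-voting idea could look roughly like the sketch below. This is speculative: the paper only proposes the direction, and the function name `consensus_tags`, the majority threshold, and the choice to vote on (tag, text) pairs regardless of position are all assumptions made for illustration.

```python
import re
from collections import Counter

# The five tag names come from the tagging task; nested tags are not handled.
TAGGED_SPAN = re.compile(
    r"<(location|entity|event|organization|date)>(.*?)</\1>", re.DOTALL
)


def consensus_tags(outputs: list[str], min_share: float = 0.5) -> set[tuple[str, str]]:
    """Keep (tag, text) pairs found in more than `min_share` of the outputs."""
    votes: Counter = Counter()
    for out in outputs:
        # set() so that one output votes at most once per (tag, text) pair.
        votes.update(set(TAGGED_SPAN.findall(out)))
    needed = min_share * len(outputs)
    return {pair for pair, n in votes.items() if n > needed}
```

With three outputs and the default threshold, a pair needs at least two votes to survive, so a tag produced by only one model run would be dropped. A fuller integrator would also need to track where each span occurs in the text, which this sketch ignores.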

In conclusion, this paper demonstrates that by combining ensemble techniques with rigorous, task-specific metrics, LLMs can be effectively deployed for high-stakes data-cleaning and semantic-extraction tasks on challenging historical datasets.