Imagine you are trying to teach a robot to read a specific language—Luxembourgish. This language is spoken by about 400,000 people, but in the world of Artificial Intelligence (AI), it's considered "under-resourced." Think of it like a small, remote village that hasn't been mapped by the big GPS companies yet. To teach the robot, you need a massive library of books where the important names (like people, places, and organizations) are already highlighted and labeled. But nobody has written these books for Luxembourgish yet, and hiring humans to write them is expensive and slow.
This paper introduces a clever, three-step recipe to build that library automatically, using a mix of Wikipedia, Wikidata, and AI Judges.
Here is the story of how they did it, broken down into simple concepts:
1. The Problem: The "Empty Bookshelf"
For big languages like English or French, we have huge libraries of pre-labeled data. For Luxembourgish, the bookshelf is almost empty. The authors wanted to fill it up without hiring an army of human linguists.
2. Step One: The "Scavenger Hunt" (Distant Supervision)
Instead of writing new sentences from scratch, the team went to the Luxembourgish Wikipedia.
- The Analogy: Imagine you are looking for clues on a treasure map. In Wikipedia, whenever a word is a link (the blue text you can click to jump to that name's own article), it's a gold clue.
- The Trick: They used a tool to find every linked name in the articles. Then, they checked Wikidata (a giant database of facts) to see what that link actually points to.
- If the link goes to a person's page, they tag it PER (Person).
- If it goes to a city, they tag it LOC (Location).
- If it goes to a company, they tag it ORG (Organization).
- The Result: They quickly generated thousands of sentences with labels. But, just like a scavenger hunt, some clues were misleading. Some links were broken, or the context was weird. The data was "noisy."
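The scavenger hunt above can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: a real system would query Wikidata for each link target, while here a tiny hand-made lookup table (with invented example entries) stands in for it so the sketch is self-contained.

```python
# Hypothetical mapping from Wikipedia link targets to coarse entity
# types, standing in for a real Wikidata lookup.
WIKIDATA_TYPE = {
    "Lëtzebuerg": "LOC",            # a country page -> Location
    "Jean-Claude Juncker": "PER",   # a person page  -> Person
    "ArcelorMittal": "ORG",         # a company page -> Organization
}

def tag_sentence(tokens, links):
    """Assign BIO-style NER tags using link targets as distant labels.

    tokens : list of words in the sentence
    links  : dict mapping (start, end) token spans to a link target
    """
    tags = ["O"] * len(tokens)
    for (start, end), target in links.items():
        label = WIKIDATA_TYPE.get(target)
        if label is None:
            continue  # unknown or broken link: left untagged (one source of noise)
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# A made-up Luxembourgish sentence with two linked spans.
tokens = ["Den", "Jean-Claude", "Juncker", "war", "Premier", "vu", "Lëtzebuerg", "."]
links = {(1, 3): "Jean-Claude Juncker", (6, 7): "Lëtzebuerg"}
print(tag_sentence(tokens, links))
# → ['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O']
```

Notice the `continue` branch: whenever a link target is missing from the lookup, the span silently stays unlabeled. That is exactly the kind of "misleading clue" that makes the raw data noisy.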
3. Step Two: The "Strict Editors" (LLM-as-a-Judge)
This is the most innovative part. They didn't hire humans to check every single sentence. Instead, they asked Large Language Models (LLMs) to act as editors.
- The Analogy: Imagine you have a stack of 75,000 essays written by students. You can't read them all. So, you hire a super-smart AI (the "Judge") to read them and say, "Keep this one, it's good," or "Throw this one away, it's nonsense."
- The Experiment: They tested many different AI judges (some made by OpenAI, some by Google, some open-source). They asked the AI: "Look at this sentence and its labels. Is the labeling correct? Yes or No?"
- The Winner: They found that the most advanced AI models (like GPT-5) were surprisingly good at this. They agreed with human experts about 62% of the time. That's close enough to say, "Okay, this AI is a reliable editor."
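The judging step can be sketched as a simple filter loop. The prompt wording below is illustrative (not the authors' exact prompt), and `ask_judge` is a stub standing in for a real API call to a model like GPT-5, with a toy rule so the sketch runs offline.

```python
def build_prompt(sentence, labels):
    """Illustrative yes/no prompt shown to the judge model."""
    return (
        "Look at this Luxembourgish sentence and its entity labels.\n"
        f"Sentence: {sentence}\n"
        f"Labels: {labels}\n"
        "Is the labeling correct? Answer Yes or No."
    )

def ask_judge(prompt):
    # Stand-in for a call to an LLM API. This toy rule rejects
    # examples with an empty label list, mimicking a judge
    # throwing away useless sentences.
    return "No" if "Labels: []" in prompt else "Yes"

def filter_dataset(examples):
    """Keep only the sentence/label pairs the judge accepts."""
    kept = []
    for sentence, labels in examples:
        if ask_judge(build_prompt(sentence, labels)) == "Yes":
            kept.append((sentence, labels))
    return kept

noisy = [
    ("Den Juncker war Premier.", [("Juncker", "PER")]),
    ("Eng Säit ouni Entitéiten.", []),  # noisy example: nothing labelled
]
clean = filter_dataset(noisy)
print(len(clean))  # → 1: the judge keeps only the first example
```

The key design point is that the judge only answers a binary question. Verifying a label is an easier task than producing one, which is why (as the paper's results suggest) judging works better than generating.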
4. Step Three: The Final Library (The judgeWEL Dataset)
After the AI editors filtered out the bad sentences, they were left with a clean, high-quality dataset called judgeWEL.
- It has 28,866 sentences.
- It is 5 times larger than the previous best dataset for Luxembourgish.
- It covers a wide variety of topics, not just news.
5. Did it Work? (The Test Drive)
The authors took this new library and taught different AI models to recognize names in Luxembourgish.
- The Result: The models trained on this new, AI-cleaned library performed almost as well as models trained on human-labeled data.
- The Catch: While the AI editors were great at spotting mistakes, the AI writers (generative models) were still a bit messy when trying to create the labels themselves. It's easier for an AI to say "This is wrong" than to say "Here is the perfect label."
The Big Takeaway
This paper proves that for small, under-represented languages, you don't need to wait for humans to label everything. You can use Wikipedia as a rough draft and AI Judges to polish it.
The Metaphor:
Think of building a language resource like building a house.
- Old Way: Hire a team of masons (humans) to lay every single brick by hand. Slow and expensive.
- New Way: Use a machine to dump a pile of bricks (Wikipedia data). Then, hire a very smart foreman (the AI Judge) to walk through, kick out the cracked bricks, and arrange the good ones. The house gets built 5x faster, and it's still strong enough to live in.
This approach offers a sustainable path to giving every language, even the small ones, a fair chance in the AI world.