CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Imagine you are trying to learn a new language, like English. You open a giant, super-detailed dictionary called WordNet. It's like a massive library where every single word is connected to every other word it's related to. It's amazing for experts, but for a beginner? It's overwhelming.

Think of WordNet like a giant, unsorted toolbox. If you need a "hammer," the toolbox has 50 different types of hammers: a tiny one for jewelry, a heavy one for construction, a rubber one for delicate work, and so on. A beginner just needs to know what a "hammer" is for basic nails. Seeing 50 variations confuses them and makes them feel stupid.

This paper is about organizing that messy toolbox so it's friendly for learners at every stage of their journey.

Here is the story of how they did it, broken down into simple steps:

1. The Problem: Too Much Information, Too Soon

The researchers realized that WordNet doesn't care if you are a beginner or a master. It lists "bank" (the place you keep money) and "bank" (the side of a river) side by side, with nothing to tell you which meaning a learner should meet first. A beginner learning English doesn't need the river meaning yet. They need the money meaning first.

2. The Solution: The "CEFR" Ladder

They used a standard called CEFR (Common European Framework of Reference). Imagine this as a video game level system:

A1/A2: Beginner (Level 1-2)
B1/B2: Intermediate (Level 3-4)
C1/C2: Advanced/Expert (Level 5-6)

The goal was to tag every single definition in WordNet with a "Level" so a learner could say, "Show me only the Level 1 definitions for this word."
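To make the "show me only my level" idea concrete, here is a minimal Python sketch. The function name `senses_up_to` and the two toy "bank" entries are made up for illustration (the levels follow the example used in this explainer, not the actual resource).

```python
# Toy illustration only: the entries and levels below are assumptions,
# not values taken from the published resource.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def senses_up_to(senses, max_level):
    """Return the senses whose CEFR level is at or below max_level."""
    limit = CEFR_ORDER.index(max_level)
    return [s for s in senses if CEFR_ORDER.index(s["level"]) <= limit]


bank_senses = [
    {"gloss": "a place where you keep money", "level": "A1"},
    {"gloss": "the side of a river", "level": "B2"},
]

# A beginner (up to A2) sees only the money meaning.
beginner_view = senses_up_to(bank_senses, "A2")
print([s["gloss"] for s in beginner_view])
```

Raising the limit to "B2" would bring the river meaning back into view, which is the staircase idea in miniature.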

3. The Magic Tool: The AI "Translator"

They couldn't ask a human to read 155,000 words and tag them; that would take forever and cost a fortune. Instead, they used a Large Language Model (AI) as a super-smart librarian.

Here's how the AI did the work:

The Source Material: They had a "Gold Standard" list (called the English Vocabulary Profile) where humans had already decided which words are easy or hard.
The Match-Up: The AI looked at the definition of a word in WordNet and compared it to the definition in the Gold Standard list.
The Analogy: Imagine the AI is a matchmaker. It looks at a description from the Gold Standard (e.g., "a place where you keep money") and a description from WordNet (e.g., "a financial institution"). The AI asks, "Are these the same meaning?" If the AI says, "Yes, they are twins!" it copies the "Level 1" tag from the Gold Standard and sticks it onto the WordNet entry.

4. The Result: A New, Friendly Dictionary

They created a CEFR-Annotated WordNet.

Now, if you look up "bank," the system knows that the "money" meaning is Level 1 (Easy) and the "river" meaning is Level 4 (Hard).
They also built a Corpus (a giant collection of sentences) where every word is tagged with its difficulty level. It's like a library where every book has a sticker saying "Read this if you are Level 2."

5. The Test: Can the AI Teach?

To make sure their new system actually works, they built a Teacher Bot (a classifier).

They gave the bot a sentence and asked, "What level is this word?"
They tested the bot on sentences it had never seen before.
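The Score: The best bot reached a Macro-F1 score of 0.81. Macro-F1 averages performance across all six levels, so a bot can't score well just by getting the most common levels right. That is a strong result, and it suggests the labels they created carry a real, learnable signal about difficulty.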

Why This Matters (The "So What?")

For Students: It reduces "cognitive load." Instead of drowning in 50 definitions, a student sees the 2 or 3 they actually need for their current level. It's like having a personalized tutor that only shows you the right puzzle pieces for your current skill.
For Teachers: They can automatically generate reading materials that are perfectly tuned to their students' abilities.
For Tech: It bridges the gap between "Computer Science" (which loves big data) and "Language Learning" (which needs human understanding).

In a Nutshell

The researchers took a chaotic, expert-only dictionary, used a smart AI to label thousands of word meanings by difficulty level, and showed that the resulting data helps computers estimate how hard a word is for a human. It turns a giant, confusing wall of text into a staircase, letting learners climb up one step at a time without falling off.

Here is a detailed technical summary of the paper "CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning."

1. Problem Statement

WordNet, a foundational lexical database for Natural Language Processing (NLP), organizes words into hierarchical semantic networks. However, it is ill-suited for Second Language (L2) learners due to two primary issues:

Overly Fine-Grained Senses: WordNet distinguishes senses with high granularity, often creating cognitive overload for learners who must identify the specific sense appropriate for their proficiency level.
Lack of Proficiency Indicators: WordNet does not map its senses to the Common European Framework of Reference for Languages (CEFR) (A1–C2), making it difficult to determine which vocabulary items are appropriate for beginners versus advanced learners.

Existing resources either lack sense-level granularity (assigning levels to whole words rather than specific senses) or are too small to be practical for large-scale NLP applications. There is a need for a resource that bridges NLP semantic networks with educational proficiency standards.

2. Methodology

The authors propose an automated pipeline to annotate WordNet senses with CEFR levels using Large Language Models (LLMs). The process involves three main stages:

A. Data Alignment and Annotation (The Pipeline)

Source Extraction:
- Target: WordNet glosses (definitions) for specific <word, Part-of-Speech> pairs.
- Reference: The English Vocabulary Profile (EVP) Online, which provides CEFR levels for specific word senses along with their glosses.
Semantic Similarity Measurement:
- An LLM (GPT-4o) is prompted to compare a WordNet gloss ( $g'$ ) with an EVP gloss ( $g$ ).
- The LLM rates the similarity on a 7-point scale (1 = "Exactly the same" to 7 = "Completely different").
- Thresholding: A conservative threshold of $\le 2$ is used. If the score is 1 or 2, the CEFR level of the EVP gloss is transferred to the WordNet sense.
Resulting Resource: This process generates a CEFR-Annotated WordNet covering 10,644 senses across 5,645 lemmas.
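The level-transfer rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `transfer_level` and `toy_rater` are hypothetical names, the rater is a hard-coded stand-in for the GPT-4o prompt, and choosing the lowest-scoring EVP gloss when several pass the threshold is an assumption the summary does not confirm.

```python
SIMILARITY_THRESHOLD = 2  # scores of 1 or 2 count as a match, as described above


def transfer_level(wordnet_gloss, evp_entries, rate_similarity):
    """Return the CEFR level of the closest EVP gloss, or None if none passes.

    evp_entries: list of (evp_gloss, cefr_level) pairs for the same <word, POS>.
    rate_similarity: callable returning 1 (exactly the same) .. 7 (completely different).
    """
    best = None  # (score, level) of the closest match so far
    for evp_gloss, level in evp_entries:
        score = rate_similarity(wordnet_gloss, evp_gloss)
        if score <= SIMILARITY_THRESHOLD and (best is None or score < best[0]):
            best = (score, level)
    return best[1] if best else None


def toy_rater(wordnet_gloss, evp_gloss):
    """Stand-in for the LLM call: hard-coded scores, for demonstration only."""
    scores = {
        ("a financial institution that accepts deposits",
         "a place where you keep money"): 2,
        ("a financial institution that accepts deposits",
         "the side of a river"): 7,
    }
    return scores.get((wordnet_gloss, evp_gloss), 7)


evp_bank = [("a place where you keep money", "A1"), ("the side of a river", "B2")]
level = transfer_level("a financial institution that accepts deposits", evp_bank, toy_rater)
print(level)
```

A WordNet gloss with no EVP counterpart scoring 1 or 2 stays unlabeled, which is why the conservative threshold trades coverage for precision.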

B. Corpus Construction

The authors mapped these new CEFR annotations onto the SemCor 3.0 corpus (a widely used sense-annotated corpus).
This resulted in the SemCor-CEFR corpus, containing over 110,000 sense-level annotations with CEFR levels, significantly expanding the available data for training language models.
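Conceptually, this mapping is a lookup from each token's sense tag to the level assigned to that sense. A minimal sketch, assuming a simplified token format: the function name `annotate_tokens`, the sense identifiers, and the levels are placeholders for illustration and do not reproduce SemCor's actual file format or the released annotations.

```python
def annotate_tokens(tagged_tokens, sense_levels):
    """Attach a CEFR level to each (word, sense_id) token.

    Tokens with no sense tag, or with a sense that has no CEFR
    annotation, receive None as their level.
    """
    return [(word, sense, sense_levels.get(sense)) for word, sense in tagged_tokens]


# Hypothetical sense-to-level table (placeholder identifiers and levels).
sense_levels = {"bank.n.02": "A1", "deposit.v.02": "B1"}

# A toy sense-tagged sentence; function words carry no sense tag here.
sentence = [("She", None), ("deposited", "deposit.v.02"),
            ("money", "money.n.01"), ("at", None),
            ("the", None), ("bank", "bank.n.02")]

annotated = annotate_tokens(sentence, sense_levels)
for word, sense, level in annotated:
    print(word, sense, level)
```

Only tokens whose sense received a level in the annotation step contribute to the corpus, so coverage of the corpus follows directly from coverage of the annotated senses.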

C. Validation via Classification

To verify the quality of the automatically generated labels without manual expert review, the authors trained Contextual Lexical CEFR-Level Classifiers.

Task: Predict the CEFR level of a target word given its context sentence.
Models Tested:
- Baseline: ME6 Contextual (BERT-based SVC).
- LLM Approaches: Zero-shot, Few-shot (6-shot, 18-shot), and Fine-tuned (GPT-4.1 mini).
- Hybrid Approach (FT+KB): Combining fine-tuned LLMs with a Knowledge Base (KB) for words with unambiguous, single-level CEFR definitions.
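The hybrid idea can be sketched as a lookup with a model fallback. This is an illustrative sketch only: `build_kb`, `predict_level`, and `stub_model` are hypothetical names, the stub replaces the fine-tuned LLM, and the exact way the knowledge base is built in the paper may differ from the single-level filter assumed here.

```python
def build_kb(word_levels):
    """Keep only (word, POS) pairs whose senses all share a single CEFR level."""
    return {key: next(iter(levels))
            for key, levels in word_levels.items()
            if len(levels) == 1}


def predict_level(word, pos, sentence, kb, model):
    """Answer from the knowledge base when possible; otherwise ask the model."""
    key = (word, pos)
    if key in kb:
        return kb[key]            # unambiguous entry: no model inference needed
    return model(word, sentence)  # ambiguous or unknown: fall back to the classifier


def stub_model(word, sentence):
    """Placeholder for the fine-tuned classifier; always answers 'B1'."""
    return "B1"


# Toy data: "cat" has one level across its senses, "bank" has two.
word_levels = {("cat", "n"): {"A1"}, ("bank", "n"): {"A1", "B2"}}
kb = build_kb(word_levels)

print(predict_level("cat", "n", "The cat sat on the mat.", kb, stub_model))
print(predict_level("bank", "n", "We walked along the bank.", kb, stub_model))
```

The design choice is that context only matters when a word's senses span more than one level, so the model is reserved for exactly those cases.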

3. Key Contributions

CEFR-Annotated WordNet: A new resource linking 10,644 WordNet senses to CEFR levels, covering ~80% of single-word senses in the EVP.
Prompt-Only LLM Annotation: A novel, low-cost, and reproducible method for semantic annotation using LLMs to measure gloss similarity, eliminating the need for labor-intensive manual annotation.
SemCor-CEFR Corpus: A large-scale corpus (>110k instances) integrating sense disambiguation with proficiency levels, bridging NLP and educational technology.
High-Performance Classifiers: Development of classifiers that predict CEFR levels from context with a Macro-F1 of up to 0.81, which supports the quality of the automated annotations.

4. Experimental Results

The study evaluated classifiers trained on various datasets (EVP only, SemCor-CEFR only, and a Mixture of both).

Performance Metrics:
- The Fine-Tuned + Knowledge Base (FT+KB) model trained on the Mixture dataset achieved the best performance with a Macro-F1 score of 0.81.
- This outperformed the baseline ME6 Contextual model (Macro-F1 ~0.61) and zero-shot/few-shot LLMs (Macro-F1 < 0.50).
Generalizability:
- Models trained on the SemCor-CEFR corpus showed a moderate positive correlation (Spearman $\rho \approx 0.54$ ) with the CompLex 2.0 dataset (a lexical complexity benchmark), whereas models trained solely on EVP data showed weak correlation. This suggests the SemCor-CEFR corpus captures broader, genre-independent complexity patterns.
Error Analysis:
- Misclassifications were predominantly adjacent-level errors (e.g., predicting B2 for a true B1), which are less disruptive for learners than non-adjacent errors.
- The hybrid approach (FT+KB) improved accuracy for words with unambiguous levels by bypassing LLM inference for known entries.
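For readers less familiar with the two quantities used above, the following sketch computes Macro-F1 and the share of errors that are off by exactly one level. The gold and predicted labels are invented toy data, not results from the paper, and skipping levels that appear in neither list is a convention assumed here.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def macro_f1(gold, pred, labels=LEVELS):
    """Unweighted mean of per-level F1 over levels seen in gold or pred."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        if tp + fp + fn == 0:
            continue  # level absent from both lists: skipped (assumed convention)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def adjacent_error_share(gold, pred):
    """Fraction of misclassifications that are exactly one CEFR level away."""
    errors = [(g, p) for g, p in zip(gold, pred) if g != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(LEVELS.index(g) - LEVELS.index(p)) == 1 for g, p in errors)
    return adjacent / len(errors)


# Invented toy labels: one adjacent error (B1 -> B2), one distant error (C1 -> A2).
gold = ["A1", "A2", "B1", "B1", "B2", "C1"]
pred = ["A1", "A2", "B2", "B1", "B2", "A2"]

print(round(macro_f1(gold, pred), 2))    # 0.6
print(adjacent_error_share(gold, pred))  # 0.5
```

Because Macro-F1 weights every level equally, the single missed C1 item pulls the toy score down as much as an error on a frequent level would, which is why it is a stricter summary than plain accuracy on imbalanced CEFR data.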

5. Significance and Implications

Bridging NLP and Education: The work successfully integrates structured semantic networks (WordNet) with pedagogical standards (CEFR), enabling the creation of adaptive learning tools.
Scalability: The automated LLM-based annotation pipeline allows for the rapid expansion of proficiency-annotated resources to other languages or dictionaries without prohibitive manual costs.
Practical Application: The resulting classifiers and corpus can power:
- Adaptive Vocabulary Learning: Presenting words/senses matched to a learner's current level.
- Text Simplification: Identifying complex lexical items in texts for scaffolding.
- Automated Assessment: Estimating the proficiency level of learner-generated text.
Resource Availability: All resources (Annotated WordNet, SemCor-CEFR corpus, and classifiers) are publicly available, fostering further research in Computer-Assisted Language Learning (CALL).

In conclusion, this paper demonstrates that LLMs can automate the alignment of a complex lexical database with an educational proficiency standard, producing large-scale resources whose quality is supported by downstream classifier performance and that can serve as a basis for language learning technologies.