SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Imagine you are walking into a massive library in Slovakia. This library contains nearly 230,000 student theses (like long school projects) covering everything from farming to engineering.

The problem? There is no index card system. If you want to find a paper about "Rural Development," you can't just search for those exact words. Why? Because in the Slovak language, words change their shape depending on how they are used in a sentence.

The Author's Label: The student writes the key phrase as "Rozvojový potenciál" (Development potential).
The Text's Shape: Inside the essay, the author might write "rozvojového potenciálu" (of the development potential).

To a computer, these look like two completely different phrases. It's like trying to find a book labeled "Cat" when the text inside only says "cats" or "cat's."

This paper, called SlovKE, is a massive effort to fix this mess and teach computers how to understand Slovak better. Here is the breakdown in simple terms:

1. The Big Data Cleanup (The "Spring Cleaning")

Before this study, researchers only had a tiny, messy pile of about 9,000 Slovak documents. It was like trying to learn a language by reading a single page of a dictionary.

The authors went to the official Slovak government database and downloaded nearly 794,000 records. But it was a disaster zone:

Some were in English, not Slovak.
Some had the student's name and "Abstract" written inside the text, confusing the computer.
Some key phrases were just a giant block of text glued to the end of the paragraph.

They built a "digital janitor" to clean this up. They removed the noise, fixed the formatting, and ended up with a clean, high-quality dataset of 227,432 documents. That is about 25 times bigger than the previous Slovak dataset. It's like upgrading from a small garden hose to a firehose of data.

2. The Old Way vs. The New Way (The "Copy-Paste" vs. The "Translator")

The researchers tested two types of computer programs to see if they could find the right key phrases.

The Old Way (The "Copy-Paste" Robots):
They used standard tools (like YAKE and TextRank). These robots work by scanning the text and copying words exactly as they appear.

The Problem: If the text says "of the development potential," the robot copies "of the development potential."
The Result: The robot fails to match the author's label "Development potential." It's like a librarian who refuses to shelve a book unless the spine says exactly what the catalog card says, ignoring that the book is the same one.
The Score: These robots were terrible at matching exactly (only about 11% success), though they were okay at finding some matching words (about 51%).

The New Way (The "Smart Translator" Robot):
They used a Large Language Model (KeyLLM), which is like a very smart AI that can write its own answers.

The Superpower: Instead of copying words, the AI reads the whole essay, understands the meaning, and then writes the key phrase in its "dictionary form" (the clean, standard version).
The Result: Even if the text says "of the development potential," the AI writes "Development potential."
The Score: This robot did better at exact matching (about 15%, versus about 11% for the best copy-paste robot), narrowing the gap between exact and partial scores. It suggests that AI that generates text handles languages where words change shape better than AI that just copies text.

3. The Human Check (The "Teacher's Grading")

Computers can be tricky. The researchers asked humans to grade 100 of these essays to see what the computers were actually doing.

The Surprise: The "Copy-Paste" robots were often right about the topic, but the computer grading system marked them wrong because the spelling didn't match perfectly.
The AI's Strength: The Smart Translator (KeyLLM) was great at understanding the big picture. It could spot concepts the original author forgot to list as keywords (like "psychological consequences of obesity" in a paper about obesity).
The AI's Weakness: Sometimes, the Smart Translator got a bit too creative and pulled out random adjectives that didn't really mean much on their own.

Why Does This Matter?

This paper is a huge deal for three reasons:

It's a Gift to the World: They made this massive, clean dataset free for anyone to use. This is the foundation for building better Slovak AI in the future.
It Solves a "Shape-Shifter" Problem: It shows that for Slovak, simple copy-paste tools fall short, and it suggests the same holds for other languages where words change form, such as Czech, Polish, or Turkish. These languages need AI that understands grammar and can "normalize" words.
It Changes How We Measure Success: The paper shows that if you only look at "exact matches," you unfairly punish good models in these languages. We need new ways to grade them that recognize "cat," "cats," and "cat's" as the same word.

In a nutshell: The authors built a giant, clean library of Slovak school papers, showed that old computer tools struggle with the language's changing word shapes, and proved that smart AI that can "rewrite" words is the key to unlocking the true meaning of the text.

1. Problem Statement

The paper addresses the significant gap in Keyphrase Extraction (KPE) research for morphologically rich, low-resource languages, specifically focusing on Slovak.

The Core Challenge: In languages like Slovak (a West Slavic language), a single lemma (dictionary form) can appear in dozens of inflected forms (cases, numbers, genders) within a text.
The Evaluation Mismatch: Traditional extractive models retrieve surface tokens exactly as they appear in the text. However, authors typically assign keyphrases in their canonical (nominative) form. This creates a fundamental mismatch where a model correctly identifies a concept (e.g., rozvojového potenciálu - genitive) but fails an "exact match" evaluation against the author's assigned keyphrase (Rozvojový potenciál - nominative).
Data Scarcity: Prior to this work, Slovak KPE research was limited by small, noisy datasets (e.g., Zelinka, 2023, with ~9,000 documents), making it difficult to train robust models or establish reliable benchmarks comparable to English standards like KP20K.
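
The evaluation mismatch described above can be reproduced in a few lines. The following is a minimal sketch, not the paper's evaluation code: `LEMMAS` is a hypothetical, hand-written lookup table standing in for a real lemmatizer, and it covers only the two words of the example.

```python
# Toy illustration of the surface-form vs. canonical-form mismatch.
# LEMMAS is a hypothetical stand-in for a real lemmatizer (assumption).
LEMMAS = {
    "rozvojového": "rozvojový",
    "potenciálu": "potenciál",
}

def normalize(phrase: str) -> str:
    """Lowercase a phrase and map each token to its lemma when known."""
    tokens = phrase.lower().split()
    return " ".join(LEMMAS.get(tok, tok) for tok in tokens)

gold = "Rozvojový potenciál"          # author-assigned keyphrase (nominative)
extracted = "rozvojového potenciálu"  # surface form found in the text (genitive)

# Plain string comparison treats the two forms as unrelated.
surface_match = gold.lower() == extracted.lower()
# Comparing normalized forms recovers the shared concept.
lemma_match = normalize(gold) == normalize(extracted)

print(surface_match)  # False
print(lemma_match)    # True
```

An extractive model can only return the second string, so any scoring that compares raw strings counts a correct concept as a miss.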

2. Methodology

A. Dataset Construction (SlovKE)

The authors constructed SlovKE, a large-scale dataset derived from the Slovak Central Register of Theses.

Scale: The dataset contains 227,432 scientific abstracts with author-assigned keyphrases, representing a 25-fold increase over previous Slovak resources.
Cleaning Pipeline: A rigorous multi-stage cleaning process was applied to 793,722 scraped records:
1. Deduplication: Prioritizing records with complete abstracts and keyphrases.
2. Metadata Removal: Stripping noise (author names, thesis types, page counts) prepended to abstracts.
3. Keyphrase Recovery: Extracting keyphrases appended to the end of abstracts by universities lacking dedicated fields.
4. Language Verification: Using the lingua library to filter out English abstracts mislabeled as Slovak (20% of initial Slovak-labeled data).
5. Normalization: Using Stanza POS tagging to split concatenated lists and enforce a max length of 4 words.
6. Filtering: Retaining abstracts between 500–2,000 characters with 4–15 keyphrases.
Split: Randomly split into Training (80%), Validation (10%), and Test (10%).
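
The filtering and splitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, over-long keyphrases are assumed to be dropped rather than truncated, and the random seed is arbitrary.

```python
import random

def keep_record(abstract: str, keyphrases: list[str]) -> bool:
    """Apply the thresholds from the filtering step:
    500-2,000 characters and 4-15 keyphrases."""
    return 500 <= len(abstract) <= 2000 and 4 <= len(keyphrases) <= 15

def trim_keyphrases(keyphrases: list[str], max_words: int = 4) -> list[str]:
    """Keep keyphrases of at most max_words whitespace-separated tokens.
    Dropping (rather than truncating) longer ones is an assumption."""
    return [kp for kp in keyphrases if len(kp.split()) <= max_words]

def split_dataset(records: list, seed: int = 0):
    """Shuffle a copy of records and split it 80% / 10% / 10%."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    held_out = shuffled[n_train + n_val:]  # remainder becomes the test set
    return train, val, held_out
```

For example, `split_dataset(list(range(100)))` returns partitions of 80, 10, and 10 items that together contain every input exactly once.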

B. Models Evaluated

The study benchmarks three categories of models:

Unsupervised Extractive Baselines:
- YAKE: Statistical method based on term frequency and position.
- TextRank: Graph-based method using PageRank.
- KeyBERT: Embedding-based method using SlovakBERT (kinit/slovakbert-sts-stsb) to measure cosine similarity.
- Note: All three are extractive, meaning they return surface tokens from the text, making them vulnerable to morphological mismatches.
Generative LLM Approach:
- KeyLLM: Uses GPT-3.5-turbo to generate keyphrases via prompting. Unlike extractive models, it can generate canonical forms regardless of the text's surface morphology.
- Clustering Strategy: The authors tested embedding-based clustering (using Sentence Transformers) to group similar documents and reduce API costs, finding that higher similarity thresholds (0.90) or no clustering yielded the best results.
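
To make the role of the similarity threshold concrete, here is a minimal sketch of threshold-based grouping over document embeddings. It is not KeyLLM's or Sentence Transformers' actual implementation: the greedy strategy, the function names, and the two-dimensional toy vectors are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_by_threshold(embeddings, threshold=0.90):
    """Greedy grouping: a document joins the first cluster whose
    representative (its first member) is at least `threshold` similar;
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(emb, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy 2-D "embeddings" (hypothetical values).
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.7, 0.7)]

print(cluster_by_threshold(toy, threshold=0.90))  # [[0, 1], [2], [3]]
print(cluster_by_threshold(toy, threshold=0.70))  # [[0, 1, 3], [2]]
```

If one LLM call is made per cluster, a lower threshold means fewer calls, but documents that are only loosely related end up sharing keyphrases. That trade-off is consistent with the reported finding that a high threshold or no clustering worked best.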

C. Evaluation Metrics

To address the morphological challenge, the authors utilized two matching strategies:

Exact Match: Strict string matching (after lemmatization).
Partial Match: Counts a match if any fragment of the extracted keyphrase overlaps with the gold standard.
Metric: F1@k (specifically F1@6).
Manual Evaluation: A subset of 100 documents was manually annotated to assess semantic relevance beyond automated matching, measuring inter-annotator agreement ($\kappa = 0.61$).
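
The two matching strategies and F1@k can be sketched as below. This is a hedged illustration rather than the paper's scoring script: reading "fragment overlap" as sharing at least one whitespace-separated token is an assumption, lemmatization is omitted, and the example phrases are invented for the demonstration.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive string equality."""
    return pred.lower() == gold.lower()

def partial_match(pred: str, gold: str) -> bool:
    """True if the phrases share at least one token (assumed definition)."""
    return bool(set(pred.lower().split()) & set(gold.lower().split()))

def f1_at_k(predicted, gold, k=6, match=exact_match):
    """F1 over the top-k predictions against the gold keyphrases."""
    top_k = predicted[:k]
    if not top_k or not gold:
        return 0.0
    # Precision: share of predictions that match some gold phrase.
    precision = sum(any(match(p, g) for g in gold) for p in top_k) / len(top_k)
    # Recall: share of gold phrases matched by some prediction.
    recall = sum(any(match(p, g) for p in top_k) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["rozvojový potenciál", "rozvoj vidieka", "cestovný ruch"]
predicted = ["rozvojového potenciálu", "vidieka", "cestovný ruch v regióne"]

print(f1_at_k(predicted, gold, match=exact_match))              # 0.0
print(round(f1_at_k(predicted, gold, match=partial_match), 3))  # 0.667
```

In this toy case every prediction is topically correct, yet exact match scores zero, while token overlap credits two of the three predictions.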

3. Key Contributions

SlovKE Dataset: The release of the largest Slovak KPE dataset (227K documents), comparable in scale and statistical properties to English benchmarks like KP20K.
Quantification of Morphological Bias: The study provides empirical evidence that the gap between Exact Match and Partial Match scores is a diagnostic metric for morphological richness. In Slovak, this gap is ~40 points, whereas it is minimal in English.
LLM Superiority in Morphology: Demonstration that a generative model (KeyLLM) outperforms the extractive baselines on exact match (15.2% versus 11.6% F1@6) by generating canonical forms, while remaining comparable on partial match.
Error Analysis: Identification of specific failure modes:
- Extractive models: Fail due to morphological mismatch (surface form vs. canonical form).
- Generative models: Fail due to "unmotivated adjective extraction" (extracting adjectives without nouns).

4. Results

Baseline Performance (Extractive Models)

Exact Match F1@6: Very low, peaking at 11.6% (YAKE).
Partial Match F1@6: Much higher, peaking at 51.5% (TextRank).
The Gap: The ~40-point difference indicates that standard exact-match metrics severely underestimate the performance of extractive models in inflected languages because they penalize correct concepts expressed in different grammatical cases.

KeyLLM Performance

Exact Match F1@6: Achieved ~15.2%, a substantial improvement over the best baseline (YAKE at 11.6%).
Partial Match F1@6: Achieved ~49.1%, comparable to the best baselines.
Gap Reduction: KeyLLM narrowed the exact-partial gap by roughly 30% compared to YAKE. This confirms that generative models can produce the canonical lemma forms required for better evaluation scores.
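
The gaps quoted above follow directly from the reported scores, as this short calculation shows. The variable names are illustrative. The baseline gap pairs the best exact score (YAKE) with the best partial score (TextRank), which is how the ~40-point figure is quoted; YAKE's own partial-match score is not given in this summary, so the per-model comparison against YAKE is not recomputed here.

```python
# F1@6 scores in percent, as reported in the Results section.
best_baseline_exact = 11.6    # YAKE
best_baseline_partial = 51.5  # TextRank
keyllm_exact = 15.2
keyllm_partial = 49.1

# Exact-partial gap for the best baseline figures and for KeyLLM.
baseline_gap = round(best_baseline_partial - best_baseline_exact, 1)
keyllm_gap = round(keyllm_partial - keyllm_exact, 1)
# Absolute exact-match improvement of KeyLLM over the best baseline.
exact_gain = round(keyllm_exact - best_baseline_exact, 1)

print(baseline_gap)  # 39.9
print(keyllm_gap)    # 33.9
print(exact_gain)    # 3.6
```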

Manual Evaluation Insights

Semantic Relevance: Manual evaluation revealed that automated metrics (even partial match) miss semantically relevant concepts. KeyLLM successfully extracted concepts like named entities and methodologies that were discussed in the text but omitted from the author's keyphrase list.
Failure Modes:
- YAKE: Struggled with word order and inflection; increasing n-gram range to (1,3) degraded performance by prioritizing irrelevant trigrams.
- KeyLLM: Prone to extracting standalone adjectives lacking context, which lowered precision as the number of extracted keyphrases ($k$) increased.

5. Significance and Conclusion

Benchmarking Standard: SlovKE establishes a new benchmark for evaluating KPE in Slovak and a reference point for other morphologically rich languages, showing that exact-match metrics alone are insufficient and that partial matching or manual evaluation is needed for fair assessment.
Generative vs. Extractive: The paper provides strong evidence that generative LLMs are better suited for morphologically rich languages than traditional extractive methods because they can normalize surface forms to canonical lemmas without explicit morphological analyzers.
Future Directions: The dataset enables future supervised training (fine-tuning SlovakBERT or generative models) and cross-lingual transfer to related languages (Czech, Polish).
Availability: The dataset and code are publicly available on Hugging Face and GitHub, fostering further research in low-resource Slavic NLP.

In summary, this work bridges the gap between the theoretical potential of LLMs and the practical challenges of inflected languages, providing the infrastructure (SlovKE) and empirical evidence needed to advance keyphrase extraction in the Slavic linguistic family.