Imagine you have a giant, dusty library filled with old Czech newspapers and books from the 1800s. You want to find every single mention of "labor strikes" or "grain shortages" without reading every single word yourself.
This paper introduces a new tool and a new challenge called Topic Localization. Think of it not just as finding a needle in a haystack, but as drawing a precise circle around the needle and explaining exactly why it's a needle, even if the needle is buried under a pile of hay that looks very similar.
Here is the breakdown of the paper using simple analogies:
1. The Problem: Finding the "Where" and "Why"
Most computer programs are good at saying, "Yes, this whole document is about strikes." But they are terrible at saying, "Here are the exact three sentences in the middle of page 4 that talk about the strike, and here are the two words on page 7 that mention it again."
- The Analogy: Imagine a teacher grading an essay.
- Old way (Document Classification): The teacher says, "This essay is about 'The Great War'." (True, but vague).
- New way (Topic Localization): The teacher uses a highlighter to mark exactly which sentences discuss the war, which ones discuss the peace treaty, and which ones are just about the weather. They also allow for overlapping highlights (a sentence can be about both war and peace).
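The "overlapping highlights" idea can be made concrete with a small sketch. This is illustrative only: the `Highlight` class and character-offset representation are my assumptions, not the paper's actual data format.

```python
# Minimal sketch: a highlight is a labeled character span, and spans from
# different topics are allowed to overlap, just like two highlighter
# colors on the same sentence. Names here are illustrative assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class Highlight:
    topic: str   # e.g. "war", "peace_treaty"
    start: int   # character offset where the highlight begins (inclusive)
    end: int     # character offset where it ends (exclusive)

def topics_at(highlights, position):
    """Return every topic whose highlight covers a character position."""
    return sorted({h.topic for h in highlights if h.start <= position < h.end})

text = "The armistice ended the fighting and opened peace talks."
highlights = [
    Highlight("war", 0, 33),           # "The armistice ended the fighting"
    Highlight("peace_treaty", 4, 57),  # overlaps the "war" span
]

print(topics_at(highlights, 10))  # position inside both spans
print(topics_at(highlights, 40))  # position inside only the treaty span
```

The key design point is that topics are independent layers over the same text, not mutually exclusive labels, which is what separates this task from ordinary classification.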
2. The New Benchmark: "CzechTopic"
The authors created a massive test set called CzechTopic. They took historical Czech documents and asked human experts to play the "highlighter game."
- The Setup: They gave the humans a topic definition (e.g., "Labor Disputes: strikes, wage demands, owner conflicts").
- The Task: The humans had to read the text and highlight every single word that fit that definition.
- The Twist: They didn't just ask one person to do it. They asked many people. Why? Because sometimes one person thinks a sentence is about a strike, and another thinks it's just about "work." The paper argues that human disagreement is normal, and computers should be judged against the average human agreement, not just one "perfect" answer.
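One simple way to turn "many annotators, imperfect agreement" into a numeric ceiling is to score every pair of annotators against each other and average the result. The token-level F1 metric below is my illustrative choice, not necessarily the exact measure the paper uses.

```python
# Hedged sketch: compute token-level F1 between every pair of human
# annotators and average it. The resulting number is the "human ceiling"
# a model is judged against. Metric choice is an assumption.

from itertools import combinations

def f1(pred_tokens, gold_tokens):
    """Token-level F1 between two sets of highlighted token indices."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    if not pred and not gold:
        return 1.0  # both left the page blank: perfect agreement
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def human_ceiling(annotations):
    """Average pairwise F1 over all annotators' highlight sets."""
    pairs = list(combinations(annotations, 2))
    return sum(f1(a, b) for a, b in pairs) / len(pairs)

# Three annotators highlighting token indices for "labor disputes":
annotators = [
    {3, 4, 5, 6},     # annotator A
    {4, 5, 6},        # annotator B: misses token 3
    {3, 4, 5, 6, 7},  # annotator C: adds token 7
]
print(round(human_ceiling(annotators), 3))  # about 0.832, not 1.0
```

Even three careful humans land well below a perfect score here, which is exactly the paper's point: a model matching this ceiling is doing about as well as people do.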
3. The Contenders: The "Big Brains" vs. The "Specialists"
The researchers tested two types of AI models to see who could play the highlighter game best:
- The Large Language Models (LLMs): Think of these as Olympic-level generalists. They are huge, smart, and know almost everything. They can chat, write poems, and code. The researchers asked them, "Can you highlight the parts about strikes?"
- The BERT Models: Think of these as specialized apprentices. They are smaller, cheaper, and trained specifically on this exact task using a "distilled" dataset (a massive practice set created by an AI mimicking humans).
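A BERT-style specialist typically sees the highlighting task reframed as token classification: each token gets a tag saying whether it begins a highlight, continues one, or sits outside. The BIO scheme below is a standard convention for this kind of model; whether the paper uses exactly this encoding is an assumption.

```python
# Hedged sketch: convert highlight spans into BIO tags, the usual training
# target for a BERT-style token classifier ("B-" begins a highlight,
# "I-" continues it, "O" is outside). Topic name and spans are made up.

def spans_to_bio(num_tokens, spans, topic):
    """spans: list of (start, end) token-index ranges, end exclusive."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        tags[start] = f"B-{topic}"
        for i in range(start + 1, end):
            tags[i] = f"I-{topic}"
    return tags

tokens = ["The", "miners", "went", "on", "strike", "over", "wages", "."]
tags = spans_to_bio(len(tokens), [(1, 5)], "STRIKE")
print(list(zip(tokens, tags)))
```

The specialist is then trained to predict one tag per token, which is why it tends to be so consistent about boundaries: drawing edges is literally the only thing it was taught to do.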
4. The Results: Who Won?
The results were a mix of impressive and disappointing:
- The Generalists (LLMs): The biggest, smartest models (like GPT-5) got very close to human performance. They were great at finding the topic (e.g., "Yes, this page has strikes"). However, when it came to drawing the exact boundaries of the highlighted text, they often made mistakes. They were a bit sloppy with the edges.
- Analogy: A brilliant professor can tell you the essay is about war, but they might accidentally highlight a sentence about the weather because it sounded dramatic.
- The Specialists (BERT): Surprisingly, the smaller, specialized models did incredibly well. They weren't as smart as the giants, but because they were trained specifically to "highlight," they were very consistent. In some cases, they outperformed the smaller, less capable LLMs.
- Analogy: A specialized highlighter-wielding intern might not know the history of the war, but they are very good at following the rule: "Highlight every word about wages."
5. The Big Takeaway
The paper concludes that while AI is getting scary good at understanding text, precision is still hard.
- Even the best AI models struggle to match the consistency of a group of humans.
- How we phrase the question (the "prompt") matters less than the format we demand for the answer (e.g., asking the model to copy the matching words verbatim vs. asking it to tag each token).
- The "Human Baseline": The most important finding is that humans don't always agree with each other perfectly. Therefore, we shouldn't expect AI to be perfect either. If an AI performs as well as the average human agreement, that's a huge success.
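The output-format point above can be seen in a small sketch of the "copy the matching words" style: the model quotes the passage, and we locate the quote in the source text to recover exact offsets. The helper below and its failure mode are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch: map an LLM's quoted passage back to character offsets in
# the source text. An exact copy is recoverable; a paraphrase (a common
# LLM slip) cannot be located and the highlight is lost. Illustrative only.

def locate_span(text, quoted):
    """Return (start, end) offsets of a verbatim quote, or None."""
    start = text.find(quoted)
    if start == -1:
        return None  # model rewrote the words instead of copying them
    return (start, start + len(quoted))

text = "The miners went on strike over unpaid wages."
print(locate_span(text, "went on strike"))   # exact copy: offsets recovered
print(locate_span(text, "went on strikes"))  # paraphrase: None, unusable
```

This is one concrete reason output format dominates prompt wording: a tagging format fails gracefully at the edges, while a quoting format loses the whole span the moment a single word drifts.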
Summary in One Sentence
The authors built a new "highlighting" test using old Czech books to show that while AI is getting great at finding topics, it still struggles to draw the perfect boundaries around them, and the best way to judge it is by comparing it to a crowd of humans rather than a single "perfect" answer.