Structure-Aware Text Recognition for Ancient Greek Critical Editions

This paper addresses the limitations of vision-language models in recognizing the complex layouts of Ancient Greek critical editions. It introduces a large-scale synthetic corpus and a real-world benchmark, and shows that while zero-shot performance lags behind traditional tools, fine-tuned models such as Qwen3-VL-8B can achieve state-of-the-art accuracy.

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

Published 2026-03-04

Imagine you have a stack of ancient, dusty books written in a language that hasn't been spoken for thousands of years: Ancient Greek. But these aren't just simple stories; they are critical editions. Think of them as the "ultimate annotated versions" of these texts.

If you open a modern novel, it's just text. But open one of these ancient books, and it looks like a chaotic city map. The main story is there, but it's surrounded by:

  • Marginal notes: Tiny comments in the side margins explaining difficult words.
  • Reference markers: Little numbers or letters scattered everywhere that point to other parts of the book (like "see page 42, line 5").
  • Complex layouts: Text that jumps between columns, headers that look like footers, and symbols that act as both letters and numbers.

The Problem:
For decades, computers have been terrible at reading these. Traditional "Optical Character Recognition" (OCR) software is like a very literal robot. It sees a squiggle and says, "That's an 'A'." But it gets confused by the layout. It might read a footnote as part of the main story, or miss a crucial reference marker entirely. It's like trying to read a newspaper while someone keeps shuffling the columns and pasting sticky notes all over the page.

The Solution (The Paper's Big Idea):
The researchers in this paper asked: What if we gave the computer a brain that understands not just the letters, but the "skeleton" of the page?

They used a new type of AI called a Vision-Language Model (VLM). Think of these models as super-smart students who have read millions of books and can "see" an image and understand the story it tells, rather than just matching shapes to letters.

Here is how they did it, broken down into simple steps:

1. Building a "Training Gym" (Synthetic Data)

You can't teach a student to read ancient Greek just by showing them a few real books; there aren't enough, and they are too messy. So, the researchers built a giant video game.

  • They took thousands of clean, digital versions of these ancient texts.
  • They programmed a computer to "print" them out in millions of different ways, just like a printer that changes fonts, paper sizes, and column layouts every time it prints.
  • They created 185,000 fake pages. This is their "training gym." The AI practiced on these fake pages, learning to spot the difference between a main paragraph and a side note, even when the layout looked weird.
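The "training gym" idea above can be sketched in a few lines of code: take one clean digital text, then stamp out many different synthetic page descriptions with randomized fonts, column counts, and note placements. This is only an illustrative sketch of the concept; the function and field names (`make_page`, `regions`, the font list) are assumptions, not the paper's actual pipeline.

```python
import random

# Illustrative font names; the real corpus would use actual Greek typefaces.
FONTS = ["GFS Didot", "Porson-like", "Teubner-like"]

def make_page(main_text, notes, seed):
    """Describe one synthetic 'page' with a randomized layout (hypothetical schema)."""
    rng = random.Random(seed)  # seeded, so every page is reproducible
    n_cols = rng.choice([1, 2])
    page = {"font": rng.choice(FONTS), "columns": n_cols, "regions": []}

    # Split the main text across the chosen number of columns.
    words = main_text.split()
    per_col = max(1, len(words) // n_cols)
    for i in range(n_cols):
        lo = i * per_col
        hi = None if i == n_cols - 1 else (i + 1) * per_col
        page["regions"].append({"type": "main", "column": i,
                                "text": " ".join(words[lo:hi])})

    # Scatter the annotations into randomly chosen margins.
    for note in notes:
        side = rng.choice(["left", "right", "foot"])
        page["regions"].append({"type": "note", "margin": side, "text": note})
    return page

# Same source text, three different synthetic layouts.
corpus = [make_page("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
                    ["v. 1: μῆνιν codd."], seed=s)
          for s in range(3)]
```

Scaling this trick to 185,000 pages gives the model endless practice at telling a main paragraph from a side note, whatever the layout looks like.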

2. The "Real World" Test (The Benchmark)

After the AI practiced on the fake pages, they tested it on 450 real, scanned pages from actual ancient books. These were the "final exams." The texts themselves ranged from the 5th century BC to the 14th century AD, and the editions that printed them spanned hundreds of years of different typographic styles.

3. The Results: A Tale of Two Models

They tested three different "students" (AI models):

  • The Old Guard (Traditional OCR): These are the reliable, old-school robots. They are okay at reading the letters but often get lost in the layout. They are like a librarian who can read the words but can't find the right shelf.
  • The New Kids (Vision-Language Models):
    • Some of the new AI models were confused. They tried to be too creative, "hallucinating" (making up) words or getting the layout completely wrong. It's like a student who knows the vocabulary but writes a completely different story than the one on the page.
    • The Star Performer (Qwen3-VL-8B): This model was the standout. When they gave it the "training gym" data first, and then a little bit of real data, it became a master.
      • It achieved a 99% accuracy rate on the letters.
      • Crucially, it didn't just read the words; it understood the structure. It knew which text was a title, which was a footnote, and which was a reference marker.
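That "99% accuracy on the letters" is typically measured as character error rate (CER): the edit distance between what the model read and the true text, divided by the true text's length, so 99% accuracy corresponds to a CER of about 1%. Here is a minimal, self-contained CER implementation using the standard Levenshtein dynamic program (a generic sketch, not the paper's evaluation code):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / length of the reference."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # row 0: distance from empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution (or match)
        prev = cur
    return prev[n] / max(1, m)

# One wrong character out of five -> CER of 0.2 (80% character accuracy).
score = cer("μῆνιν", "μῆνιv")
```

On a real benchmark this would be averaged over all 450 test pages, with a separate metric for whether each region was assigned the right structural label.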

Why Does This Matter?

Imagine you are a historian trying to study a text.

  • Before: You had to manually type out the book, then manually type out all the footnotes, then manually link the references. It took months.
  • Now: This AI can look at a scanned page and output a digital file where the text, the footnotes, and the references are already separated and organized correctly.
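What might such an organized digital file look like? A plausible shape is a list of labeled regions, where each recognized piece of text carries its role on the page and reference markers link to the footnotes they point at. The schema below (the `Region` dataclass, its field names, the `fn-1` link) is an illustrative assumption, not the paper's actual output format:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Region:
    role: str                       # "title", "main", "ref_marker", "footnote", ...
    text: str
    links_to: Optional[str] = None  # e.g. a reference marker pointing at its footnote

# One recognized page, with structure already separated out.
page = [
    Region("title", "ΙΛΙΑΔΟΣ Α"),
    Region("main", "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος ..."),
    Region("ref_marker", "1", links_to="fn-1"),
    Region("footnote", "1. μῆνιν codd."),
]

# Serialize to a searchable digital file, keeping the Greek readable.
as_json = json.dumps([asdict(r) for r in page], ensure_ascii=False, indent=2)
```

Because every region is labeled, a historian can filter for just the footnotes, or follow a marker straight to its apparatus entry, instead of retyping and re-linking everything by hand.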

The Catch:
The researchers found that while these super-smart AI models are amazing, they are also heavy and expensive to run (like driving a massive truck to deliver a single letter). Sometimes, a smaller, simpler tool (the old-school robot) is actually more stable and less likely to make up fake facts.

The Bottom Line

This paper is a breakthrough because it proves that AI can finally "see" the structure of complex, ancient books, not just the words. It's like teaching a computer to understand that a footnote is a footnote, and a chapter title is a title, even when the page looks like a chaotic mess. This opens the door to digitizing thousands of ancient texts automatically, preserving them for future generations in a way that is both accurate and easy to search.