Structure-Aware Text Recognition for Ancient Greek Critical Editions

This paper addresses the limitations of vision-language models in recognizing the complex layouts of Ancient Greek critical editions. It introduces a large-scale synthetic corpus and a real-world benchmark, and shows that while zero-shot performance lags behind traditional tools, fine-tuned models such as Qwen3-VL-8B can achieve state-of-the-art accuracy.

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

Published 2026-03-04

Imagine you have a stack of ancient, dusty books written in a language that hasn't been spoken for thousands of years: Ancient Greek. But these aren't just simple stories; they are critical editions. Think of them as the "ultimate annotated versions" of these texts.

If you open a modern novel, it's just text. But open one of these ancient books, and it looks like a chaotic city map. The main story is there, but it's surrounded by:

  • Marginal notes: Tiny comments in the side margins explaining difficult words.
  • Reference markers: Little numbers or letters scattered everywhere that point to other parts of the book (like "see page 42, line 5").
  • Complex layouts: Text that jumps between columns, headers that look like footers, and symbols that act as both letters and numbers.

The Problem:
For decades, computers have been terrible at reading these. Traditional "Optical Character Recognition" (OCR) software is like a very literal robot. It sees a squiggle and says, "That's an 'A'." But it gets confused by the layout. It might read a footnote as part of the main story, or miss a crucial reference marker entirely. It's like trying to read a newspaper while someone keeps shuffling the columns and pasting sticky notes all over the page.

The Solution (The Paper's Big Idea):
The researchers in this paper asked: What if we gave the computer a brain that understands not just the letters, but the "skeleton" of the page?

They used a new type of AI called a Vision-Language Model (VLM). Think of these models as super-smart students who have read millions of books and can "see" an image and understand the story it tells, rather than just matching shapes to letters.

Here is how they did it, broken down into simple steps:

1. Building a "Training Gym" (Synthetic Data)

You can't teach a student to read ancient Greek just by showing them a few real books; there aren't enough, and they are too messy. So, the researchers built a giant video game.

  • They took thousands of clean, digital versions of these ancient texts.
  • They programmed a computer to "print" them out in millions of different ways, just like a printer that changes fonts, paper sizes, and column layouts every time it prints.
  • They created 185,000 fake pages. This is their "training gym." The AI practiced on these fake pages, learning to spot the difference between a main paragraph and a side note, even when the layout looked weird.
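The "training gym" idea above can be sketched in a few lines of code: take one clean digital text, then stamp out many different synthetic page descriptions with randomized fonts, column counts, and note placements. This is only an illustrative sketch of the concept; the function and field names (`make_page`, `regions`, the font list) are assumptions, not the paper's actual pipeline.

```python
import random

# Illustrative font names; the real corpus would use actual Greek typefaces.
FONTS = ["GFS Didot", "Porson-like", "Teubner-like"]

def make_page(main_text, notes, seed):
    """Describe one synthetic 'page' with a randomized layout (hypothetical schema)."""
    rng = random.Random(seed)  # seeded, so every page is reproducible
    n_cols = rng.choice([1, 2])
    page = {"font": rng.choice(FONTS), "columns": n_cols, "regions": []}

    # Split the main text across the chosen number of columns.
    words = main_text.split()
    per_col = max(1, len(words) // n_cols)
    for i in range(n_cols):
        lo = i * per_col
        hi = None if i == n_cols - 1 else (i + 1) * per_col
        page["regions"].append({"type": "main", "column": i,
                                "text": " ".join(words[lo:hi])})

    # Scatter the annotations into randomly chosen margins.
    for note in notes:
        side = rng.choice(["left", "right", "foot"])
        page["regions"].append({"type": "note", "margin": side, "text": note})
    return page

# Same source text, three different synthetic layouts.
corpus = [make_page("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
                    ["v. 1: μῆνιν codd."], seed=s)
          for s in range(3)]
```

Scaling this trick to 185,000 pages gives the model endless practice at telling a main paragraph from a side note, whatever the layout looks like.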

2. The "Real World" Test (The Benchmark)

After the AI practiced on the fake pages, they tested it on 450 real, scanned pages from actual ancient books. These were the "final exams." The texts themselves ranged from the 5th century BC to the 14th century AD, and the editions that printed them spanned hundreds of years of different typographic styles.

3. The Results: A Tale of Two Models

They tested three different "students" (AI models):

  • The Old Guard (Traditional OCR): These are the reliable, old-school robots. They are okay at reading the letters but often get lost in the layout. They are like a librarian who can read the words but can't find the right shelf.
  • The New Kids (Vision-Language Models):
    • Some of the new AI models were confused. They tried to be too creative, "hallucinating" (making up) words or getting the layout completely wrong. It's like a student who knows the vocabulary but writes a completely different story than the one on the page.
    • The Star Performer (Qwen3-VL-8B): This model was the standout. When they gave it the "training gym" data first, and then a little bit of real data, it became a master.
      • It achieved a 99% accuracy rate on the letters.
      • Crucially, it didn't just read the words; it understood the structure. It knew which text was a title, which was a footnote, and which was a reference marker.
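That "99% accuracy on the letters" is typically measured as character error rate (CER): the edit distance between what the model read and the true text, divided by the true text's length, so 99% accuracy corresponds to a CER of about 1%. Here is a minimal, self-contained CER implementation using the standard Levenshtein dynamic program (a generic sketch, not the paper's evaluation code):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / length of the reference."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # row 0: distance from empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution (or match)
        prev = cur
    return prev[n] / max(1, m)

# One wrong character out of five -> CER of 0.2 (80% character accuracy).
score = cer("μῆνιν", "μῆνιv")
```

On a real benchmark this would be averaged over all 450 test pages, with a separate metric for whether each region was assigned the right structural label.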

Why Does This Matter?

Imagine you are a historian trying to study a text.

  • Before: You had to manually type out the book, then manually type out all the footnotes, then manually link the references. It took months.
  • Now: This AI can look at a scanned page and output a digital file where the text, the footnotes, and the references are already separated and organized correctly.
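What might such an organized digital file look like? A plausible shape is a list of labeled regions, where each recognized piece of text carries its role on the page and reference markers link to the footnotes they point at. The schema below (the `Region` dataclass, its field names, the `fn-1` link) is an illustrative assumption, not the paper's actual output format:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Region:
    role: str                       # "title", "main", "ref_marker", "footnote", ...
    text: str
    links_to: Optional[str] = None  # e.g. a reference marker pointing at its footnote

# One recognized page, with structure already separated out.
page = [
    Region("title", "ΙΛΙΑΔΟΣ Α"),
    Region("main", "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος ..."),
    Region("ref_marker", "1", links_to="fn-1"),
    Region("footnote", "1. μῆνιν codd."),
]

# Serialize to a searchable digital file, keeping the Greek readable.
as_json = json.dumps([asdict(r) for r in page], ensure_ascii=False, indent=2)
```

Because every region is labeled, a historian can filter for just the footnotes, or follow a marker straight to its apparatus entry, instead of retyping and re-linking everything by hand.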

The Catch:
The researchers found that while these super-smart AI models are amazing, they are also heavy and expensive to run (like driving a massive truck to deliver a single letter). Sometimes, a smaller, simpler tool (the old-school robot) is actually more stable and less likely to make up fake facts.

The Bottom Line

This paper is a breakthrough because it proves that AI can finally "see" the structure of complex, ancient books, not just the words. It's like teaching a computer to understand that a footnote is a footnote, and a chapter title is a title, even when the page looks like a chaotic mess. This opens the door to digitizing thousands of ancient texts automatically, preserving them for future generations in a way that is both accurate and easy to search.