Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

This paper introduces the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and multiple writing systems, which enables quantitative analysis of major linguistic shifts in Korean and serves as a foundational resource for training large language models.

Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh

Published 2026-03-06

Imagine the Korean language as a massive, ancient library that has been slowly changing its architecture for over a thousand years. For centuries, the books were written in a foreign language (Classical Chinese, in Chinese characters). Then writers started mixing that foreign script with a new native alphabet (Hangul), and eventually they switched almost entirely to the native alphabet.

The problem? Most modern computer programs (AI) trying to read this library only know how to read the newest books. They get completely confused by the old, mixed-up, or foreign-looking pages. Until now, there was no single, open map to help them navigate this history.

Enter the "Open Korean Historical Corpus" (OKHC).

Think of this paper as the announcement of a massive, free, digital time machine. The researchers have gathered 17.7 million documents (roughly 5.1 billion words) from the 7th century all the way to 2025. They didn't just grab modern news; they dug up ancient royal diaries, old legal codes, colonial-era newspapers, and even North Korean state media.

Here is a simple breakdown of what they did and what they found, using some everyday analogies:

1. The "Time-Traveling Bookshelf"

Before this project, if a researcher wanted to study how Korean changed over time, they had to visit 19 different libraries, ask for permission, and hope the books were even digitized. It was like trying to build a puzzle where the pieces were scattered across the world in locked boxes.

The team collected all these pieces and put them in one giant, open box.

  • The Collection: It includes texts in Classical Chinese (the "Latin" of East Asia), Idu (a tricky system where Chinese characters were used to write Korean grammar), Mixed Script (Hanja + Hangul), and pure Hangul.
  • The License: They made sure the "box" is open for everyone to use for research, provided they don't sell it for profit.

2. What the Data Revealed (The "Detective Work")

Once they had this massive dataset, they ran it through computers to see how the language evolved. Here are their three biggest discoveries:

  • The "Idu" Sunset:

    • The Analogy: Imagine a specific type of slang that was super popular in the 1800s but died out quickly.
    • The Finding: They found that a writing system called Idu (Chinese characters repurposed to write Korean grammar) peaked in the 1860s. Then it crashed. Why? Because the government passed new laws in the 1890s forcing people to use the native alphabet instead. The data shows this wasn't a slow fade; it was a sudden switch.
  • The "Script Switch" Speed Run:

    • The Analogy: Think of it like a country switching from driving on the left side of the road to the right. You might expect it to take decades of confusion, but in Korea, it happened fast.
    • The Finding: For hundreds of years, almost everything was written in Chinese characters. But starting around 1890, the switch to the Korean alphabet (Hangul) happened incredibly fast. By the 1980s, over 93% of the text was pure Hangul. The "mixed" era was a brief, intense transition period.
  • The "North vs. South" Vocabulary Gap:

    • The Analogy: Imagine two cousins who grew up in different countries. They speak the same language, but one cousin calls a "truck" a "lorry" and uses different words for "government." If you try to translate a letter from one to the other, your dictionary might get confused.
    • The Finding: Because North and South Korea have been separated for decades, their vocabularies have drifted apart. The researchers found that modern AI tools, trained on South Korean text, hit unrecognized (out-of-vocabulary) words 51 times more often when reading North Korean news. One reason is that North Korea uses its own spellings for foreign names (for example, calling Germany "Doichwillandeu" (도이췰란드) where South Korea uses "Dogil" (독일)).
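
For readers who like to see the counting behind findings like these, here is a minimal sketch in Python. It is not the authors' pipeline. The function names (classify_char, script_shares, oov_rate), the Unicode ranges chosen, and the tiny example vocabulary are all illustrative assumptions, and the paper's actual tools and metrics may differ.

```python
from collections import Counter


def classify_char(ch: str) -> str:
    """Label one character as 'hangul', 'hanja', or 'other' by Unicode range.

    Assumption: only the most common blocks are checked, which is enough
    for a rough estimate but misses rarer extension blocks.
    """
    cp = ord(ch)
    # Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    # CJK Unified Ideographs, Extension A, CJK Compatibility Ideographs
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF:
        return "hanja"
    return "other"


def script_shares(text: str) -> dict[str, float]:
    """Return the share of Hangul vs. Hanja among the scripted characters."""
    counts = Counter(classify_char(ch) for ch in text if not ch.isspace())
    scripted = counts["hangul"] + counts["hanja"]
    if scripted == 0:
        return {"hangul": 0.0, "hanja": 0.0}
    return {
        "hangul": counts["hangul"] / scripted,
        "hanja": counts["hanja"] / scripted,
    }


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Fraction of tokens that do not appear in the given vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    # A mixed-script phrase: two Hanja characters, three Hangul syllables.
    print(script_shares("國語와 한글"))

    # Hypothetical two-word "South Korean" vocabulary, for illustration only.
    south_vocab = {"독일", "정부"}
    print(oov_rate(["독일", "정부"], south_vocab))
    print(oov_rate(["도이췰란드", "정부"], south_vocab))
```

Running script_shares over every document and averaging by decade would trace a curve like the "script switch" described above. Comparing oov_rate on South Korean and North Korean text against the same vocabulary gives a ratio in the spirit of the "51 times" figure, though real tokenizers split words into subword pieces, so this word-level version is only an approximation.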

3. Why This Matters

Think of current AI models (like the ones powering chatbots) as students who only went to school in the last 20 years. They are great at reading modern text but fail when they encounter an old diary or a document with a mix of scripts.

This new corpus is like giving those students a history textbook.

  • For Historians: It allows them to study language changes with math and precision, rather than just guessing.
  • For AI Developers: It provides the "training data" needed to teach computers how to read ancient Korean, mixed scripts, and even North Korean dialects. This could help preserve culture and make AI smarter about the full history of the language, not just the modern version.

The Catch (Limitations)

The authors are honest about the flaws.

  • Survivorship Bias: The library is heavy on modern books because more of them survived and were digitized. The ancient books are scarcer, so the "time machine" is a bit blurry for the very distant past.
  • No Spoken Words: This is a library of written text. It doesn't include recordings of how people actually spoke, so we can't study the "sound" of history, only the "look" of it.

In a nutshell: This paper presents a massive, open-access digital archive that lets us finally see the entire story of the Korean language, from ancient times to today, helping both humans and computers understand how the language has changed, split, and survived.