Imagine you want to understand how a language changes over time, like watching a river flow from its source to the sea. To do this, you need a collection of old letters, books, and stories written at different times. This is exactly what the researchers behind SiDiaC-v.2.0 have built for the Sinhala language (spoken in Sri Lanka).
Here is the story of their project, explained simply with some creative analogies.
1. The Big Picture: Building a Time Machine
Think of the Sinhala language as a giant, ancient tree. For centuries, it has grown new branches, dropped old leaves, and changed its shape. The researchers wanted to create a "Time Machine" (a digital library) that lets us see exactly how the tree looked in the 5th century compared to the 20th century.
- The Old Version (v.1.0): They previously built a small, rough draft of this time machine. It had about 46 books. It was a great start, but it was like trying to understand a whole forest by looking at just a few saplings. It had some dirt on it, missing pages, and mixed-up words.
- The New Version (v.2.0): This is the upgraded, massive version. They expanded the library from 46 books to 185 books, containing over 240,000 words. It covers a huge timeline: the books were published between 1800 and 1955, but the texts themselves were written much earlier, some as far back as the 5th century!
2. The Cleaning Process: Washing the Dust Off
Imagine you found a box of old, dusty family photos. Some are torn, some are written in a different language, and some are stuck together. To make them useful, you have to clean them up. The researchers did this with their digital books:
- The "Code-Mixing" Problem: Some old books were like a potluck dinner where the main dish was Sinhala, but someone had accidentally served a side of Pali, Sanskrit, or English. The researchers acted like strict chefs, carefully removing those foreign ingredients so the "Sinhala dish" remained pure.
- The "Commentary" Confusion: Some books were like a movie with a director's commentary track running over the whole film. The original story was from 500 years ago, but the commentary was from 200 years later. The researchers had to figure out which part was the "movie" and which was the "commentary" to date the text correctly.
- The "Poetry" Puzzle: Sinhala poetry is special. Poets often break words apart to make them rhyme, like snapping a Lego brick in half to fit it into a tight space. The researchers invented a special digital "glue" (a tag called <psi>) to mark where these breaks happened. This way, computers can understand the word is whole, even if it looks broken in the text.
- The "Two-Column" Mess: Old books often had text in two columns (left and right). When scanned by computers, the text would get jumbled, reading the left column, then the right, then back to the left. The researchers re-arranged this into a single, straight line, like unrolling a scroll so it reads naturally from top to bottom.
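To make the "Two-Column" fix a little more concrete, here is a minimal sketch of one way such re-ordering could be done. It is not the researchers' actual tool. It assumes the scanning software reports a position for every line of text, and the names `Fragment` and `linearize_two_columns` are made up for this illustration.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One scanned line of text (hypothetical structure for this sketch)."""
    x: float   # left edge of the line on the page
    y: float   # top edge of the line (smaller means higher on the page)
    text: str


def linearize_two_columns(fragments, page_width):
    """Return the text in reading order: whole left column, then whole right column."""
    midpoint = page_width / 2
    # Assumption: anything starting left of the page centre belongs to the left column.
    left = [f for f in fragments if f.x < midpoint]
    right = [f for f in fragments if f.x >= midpoint]
    # Within each column, read from top to bottom.
    left.sort(key=lambda f: f.y)
    right.sort(key=lambda f: f.y)
    return [f.text for f in left + right]


# A jumbled scan that zig-zags between the columns (toy placeholder text).
jumbled = [
    Fragment(10, 100, "L1"),
    Fragment(310, 100, "R1"),
    Fragment(10, 120, "L2"),
    Fragment(310, 120, "R2"),
]
ordered = linearize_two_columns(jumbled, page_width=600)
print(ordered)
```

Real pages are messier (headings that span both columns, crooked scans), which is one reason a task like this can still need a human eye.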
3. The "Low-Resource" Challenge
In the world of technology, some languages, such as English, are VIPs with plenty of tools, dictionaries, and experts. Sinhala is a "Low-Resource" language. Imagine trying to build a house with only a hammer and a few nails, while everyone else has a full construction crew.
Because there weren't enough smart computer programs to automatically fix errors or tag words, the researchers had to do a lot of manual work. They acted as human detectives, reading through thousands of pages to fix typos and organize the data. This makes SiDiaC-v.2.0 a rare and precious treasure for the Sinhala language.
4. What Did They Find? (The Treasure Map)
Once the library was clean, they started looking for patterns. They used a technique called "Bag of Words" (imagine dumping all the words from a book into a bag, ignoring their order, and counting how often each one turns up).
They looked at two specific words that have multiple meanings, like chameleons changing colors:
- "Sathara" (සතර): This word can mean the number four, but it can also mean thief or skill.
- The Discovery: In older books, when people said "Sathara," they were mostly talking about the number four and religious wisdom (like the "four hells" in Buddhism). In the 19th century, the "thief" meaning briefly appeared, but the religious meaning always dominated.
- "Maha" (මහ): This word can mean great/sacred, powerful, or big.
- The Discovery: In the 13th and 14th centuries, people used it to describe sacred things (like monks). By the 20th century, the meaning shifted. People started using it more to describe power (kings and strength) or just size (big mountains).
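For readers who like to see the idea in code, here is a minimal sketch of a "Bag of Words" count around one target word. It is an illustration only, not the researchers' actual analysis: the function name `context_bag`, the window size, and the toy English stand-in sentences are all assumptions made for this example.

```python
from collections import Counter


def context_bag(tokens, target, window=2):
    """Count the words found within `window` positions of each use of `target`."""
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                bag[tokens[j]] += 1
    return bag


# Toy stand-in texts (invented for this sketch, not taken from the corpus).
texts_by_period = {
    "older": ["noble", "truths", "TARGET", "noble", "path"],
    "later": ["the", "king", "TARGET", "army"],
}

bags = {
    period: context_bag(tokens, "TARGET")
    for period, tokens in texts_by_period.items()
}
for period, bag in bags.items():
    print(period, bag.most_common(3))
```

Comparing which neighbours are most common in each period gives a rough hint about which meaning of the word was in play at the time.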
5. Why Does This Matter?
Think of language as a living museum. If we don't preserve the old exhibits, we lose the history of how our ancestors thought, felt, and spoke.
- For Computers: This library teaches AI how to understand Sinhala better, not just as it is used today, but as it was written 500 years ago.
- For Historians: It helps them see how culture changed. For example, they noticed that religious and poetic books were the most common, showing how deeply Buddhism and art influenced Sri Lankan history.
- For the Future: It ensures that the "river" of the Sinhala language doesn't dry up. It gives future generations a clear map to understand their own roots.
In short: SiDiaC-v.2.0 is a massive, meticulously cleaned-up digital time capsule. It takes the messy, dusty, and complex history of Sinhala literature and organizes it into a clear, usable format, allowing both humans and computers to explore the evolution of the language like never before.