SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

Imagine the legal system as a massive, ancient library. In this library, the books (laws) are written in a very specific, formal, and sometimes tricky language. For a long time, this library in Sri Lanka had a problem: the books were written in Sinhala, but most of the "robots" (AI and computers) that try to read and understand them only speak English or other major languages. Furthermore, many of these Sinhala books were just dusty, scanned pictures of paper, not actual text that a computer could read.

This paper introduces SinhaLegal, a project that built a brand new, well-organized digital library specifically for these Sinhala laws.

Here is the story of how the researchers did it, explained simply:

1. The Problem: A Wall of Scanned Pictures

Think of the original laws as thousands of old, scanned photocopies. If you tried to read them with a computer, the computer would just see a picture, not words. It's like trying to teach a dog to read a photograph of a book; the dog can see the ink, but it can't understand the story.

Also, legal language is like a secret code. It uses complex sentences and special words that everyday people (and standard AI) don't use. If you ask a general AI to read a law, it might get confused because the "code" is too different from normal conversation.

2. The Solution: Building the "SinhaLegal" Library

The researchers decided to build a clean, digital version of this library. They didn't just copy the pictures; they turned them into real, searchable text.

The Collection: They gathered 1,206 legal documents (called "Acts" and "Bills") dating from 1981 to 2014. Imagine gathering 1,206 thick textbooks.
The Magic Scanner (OCR): They used an optical character recognition (OCR) service, Google Document AI, to look at the scanned pages and work out what the letters were. It's like a very smart robot reading a printed page one letter at a time.
The "Human Polish": The robot isn't perfect. Sometimes it mistakes a letter for a number or misses a word. The researchers (who are native Sinhala speakers) acted like editors. They went through every single document, fixing typos, removing page numbers that got stuck in the middle of sentences, and deleting watermarks. They made sure the text flowed smoothly, like a well-edited novel.
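The cleanup pass described above can be pictured with a small Python sketch. This is only an illustration, not the researchers' actual pipeline: the function name `clean_ocr_page`, both regular expressions, and the "DRAFT COPY" watermark text are assumptions made up for the example, and the sample uses English stand-in text rather than Sinhala.

```python
import re

# Hypothetical patterns; the real cleanup rules are not given in this summary.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")            # a line holding only a number
WATERMARK = re.compile(r"^\s*DRAFT COPY\s*$", re.I)     # made-up watermark text


def clean_ocr_page(raw: str) -> str:
    """Drop page-number and watermark lines, then rejoin the remaining text."""
    kept = []
    for line in raw.splitlines():
        if PAGE_NUMBER.match(line) or WATERMARK.match(line):
            continue  # noise picked up by the OCR step
        stripped = line.strip()
        if stripped:
            kept.append(stripped)
    # Join with single spaces so a sentence split by a page break reads as one.
    return " ".join(kept)


sample = "The Minister may make\n  12  \nDRAFT COPY\nregulations under this Act."
print(clean_ocr_page(sample))
```

Rules like these only catch the easy cases. A stray page number that lands in the middle of a line, or a misread letter, still needs a human editor, which is why the manual pass mattered.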

3. What's Inside the Library?

Once the library was built, they didn't just leave it there; they analyzed it to see what was inside.

The Vocabulary: They found that legal Sinhala is very repetitive. It's like a song with a chorus that repeats over and over. Words like "Act," "Parliament," and "Minister" appear constantly. This makes the language very structured but also very specific.
The "Hidden Gems" (Entities): They taught the computer to spot important things, like Dates, Names of Officials, Court Names, and Money amounts. It's like giving the computer a highlighter pen to mark the most important parts of the story.
The Themes: They used a technique called "Topic Modeling" to ask the computer, "What are these books mostly about?" The computer said: "Oh, mostly about Laws, Courts, Elections, and Money."
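Two of the checks above, counting repeated words and highlighting entities, can be sketched in a few lines of Python. Treat this as a toy under stated assumptions: the sentences are invented English stand-ins for Sinhala text, the regular expressions are hypothetical substitutes for whatever entity recognizer the researchers used, and `find_entities` is a name chosen for this example. Topic modeling is not shown.

```python
import re
from collections import Counter

# Invented English stand-ins for Sinhala sentences; not taken from the corpus.
docs = [
    "This Act may be cited as the Example Act of 12 March 1995.",
    "The Minister may impose a fine not exceeding Rs. 5,000 under this Act.",
    "Parliament passed the Act and the Minister signed it on 3 June 2001.",
]

# 1) Vocabulary: split on whitespace (a simplification), trim punctuation, count.
tokens = [w.strip(".,;:()\"'").lower() for d in docs for w in d.split()]
word_counts = Counter(t for t in tokens if t)
print(word_counts.most_common(3))

# 2) Entities: hypothetical regex "highlighters" for dates and money amounts.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
ENTITY_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
    "MONEY": re.compile(r"\bRs\. ?[\d,]+"),
}


def find_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in ENTITY_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), label, m.group()))
    return [(label, span) for _, label, span in sorted(hits)]


for d in docs:
    print(find_entities(d))
```

Even in this tiny sample, "act" and "the" dominate the counts, which is the "chorus" effect described above. Names of officials and courts are much harder to catch with fixed patterns, so a real system would likely need a trained model for those.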

4. The "Stress Test": Can AI Understand This?

The researchers wanted to see if modern AI (the "smart robots" of today) could actually understand these laws. They fed the new library into several famous AI models (like Llama, Mistral, and Falcon).

The Result: The models handled the legal text surprisingly well. In fact, they found it easier to predict than normal, everyday Sinhala text.
Why? Because legal text is so repetitive and follows strict rules, it's like a pattern the AI can easily guess. It's like playing a game where the rules never change, so you can predict the next move easily.
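One common way to put a number on "easier to predict" is perplexity: the lower it is, the less surprised the model is by each next word. This summary does not say which metric was used, so treating it as perplexity is an assumption. The sketch below also swaps the large models for a toy word-count model with add-one smoothing, scored on its own training text, purely to show why repetition lowers the score. The function name and both word lists are made up for the example.

```python
import math
from collections import Counter


def perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)  # add-one smoothing denominator
    log_prob = 0.0
    for tok in test_tokens:
        log_prob += math.log((counts[tok] + 1) / total)
    # Perplexity is exp of the average negative log-probability per token.
    return math.exp(-log_prob / len(test_tokens))


# Invented stand-ins: a repetitive "legal" text and a varied "everyday" text.
legal = "the act the minister the act the court the act the minister".split()
everyday = "we went to the market and bought fresh fish for dinner today".split()

legal_ppl = perplexity(legal, legal)
everyday_ppl = perplexity(everyday, everyday)
print(f"legal: {legal_ppl:.2f}  everyday: {everyday_ppl:.2f}")
```

Both texts have 12 words, but the repetitive one scores a much lower perplexity because a handful of words cover most of it. With a real language model, the per-word probabilities would come from the model instead of from word counts.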

5. Why Does This Matter?

Before this project, if you wanted to use AI to analyze Sri Lankan laws, you had nothing to work with. It was like trying to bake a cake without flour.

SinhaLegal provides the "flour." Now, researchers can:

Build tools that condense long laws into short summaries.
Create chatbots that can answer legal questions in Sinhala.
Analyze how laws changed over the three decades the corpus covers (1981 to 2014).

The Bottom Line

This paper is about bridging the gap between old, dusty legal documents and the future of AI. By cleaning up the text and organizing it, the authors have given the world a powerful new tool to understand and work with the laws of Sri Lanka, ensuring that technology can finally "speak" the language of the law.

