DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

This paper introduces DohaScript, a large-scale, multi-writer dataset of continuous handwritten Hindi text derived from 531 contributors transcribing identical traditional couplets, designed to address the scarcity of high-quality benchmarks for Devanagari script analysis and support diverse tasks like recognition, writer identification, and style modeling.

Kunwar Arpit Singh, Ankush Prakash, Haroon R Lone

Published 2026-02-23

Imagine you are trying to teach a robot to read handwritten notes. For languages like English, you have millions of practice books filled with different people's handwriting. But for Hindi, which uses the Devanagari script, the robot is like a student trying to learn to read with only a few torn scraps of paper.

This paper introduces DohaScript, a massive new "library" designed to fix that problem. Here is the story of how the authors built it and why it matters, explained simply.

1. The Problem: The "Missing Library"

Hindi is spoken by hundreds of millions of people, yet there is almost no high-quality, large-scale handwriting data for computers to learn from.

  • The Old Way: Existing datasets were like a box of loose LEGO bricks. They had individual letters or short words, but they didn't show how words connect in a sentence.
  • The Hindi Challenge: In Devanagari, letters aren't just sitting next to each other; they are glued together by a "roof" (called a shirorekha) that runs across the top of the word. It's like a train where the cars are fused together. If you try to teach a computer using isolated bricks, it won't understand how to read the train.
  • The Result: Computers struggle to read continuous Hindi handwriting because they lack a big, diverse dataset to practice on.
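To make the "fused train" idea concrete, here is a minimal sketch using a toy word image. It relies on a common heuristic, not on anything the paper describes: the shirorekha is usually the row with the most ink, so counting ink per row (a horizontal projection) finds it. The function names and the tiny 0/1 grid are assumptions made up for illustration.

```python
def find_shirorekha_row(word):
    """Return the index of the row with the most ink pixels.

    `word` is a list of equal-length rows of 0/1 values (1 = ink).
    Heuristic only: assumes the headline is the densest row.
    """
    ink_per_row = [sum(row) for row in word]
    return max(range(len(word)), key=lambda r: ink_per_row[r])


def count_character_blobs(word, skip_row=None):
    """Count runs of ink-bearing columns, optionally ignoring one row."""
    width = len(word[0])
    has_ink = [
        any(word[r][c] for r in range(len(word)) if r != skip_row)
        for c in range(width)
    ]
    blobs, inside = 0, False
    for col in has_ink:
        if col and not inside:
            blobs += 1
        inside = col
    return blobs


# Toy 6x9 "word" (hypothetical data): two character bodies under one headline.
toy_word = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # headline spans the whole word
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

headline_row = find_shirorekha_row(toy_word)
fused = count_character_blobs(toy_word)                    # headline included
separated = count_character_blobs(toy_word, headline_row)  # headline ignored
print(headline_row, fused, separated)
```

With the headline in place, the whole word looks like one connected blob; ignoring that row reveals the two characters underneath. Real handwriting is far less tidy (slanted or broken headlines, touching strokes), which is part of why continuous-text data matters.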

2. The Solution: The "Doha" Experiment

The researchers created DohaScript. Think of this as a massive "copycat" game, but with a twist.

  • The Script: They asked 531 different people to write the exact same six poems (called dohas, which are traditional rhyming couplets).
  • Why Poems? These poems are famous, taught in schools, and contain every single letter and sound in the Hindi alphabet. It's like asking everyone to write the same alphabet soup, but in a sentence.
  • The Magic: Because everyone wrote the same words, the researchers can now compare the handwriting styles directly. It's like having 531 different artists paint the exact same landscape. You can see exactly how Person A's brushstrokes differ from Person B's, without the confusion of different subject matter.

3. The "Quality Control" Filter

Collecting handwriting from 531 people is great, but what if some of the page images are blurry, smudged, or photographed in the dark?

  • The Robot Inspector: The team built a smart AI "inspector" (a CNN classifier) to grade every page.
  • The Grading System: They didn't just throw away bad photos. Instead, they sorted them into buckets:
    • The "Gold Standard" Bucket: Crisp, clear, high-quality pages perfect for teaching a computer to read.
    • The "Real World" Bucket: Pages that are a bit blurry, shaky, or messy.
  • Why keep the messy ones? Because real life is messy! By keeping the "bad" samples, the researchers can teach computers how to read handwriting even when the lighting is poor or the pen is running out of ink. It's like training a driver not just on a perfect race track, but also on rainy, pothole-filled streets.
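The sorting-into-buckets idea can be sketched in a few lines. Note the swap: the paper uses a CNN classifier, while this example substitutes a much simpler stand-in, the variance of a Laplacian filter, which is a widely used sharpness heuristic. The threshold, bucket names, and toy images are all assumptions chosen for illustration.

```python
from statistics import pvariance


def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian over interior pixels.

    `img` is a list of equal-length rows of grayscale values (0-255).
    Sharp edges give large responses; smooth blur gives values near zero.
    """
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
        - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def assign_bucket(img, threshold=1000.0):
    """Route a page to a bucket; nothing is discarded.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    if laplacian_variance(img) >= threshold:
        return "gold_standard"
    return "real_world"


# Toy images (hypothetical): hard black/white stripes vs. a smooth gradient.
sharp = [[255 if (x // 2) % 2 == 0 else 0 for x in range(8)] for _ in range(8)]
blurry = [[128 + 4 * x for x in range(8)] for _ in range(8)]

print(assign_bucket(sharp), assign_bucket(blurry))
```

The design point mirrors the text: the function returns a label for every page rather than filtering pages out, so the lower-quality samples stay available for robustness training.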

4. The "Tangled Thread" Challenge

Even if the handwriting is clear, reading it is hard because of how the lines are arranged.

  • The Analogy: Imagine trying to read a book where the lines of text are sometimes perfectly straight, sometimes wavy, and sometimes the letters from one line dip down and touch the line below.
  • The Annotation: The researchers analyzed every page and labeled them by difficulty:
    • Easy: Neat lines, easy to separate.
    • Medium: A little wobbly.
    • Complex: A tangled mess where lines overlap and bleed into each other.
  • The Goal: This helps researchers build systems that can untangle the "thread" of text, even when the writer didn't use a ruler.
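One rough way to picture these difficulty labels is to look for blank rows between lines of text. The sketch below is an assumption-laden illustration, not the authors' annotation procedure: it finds bands of rows that contain ink, then labels a page by whether the expected number of lines can be separated and how much white space sits between them. The function names, the `min_gap` value, and the toy pages are hypothetical.

```python
def text_bands(page):
    """Return (start, end) row ranges that contain ink, end exclusive."""
    bands, start = [], None
    for y, row in enumerate(page):
        if any(row) and start is None:
            start = y
        elif not any(row) and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(page)))
    return bands


def layout_difficulty(page, expected_lines, min_gap=2):
    """Label a page 'easy', 'medium', or 'complex' (heuristic thresholds)."""
    bands = text_bands(page)
    if len(bands) < expected_lines:
        # Fewer bands than lines: some lines touch and have merged.
        return "complex"
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(bands, bands[1:])]
    if all(gap >= min_gap for gap in gaps):
        return "easy"
    return "medium"


INK, BLANK = [1] * 6, [0] * 6
DESCENDER = [0, 0, 1, 0, 0, 0]  # one stroke dipping into the next line

neat = [INK, INK, BLANK, BLANK, INK, INK]
wobbly = [INK, INK, BLANK, INK, INK, BLANK]
tangled = [INK, INK, DESCENDER, INK, INK, BLANK]

labels = [layout_difficulty(p, expected_lines=2) for p in (neat, wobbly, tangled)]
print(labels)
```

Because every contributor copied the same text, the expected number of lines is known in advance, which is what lets a check like this notice when two lines have merged into one band.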

5. Why This Matters (The "So What?")

This dataset is a game-changer for several reasons:

  • For Reading (OCR): It helps computers finally read handwritten Hindi forms, medical prescriptions, and historical documents accurately.
  • For Identity: Since everyone wrote the same words, a computer can learn to recognize who wrote a page from the writer's unique style alone (like a digital fingerprint).
  • For Art: It allows AI to generate new handwriting that looks like a specific person's style.
  • For Fairness: It includes people from all over India, different ages, and genders, ensuring the technology works for everyone, not just a select few.

The Bottom Line

DohaScript is like giving the world of AI a massive, organized, and diverse training camp for reading Hindi. It moves us from trying to learn with a few scattered puzzle pieces to having the whole picture, complete with instructions on how to handle the messy, blurry, and difficult parts of real-life handwriting.

The dataset is now publicly available, meaning any researcher can download it and start building better tools to understand and preserve the written word of hundreds of millions of people.
