Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

This paper introduces Nwāchā Munā, the first manually transcribed Devanagari speech corpus for the endangered Nepal Bhasha language, and demonstrates that proximal cross-lingual transfer from Nepali achieves automatic speech recognition performance comparable to large multilingual models at a fraction of the computational cost.

Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal

Published Tue, 10 Ma

Here is an explanation of the paper "Nwāchā Munā" using simple language and creative analogies.

🗣️ The Big Problem: A Language Left in the Dark

Imagine the digital world as a giant, bustling library. Most languages (like English, Spanish, or Hindi) have huge, well-lit sections with millions of books, audiobooks, and helpful librarians. But Nepal Bhasha (also known as Newari), a language spoken by nearly a million people in Nepal, is stuck in a dusty, dark corner with almost no books.

Because there are so few recorded conversations (audio data) in Nepal Bhasha, computers can't learn to understand it. This is like trying to teach a dog to speak French when you only have one sentence written on a napkin. The result? The language is "digitally marginalized," meaning people who speak it can't easily use voice assistants, dictation software, or AI tools.

🎤 The Solution: "Nwāchā Munā" (The Voice Collection)

The researchers decided to fix this by creating a new library of voices. They call their project Nwāchā Munā (which roughly translates to "Voice Collection" or "Gathering of Voices").

  • What they did: They went out, recorded 5.39 hours of native speakers talking naturally, and carefully wrote down exactly what they said (transcription) using the Devanagari script (the same script used for Hindi and Nepali).
  • The Analogy: Think of this as gathering 18 different people to read stories aloud into a high-quality microphone. They made sure to record men, women, and people of different ages to capture the full "flavor" of the language. This created the first-ever "training manual" for computers to learn Newari.

🧠 The Big Question: Can a "Neighbor" Teach a "Stranger"?

Usually, to teach a computer a rare language, you need a massive, super-smart AI model that has read every language in the world (like the famous Whisper model). These models are like giant, heavy trucks—they need a lot of fuel (computing power) and a massive highway (data) to run.

The researchers asked a clever question: "Do we need a giant truck, or can a small, local bicycle get us there?"

  • The Neighbor: They looked at Nepali, a language spoken right next to Newari. They share the same alphabet (script) and sound very similar, like cousins.
  • The Experiment: Instead of training a giant model from scratch, they took a computer model that was already an expert in Nepali and gave it a "crash course" in Newari using their new 5-hour recording.
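The intuition behind this "crash course" (warm-starting from a neighbor language instead of training from scratch) can be illustrated with a toy NumPy example. This is a sketch of the transfer-learning idea only, not the paper's actual ASR models: a one-parameter model "pretrained" on one task reaches a closely related task in far fewer update steps than a model starting from zero.

```python
import numpy as np

def train(w0, x, y, lr=0.1, steps=5):
    """Gradient descent on mean squared error for the one-parameter model y = w * x."""
    w = w0
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
        w -= lr * grad
    return w

x = np.linspace(1.0, 2.0, 20)

# "Nepali": plenty of training on a task with slope 2.0
w_pretrained = train(0.0, x, 2.0 * x, steps=200)

# "Newari": a closely related task (slope 2.2), but only 5 update steps --
# the stand-in for a small fine-tuning corpus
w_warm    = train(w_pretrained, x, 2.2 * x, steps=5)  # fine-tuned from the neighbor
w_scratch = train(0.0,          x, 2.2 * x, steps=5)  # trained from nothing
```

After the same five "fine-tuning" steps, the warm-started model sits much closer to the new target slope than the from-scratch model, which is the whole bet the researchers made with Nepali-to-Newari transfer.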

🏆 The Results: The Small Bicycle Wins!

The results were surprising and exciting:

  1. The "Zero-Shot" Fail: When they first tried to use the Nepali expert to understand Newari without any training, it was terrible (like a chef trying to cook a dish they've never seen). It got about 52% of the words wrong.
  2. The "Fine-Tuning" Success: Once they gave the Nepali model a little bit of Newari data to study, it improved massively.
  3. The Data Boost: By using a technique called Data Augmentation (which is like taking the 5 hours of audio and creating "fake" variations—speeding it up, slowing it down, adding background noise to make the model tougher), they made the system even smarter.
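The "52% of the words wrong" figure above is the word error rate (WER), the standard ASR metric: the edit distance between the reference and the hypothesis word sequences, divided by the number of reference words. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion against four reference words.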

The Winner: The small, Nepali-trained model (with data augmentation) performed just as well as the massive, global Whisper model, but it used way less computing power.

  • The Analogy: It's like realizing you don't need a 100-foot tall telescope to see the moon; a small, well-focused pair of binoculars works just as well if you know exactly where to look.
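The two augmentations named above (speed changes and added noise) can be sketched in a few lines of NumPy. The specific factors and signal-to-noise level here are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def speed_perturb(audio, factor):
    """Resample by linear interpolation: factor > 1 speeds up (shorter signal)."""
    n_out = int(len(audio) / factor)
    idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(idx, np.arange(len(audio)), audio)

def add_noise(audio, snr_db, rng):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), len(audio))
    return audio + noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone

# Each original clip yields several training variants "for free"
augmented = [speed_perturb(clip, f) for f in (0.9, 1.0, 1.1)]
noisy = add_noise(clip, snr_db=20, rng=rng)
```

Each recorded utterance thus produces several distinct training examples, which is how 5 hours of audio can be stretched into a more robust training set.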

🔍 What Went Wrong? (The Glitches)

Even with the success, the computer still makes mistakes, mostly because Newari is a "sticky" language.

  • The Glue Analogy: In English, words are like separate Lego bricks. In Newari, words are like glue—they stick together, and tiny marks (diacritics) change the meaning entirely.
  • The Error: The computer often gets the main words right but messes up the tiny "glue" marks (like nasal sounds or breathy stops). It's like hearing a sentence perfectly but missing the punctuation that changes "Let's eat, Grandma!" to "Let's eat Grandma!"
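Concretely, in Unicode these "glue" marks are combining characters: two Devanagari strings can differ by a single, visually tiny codepoint, yet fail to match as words, so dropping one mark costs a whole substitution error. A small illustration (the example syllable is ours, not drawn from the paper's data):

```python
import unicodedata

plain = "\u0915\u093e"         # का  -- ka + vowel sign aa
nasal = "\u0915\u093e\u0901"   # काँ -- the same syllable plus a nasalisation mark

# The strings differ by exactly one nonspacing combining mark...
extra = nasal[-1]
print(unicodedata.name(extra))      # DEVANAGARI SIGN CANDRABINDU
print(unicodedata.category(extra))  # Mn = nonspacing mark

# ...but at the word level they simply don't match, so an ASR system
# that misses the mark is charged a full word error.
assert plain != nasal
```

This is why the paper's error analysis points at diacritics: the model hears the syllable correctly but loses the one codepoint that carries the distinction.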

🌍 Why This Matters

This paper is a blueprint for the future. It proves that for endangered languages, you don't need to wait for a super-computer to save you.

  • Community Power: By using a "neighbor" language (Nepali) and a small, curated dataset, communities can build their own voice technology.
  • Preservation: This helps save the language. If a language can speak to a computer, it can survive in the digital age.

In a nutshell: The researchers built a small, high-quality voice library for a rare language and proved that a smart, local model (trained on a neighbor language) can do the job just as well as a giant, expensive global model. It's a victory for efficiency and for keeping cultural heritage alive in the AI era.