Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

This paper introduces Nwāchā Munā, the first manually transcribed Devanagari speech corpus for the endangered Nepal Bhasha language, and demonstrates that proximal cross-lingual transfer from Nepali achieves automatic speech recognition performance comparable to large multilingual models at a fraction of the computational cost.

Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal

Published Tue, 10 Ma

Here is an explanation of the paper "Nwāchā Munā" using simple language and creative analogies.

🗣️ The Big Problem: A Language Left in the Dark

Imagine the digital world as a giant, bustling library. Most languages (like English, Spanish, or Hindi) have huge, well-lit sections with millions of books, audiobooks, and helpful librarians. But Nepal Bhasha (also known as Newari), a language spoken by nearly a million people in Nepal, is stuck in a dusty, dark corner with almost no books.

Because there are so few recorded conversations (audio data) in Nepal Bhasha, computers can't learn to understand it. This is like trying to teach a dog to speak French when you only have one sentence written on a napkin. The result? The language is "digitally marginalized," meaning people who speak it can't easily use voice assistants, dictation software, or AI tools.

🎤 The Solution: "Nwāchā Munā" (The Voice Collection)

The researchers decided to fix this by creating a new library of voices. They call their project Nwāchā Munā (which roughly translates to "Voice Collection" or "Gathering of Voices").

  • What they did: They went out, recorded 5.39 hours of native speakers talking naturally, and carefully wrote down exactly what they said (transcription) using the Devanagari script (the same script used for Hindi and Nepali).
  • The Analogy: Think of this as gathering 18 different people to read stories aloud into a high-quality microphone. They made sure to record men, women, and people of different ages to capture the full "flavor" of the language. This created the first-ever "training manual" for computers to learn Newari.

🧠 The Big Question: Can a "Neighbor" Teach a "Stranger"?

Usually, to teach a computer a rare language, you need a massive, super-smart AI model that has read every language in the world (like the famous Whisper model). These models are like giant, heavy trucks—they need a lot of fuel (computing power) and a massive highway (data) to run.

The researchers asked a clever question: "Do we need a giant truck, or can a small, local bicycle get us there?"

  • The Neighbor: They looked at Nepali, a language spoken right next to Newari. They share the same alphabet (script) and sound very similar, like cousins.
  • The Experiment: Instead of training a giant model from scratch, they took a computer model that was already an expert in Nepali and gave it a "crash course" in Newari using their new 5-hour recording.
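The intuition behind this "crash course" (warm-starting from a neighbor language instead of training from scratch) can be illustrated with a toy NumPy example. This is a sketch of the transfer-learning idea only, not the paper's actual ASR models: a one-parameter model "pretrained" on one task reaches a closely related task in far fewer update steps than a model starting from zero.

```python
import numpy as np

def train(w0, x, y, lr=0.1, steps=5):
    """Gradient descent on mean squared error for the one-parameter model y = w * x."""
    w = w0
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
        w -= lr * grad
    return w

x = np.linspace(1.0, 2.0, 20)

# "Nepali": plenty of training on a task with slope 2.0
w_pretrained = train(0.0, x, 2.0 * x, steps=200)

# "Newari": a closely related task (slope 2.2), but only 5 update steps --
# the stand-in for a small fine-tuning corpus
w_warm    = train(w_pretrained, x, 2.2 * x, steps=5)  # fine-tuned from the neighbor
w_scratch = train(0.0,          x, 2.2 * x, steps=5)  # trained from nothing
```

After the same five "fine-tuning" steps, the warm-started model sits much closer to the new target slope than the from-scratch model, which is the whole bet the researchers made with Nepali-to-Newari transfer.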

🏆 The Results: The Small Bicycle Wins!

The results were surprising and exciting:

  1. The "Zero-Shot" Fail: When they first tried to use the Nepali expert to understand Newari without any training, it was terrible (like a chef trying to cook a dish they've never seen). It got about 52% of the words wrong.
  2. The "Fine-Tuning" Success: Once they gave the Nepali model a little bit of Newari data to study, it improved massively.
  3. The Data Boost: By using a technique called Data Augmentation (which is like taking the 5 hours of audio and creating "fake" variations—speeding it up, slowing it down, adding background noise to make the model tougher), they made the system even smarter.
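The "52% of the words wrong" figure above is the word error rate (WER), the standard ASR metric: the edit distance between the reference and the hypothesis word sequences, divided by the number of reference words. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion against four reference words.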

The Winner: The small, Nepali-trained model (with data augmentation) performed just as well as the massive, global Whisper model, but it used way less computing power.

  • The Analogy: It's like realizing you don't need a 100-foot tall telescope to see the moon; a small, well-focused pair of binoculars works just as well if you know exactly where to look.
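The two augmentations named above (speed changes and added noise) can be sketched in a few lines of NumPy. The specific factors and signal-to-noise level here are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def speed_perturb(audio, factor):
    """Resample by linear interpolation: factor > 1 speeds up (shorter signal)."""
    n_out = int(len(audio) / factor)
    idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(idx, np.arange(len(audio)), audio)

def add_noise(audio, snr_db, rng):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), len(audio))
    return audio + noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone

# Each original clip yields several training variants "for free"
augmented = [speed_perturb(clip, f) for f in (0.9, 1.0, 1.1)]
noisy = add_noise(clip, snr_db=20, rng=rng)
```

Each recorded utterance thus produces several distinct training examples, which is how 5 hours of audio can be stretched into a more robust training set.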

🔍 What Went Wrong? (The Glitches)

Even with the success, the computer still makes mistakes, mostly because Newari is a "sticky" language.

  • The Glue Analogy: In English, words are like separate Lego bricks. In Newari, words are like glue—they stick together, and tiny marks (diacritics) change the meaning entirely.
  • The Error: The computer often gets the main words right but messes up the tiny "glue" marks (like nasal sounds or breathy stops). It's like hearing a sentence perfectly but missing the punctuation that changes "Let's eat, Grandma!" to "Let's eat Grandma!"
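Concretely, in Unicode these "glue" marks are combining characters: two Devanagari strings can differ by a single, visually tiny codepoint, yet fail to match as words, so dropping one mark costs a whole substitution error. A small illustration (the example syllable is ours, not drawn from the paper's data):

```python
import unicodedata

plain = "\u0915\u093e"         # का  -- ka + vowel sign aa
nasal = "\u0915\u093e\u0901"   # काँ -- the same syllable plus a nasalisation mark

# The strings differ by exactly one nonspacing combining mark...
extra = nasal[-1]
print(unicodedata.name(extra))      # DEVANAGARI SIGN CANDRABINDU
print(unicodedata.category(extra))  # Mn = nonspacing mark

# ...but at the word level they simply don't match, so an ASR system
# that misses the mark is charged a full word error.
assert plain != nasal
```

This is why the paper's error analysis points at diacritics: the model hears the syllable correctly but loses the one codepoint that carries the distinction.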

🌍 Why This Matters

This paper is a blueprint for the future. It proves that for endangered languages, you don't need to wait for a super-computer to save you.

  • Community Power: By using a "neighbor" language (Nepali) and a small, curated dataset, communities can build their own voice technology.
  • Preservation: This helps save the language. If a language can speak to a computer, it can survive in the digital age.

In a nutshell: The researchers built a small, high-quality voice library for a rare language and proved that a smart, local model (trained on a neighbor language) can do the job just as well as a giant, expensive global model. It's a victory for efficiency and for keeping cultural heritage alive in the AI era.