Original authors: Attia Nafees ul Haq, Zeyu Zhu, Jingbin Hu, ChunJiang He, Lei Xie

Published 2026-05-19. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine the world of artificial intelligence (AI) as a massive library. For years, this library has been stocked with books in English, Mandarin, and Spanish, but the section dedicated to Urdu, a language spoken by roughly 230 million people, has been nearly empty. It's like trying to teach a robot to speak a language using only a few scattered, dusty pamphlets.

This paper introduces UrduSpeech, a massive new "bookshelf" designed to fix that imbalance. Here is a simple breakdown of what the researchers built and how they did it.

1. The Problem: A Language Left Behind

Urdu poses particular challenges because it is written right to left in a Perso-Arabic script and its speakers often mix English words into their sentences (a bit like a person switching between two dialects while telling a story). Because of these quirks, standard AI tools often get confused: they mistake spoken Urdu for Hindi, which sounds very similar, or lose track when the speaker switches languages. The researchers wanted to build a resource that respects these specific challenges.

2. The Solution: A 156-Hour "Sound Library"

The team created UrduSpeech, a collection of 156 hours of high-quality audio. To put that in perspective, if you listened to it non-stop, it would take you over six days to finish.

They didn't just dump random noise into a folder. They organized this library into three specific "rooms" (subsets):

US-Std: Standard Pakistani Urdu (the formal, "textbook" version).
US-CS: Code-switched Urdu (where speakers naturally mix Urdu and English within a sentence, like saying "mujhe ek coffee chahiye," meaning "I need a coffee").
US-EngPk: English spoken with a Pakistani accent.

3. How They Built It: The "Smart Filter" Pipeline

Gathering this data was like trying to find specific gems in a pile of rocks. They collected 200 hours of audio from the internet (YouTube) and old archives (like 1980s TV shows). To clean it up, they used a three-step process:

Step 1: The Noise Canceller: They used AI tools to strip away background noise (like traffic or wind) and to work out who is speaking when in a conversation, so that each clip can be tied to a speaker.
Step 2: The "Strict Editor" (LLM): They used a powerful AI (Gemini 2.5 Pro) to act as a strict editor. This AI was given special instructions: "Do not translate English words into Urdu script; keep them as they sound," and "Do not confuse Urdu with Hindi." It also checked the audio for 12 different "vibe" tags (paralinguistics), such as the speaker's age, emotion, voice texture (is it raspy or smooth?), and accent.
Step 3: The Human Safety Net: Before the data was finalized, native Urdu speakers listened to samples to make sure the AI didn't make mistakes. They acted as the final quality control inspectors.

4. The "Gold Standard" Benchmark

To prove their library was good, they created a 9-hour "Gold Standard" set. This is a small, perfectly curated collection that humans manually checked and corrected. They used this to test different AI transcription models.

The Result: They found that most existing AI models struggled with Urdu, often getting the words wrong or mixing up the scripts. However, the model they chose (Gemini 2.5 Pro) performed significantly better, acting like a native speaker who understood the nuances of the language.

5. What's Inside the Library?

The final collection contains 71,792 separate audio clips. It's incredibly diverse:

Content: It includes everything from news and dramas to poetry, vlogs, and even rare forms of spoken poetry called Bait-Bazi.
People: It features both men and women (roughly a 60/40 split by clip count) and speakers of all ages, from children to the elderly.
Quality: When humans listened to the audio, they gave it a high score (4.6 out of 5), confirming that the voices are clear and the transcriptions are accurate.

6. Why This Matters

Think of previous Urdu datasets as a small, locked room with a few chairs. UrduSpeech is a vast, open hall with thousands of seats, filled with people of all backgrounds speaking in all the ways they actually speak.

The researchers have made this library free and open for anyone to use. By providing this high-quality, well-organized data, they hope to help AI developers build better tools for Urdu speakers, ensuring that this major language is no longer left out of the digital future.

In short: They built a massive, meticulously organized sound library for Urdu, fixed the mistakes other AI tools made, and showed that with the right human and machine teamwork, even complex, mixed-language speech can be transcribed with high accuracy.

Technical Summary: UrduSpeech

1. Problem Statement

Despite having approximately 230 million speakers, Urdu remains critically under-resourced in the field of speech technology. Existing resources fail to address specific linguistic and acoustic challenges inherent to the language, including:

Script Constraints: The Right-to-Left (RTL) Perso-Arabic script.
Code-Switching: The ubiquity of Urdu-English code-switching (CS).
Acoustic Similarity: The acoustic proximity of Urdu to Hindi, leading to frequent misclassification.
Lack of Specialized Data: A shortage of high-fidelity data for nuanced tasks such as Machine Reading Comprehension, Deepfake detection, and Speech Emotion Recognition.
Resource Gaps: Existing datasets (e.g., ARL Urdu, Common Voice) often suffer from restrictive licensing, high costs, limited speaker diversity, or a lack of paralinguistic metadata.

2. Methodology

The authors developed UrduSpeech, a 156-hour corpus, through a multi-stage, LLM-driven curation pipeline designed to handle "in-the-wild" audio.

Data Collection and Preprocessing

Sources: 200 hours of raw audio were aggregated from YouTube and archival Pakistan Television (PTV) logs spanning four decades (1980s–present).
Preprocessing:
- Source Separation: Transitioned from Spleeter to the Demucs model for efficient vocal isolation.
- Speaker Diarization: Utilized Pyannote 3.1 to separate speakers, followed by manual global alignment to ensure ID consistency.
- Filtering: Segments shorter than 2 seconds, single-speaker clips, and those exceeding 35 seconds were discarded. This process removed 44 hours of residual noise, resulting in a final 156-hour corpus.
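The duration bounds in the filtering step can be sketched as follows. This is a minimal illustration, not the authors' code: the `Segment` class, its field names, and the sample timestamps are hypothetical, and only the 2-second and 35-second limits come from the summary above. The other filtering criteria are not modeled here.

```python
from dataclasses import dataclass

MIN_SECONDS = 2.0    # segments shorter than this are discarded (from the summary)
MAX_SECONDS = 35.0   # segments longer than this are discarded (from the summary)


@dataclass
class Segment:
    """Hypothetical container for one diarized segment."""
    clip_id: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def keep_by_duration(segments):
    """Keep segments whose duration lies within [MIN_SECONDS, MAX_SECONDS]."""
    return [s for s in segments if MIN_SECONDS <= s.duration <= MAX_SECONDS]


# Made-up example segments.
segments = [
    Segment("a", 0.0, 1.2),    # 1.2 s, too short
    Segment("b", 1.2, 9.7),    # 8.5 s, kept
    Segment("c", 9.7, 50.0),   # 40.3 s, too long
]
kept_ids = [s.clip_id for s in keep_by_duration(segments)]
print(kept_ids)

# Hours bookkeeping reported in the summary: 200 raw hours minus 44 removed.
raw_hours, removed_hours = 200, 44
final_hours = raw_hours - removed_hours
print(final_hours)
```

The bounds are treated as inclusive, so a clip of exactly 2 or 35 seconds is kept; the summary does not say how the boundary cases were handled.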

Model Selection and Benchmarking

A 13-hour pilot study was conducted to select the optimal transcription model. Three models were evaluated against native speaker ground truth:

Whisper-large-v3: Failed on code-switched audio, often transliterating English into Urdu script rather than maintaining literal content.
OmniASR-LLM-1B: Produced hallucinations in Arabic/Persian and exhibited word-looping on accented segments.
Gemini 2.5 Pro: Selected as the best-performing model, owing to its semantic awareness and its responsiveness to prompt constraints. It achieved the lowest Word Error Rate (WER), kept output in Urdu script rather than drifting into Hindi (Devanagari), and transcribed code-switched English literally.

Annotation Pipeline

A two-stage prompting strategy using Gemini 2.5 Pro was employed:

Transcription: Prompts enforced strict constraints to prevent Hindi/Devanagari script mixing and mandated literal transcription for code-switching.
Paralinguistic Metadata: A second prompt generated 12-dimensional metadata labels (e.g., pitch, texture, rhythm, age, accent) for each segment.
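The script constraints in the transcription prompt can also be checked after the fact on the returned text. The sketch below is an assumption about how such a check might look, not part of the described pipeline: the function name `script_report` and the two example sentences are illustrative. It relies only on the Unicode block ranges for Devanagari (U+0900 to U+097F) and Arabic (U+0600 to U+06FF, which covers the common Urdu letters).

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")    # Devanagari block (Hindi script)
PERSO_ARABIC = re.compile(r"[\u0600-\u06FF]")  # Arabic block, covers common Urdu letters
LATIN = re.compile(r"[A-Za-z]")                # basic Latin letters (English)


def script_report(transcript: str) -> dict:
    """Report which scripts appear in a transcript."""
    return {
        "has_devanagari": bool(DEVANAGARI.search(transcript)),
        "has_perso_arabic": bool(PERSO_ARABIC.search(transcript)),
        "has_latin": bool(LATIN.search(transcript)),
    }


# Illustrative code-switched sentence: Urdu script with the English word kept literal.
urdu_cs = "مجھے ایک coffee چاہیے"
# The same sentence written in Devanagari, which the prompt is meant to prevent.
hindi_cs = "मुझे एक coffee चाहिए"

print(script_report(urdu_cs))
print(script_report(hindi_cs))
```

A transcript that reports `has_devanagari` as true would violate the stated constraint, while a code-switched transcript is expected to show both Perso-Arabic and Latin characters.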

Quality Control: Segments with model confidence scores below 0.6 were discarded. The final dataset consists of 71,792 diarized clips.
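The confidence cut-off amounts to a simple threshold filter. In this minimal sketch the record layout, the `confidence` key, and the sample values are hypothetical; only the 0.6 threshold comes from the summary.

```python
CONFIDENCE_THRESHOLD = 0.6  # segments scoring below this are discarded (from the summary)

# Made-up annotation records; the field names are assumptions for illustration.
records = [
    {"clip_id": "a", "confidence": 0.92},
    {"clip_id": "b", "confidence": 0.60},
    {"clip_id": "c", "confidence": 0.41},
]


def keep_confident(records, threshold=CONFIDENCE_THRESHOLD):
    """Keep records whose confidence is at or above the threshold."""
    return [r for r in records if r["confidence"] >= threshold]


confident_ids = [r["clip_id"] for r in keep_confident(records)]
print(confident_ids)
```

Because the summary says scores below 0.6 were discarded, a score of exactly 0.6 is kept here.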

Human-Centric Validation

Benchmark Set: A 9-hour subset (US-Benchmark) comprising US-Std, US-CS, and US-EngPk was manually corrected by native annotators to serve as the ground truth.
Assessment: 180 clips were sampled across three complexity levels and evaluated by six native Urdu speakers using a 5-point Likert scale (ITU-T P.800 protocol).
Metrics: Evaluated audio quality, transcription accuracy, demographics, prosody, affect, articulation, and contextual accuracy.
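A Mean Opinion Score of the kind used in this assessment is the arithmetic mean of the Likert ratings, usually reported with a standard deviation. The sketch below uses made-up ratings, not the paper's data, and assumes a population standard deviation; the summary does not state which variant the authors used.

```python
from statistics import mean, pstdev

# Made-up 5-point Likert ratings pooled across raters and clips.
ratings = [5, 5, 4, 5, 3, 4, 5, 5, 2, 5]

mos = mean(ratings)                      # Mean Opinion Score
sigma = pstdev(ratings)                  # population standard deviation (assumption)
share_4_or_5 = sum(r >= 4 for r in ratings) / len(ratings)

print(f"MOS = {mos:.2f}, sigma = {sigma:.2f}, share rated 4 or 5 = {share_4_or_5:.0%}")
```

The same three quantities (mean, spread, and share of ratings at 4 or above) are the ones the paper reports for the full set of ratings.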

3. Key Contributions

UrduSpeech Pipeline: A robust framework capable of filtering raw audio, performing speaker diarization, handling RTL constraints, and differentiating between Hindi and Urdu in code-switched environments.
US-Benchmark Set: A 9-hour, manually verified benchmark set with 12-dimension paralinguistic metadata, establishing a new ground truth for error analysis.
UrduSpeech Corpus: A 156-hour open-source corpus containing:
- 59.2 hours of US-Std (Standard Pakistani Urdu).
- 89.4 hours of US-CS (Code-switched Urdu-English).
- 7.3 hours of US-EngPk (Pakistani-accented English).
- 71,792 utterances with comprehensive paralinguistic labels (emotion, texture, accent).
SOTA Evaluation: An in-depth evaluation of Gemini 2.5 Pro, Whisper-large-v3, and OmniASR-LLM-1B, establishing baselines for high-fidelity transcription in Urdu.
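The subset figures above can be cross-checked with a little arithmetic. The sketch below uses only the hours and utterance count listed in the summary; the per-subset shares and the average clip length are derived here and are not figures reported by the authors.

```python
# Hours per subset, as listed in the summary.
subset_hours = {"US-Std": 59.2, "US-CS": 89.4, "US-EngPk": 7.3}
n_utterances = 71_792

total_hours = sum(subset_hours.values())  # should round to the stated 156 hours

# Share of the corpus held by each subset (derived, not reported).
for name, hours in subset_hours.items():
    print(f"{name}: {hours / total_hours:.1%} of the corpus")

# Average clip length in seconds (derived, not reported).
avg_clip_seconds = total_hours * 3600 / n_utterances
print(f"total = {total_hours:.1f} h, average clip = {avg_clip_seconds:.1f} s")
```

The three subsets sum to just under 156 hours, which matches the headline figure after rounding, and the implied average clip length falls inside the 2 to 35 second window used during filtering.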

4. Results

Transcription Performance: Gemini 2.5 Pro substantially outperformed the other models, achieving a WER of 0.023 (2.3%) without code-switching and 0.028 (2.8%) with code-switching, compared to roughly 0.28–0.53 for Whisper-large-v3 and OmniASR-LLM-1B.
Human Quality Assessment:
- Mean Opinion Score (MOS): The corpus achieved a global MOS of 4.64 ($\sigma = 0.74$).
- Reliability: 92.78% of ratings were 4 or 5. Inter-rater reliability showed a Cohen's $\kappa$ of 0.678 for Set B and 0.545 for Set C.
- Confidence: The curation pipeline demonstrated a 97.6% confidence score based on model outputs and human validation.
Demographics: The corpus has a roughly 60/40 male-to-female split (42,990 male vs. 28,802 female utterances) and includes diverse age groups (Young Adult, Middle Age, Child, Elderly).
Distribution: The data covers 12 categories including news, drama, poetry, vlogs, and rare literary forms like Bait-Bazi.
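For readers unfamiliar with the WER figures above, Word Error Rate is the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below is a standard textbook implementation, not the authors' evaluation script; it assumes whitespace tokenization and applies no text normalization, and the example sentences are romanized placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Romanized placeholder sentences: one substituted word out of four.
reference = "mujhe ek coffee chahiye"
hypothesis = "mujhe ek kofi chahiye"
wer = word_error_rate(reference, hypothesis)
print(wer)
```

On this scale, the reported 0.023 corresponds to roughly two word errors per hundred reference words, against roughly 28 to 53 per hundred for the other two models.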

5. Significance and Claims

The paper positions UrduSpeech as a significant leap toward linguistic inclusivity in global AI. Its primary significance lies in:

Bridging the Digital Divide: Providing accurate linguistic representation for a language with 230 million speakers that has been under-served by multimodal foundation models.
Granular Metadata: Being the first resource to integrate a 12-dimension paralinguistic metadata framework, enabling high-resolution error analysis and research into affective computing and speaker profiling.
Addressing Code-Switching: Specifically tackling the "in-the-wild" gap by providing a large-scale dataset for Urdu-English code-switching and Pakistani-accented English.
Open Science: Unlike many foundational datasets that are restrictively licensed or paid, the corpus and pipeline are open-sourced, aiming to facilitate future research in Urdu and other under-resourced Perso-Arabic script languages.

The authors note two limitations. First, the unique-speaker count is a conservative estimate (1,000+ speakers against 3,000 detected clusters) because diarization may over-segment in-the-wild recordings. Second, some segments retain residual background noise. Future work is directed toward establishing baseline benchmarks for ASR/TTS and implementing forced alignment for word-level precision.

UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations