Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment

Imagine you are teaching a child to speak, or perhaps learning a new language yourself. You need a teacher who can listen to every single sound you make, catch the tiny mistakes, and tell you exactly how to fix them. In the real world, that teacher is a Speech-Language Pathologist (SLP)—a highly trained expert. But there's a problem: there aren't enough of these experts to help everyone, and they can't be everywhere at once.

Enter Harf-Speech, a new computer program designed to be that tireless, super-accurate digital teacher for Arabic speakers.

Here is how it works, broken down into simple concepts:

1. The Problem: The "One-Size-Fits-All" Trap

Think of existing speech apps (like the ones built by big tech companies) as mass-produced suits. They are made to fit everyone, but they don't fit anyone perfectly. They might work okay for English, but Arabic is a unique language with very specific sounds (like deep throat sounds and short vowels) that generic suits just don't accommodate. Furthermore, these "suits" are often "black boxes"—you put your voice in, and a score comes out, but you don't know why you got that score.

2. The Solution: A Custom-Tailored Suit

The researchers built Harf-Speech like a bespoke tailor for Arabic. Instead of using a generic suit, they created a system specifically designed to understand the unique "fabric" of the Arabic language.

Here is the step-by-step process, using a cooking analogy:

The Recipe (Reference): The computer has a perfect "recipe" for how a word should sound. It knows the exact ingredients (phonemes/sounds) needed.
The Tasting (Listening): You speak the word into the microphone.
The Chef's Critique (The AI): The system doesn't just listen to the whole sentence; it acts like a master chef tasting every single ingredient. It breaks your speech down into tiny sound bites (phonemes).
The Comparison: It compares your "dish" (your speech) against the perfect "recipe." Did you forget a pinch of salt (a missing sound)? Did you add too much pepper (an extra sound)? Did you use the wrong spice entirely (a wrong sound)?

3. The "Brain" Behind the System

To make this work, the team didn't just grab a random AI. They took three different types of AI "brains" (called ASR models) and trained them specifically on Arabic sounds.

Imagine taking a general knowledge student and giving them a crash course specifically on Arabic pronunciation.
They tested these trained students against "zero-shot" models (general-purpose AI used as-is, with no extra training on Arabic sounds). The trained students (specifically one called OmniASR) were far superior, making fewer than 9 mistakes out of 100 sounds, compared with about 15 out of 100 for a general-purpose model used as-is.

4. The "Human" Check (Clinical Validation)

This is the most important part. To prove their digital tailor was good, they didn't just trust the computer. They brought in three real-life expert speech therapists (the "Master Chefs" of the real world).

These experts listened to 40 varied speech recordings and gave each one a score from 0 to 5.
Then, they compared the experts' scores with the computer's scores.

The Result? The computer's scores tracked the human experts' scores closely, with a correlation of about 0.79 (where 1.0 would mean perfect agreement), and they landed within one point of the experts' score about 77% of the time.
To put that in perspective: when two human experts grade the same speech, their scores correlate at roughly 0.86 to 0.93. The computer came reasonably close to that level and clearly outperformed the existing commercial tool, whose scores correlated with the experts' at only about 0.63.

5. Why This Matters

Scalability: One human expert can only help a few people a day. Harf-Speech can help thousands simultaneously, 24/7.
Transparency: Unlike the "black box" apps, Harf-Speech tells you exactly which sound was wrong. It's like getting a report card that says, "You missed the 'R' sound in the middle of the word," rather than just a vague "You got a C."
Accessibility: It brings high-quality speech therapy tools to people who might not be able to afford or find a specialist nearby.

In a Nutshell

Harf-Speech is a smart, specialized digital coach that listens to Arabic speech, breaks it down to the tiniest sound level, and grades it with accuracy approaching that of a human expert. It shows that by training AI specifically for a language's unique needs, we can create tools that are not just "good enough," but reliable enough to support clinical work on helping people speak better.

1. Problem Statement

Automated pronunciation assessment is critical for scalable speech therapy and language learning, yet validated tools for Arabic are scarce. Existing solutions face several limitations:

Lack of Clinical Validation: Proprietary systems (e.g., Microsoft Azure) operate as "black boxes," are not localized for Arabic phonological characteristics, and lack validation against expert Speech-Language Pathologists (SLPs).
Phonological Complexity: Modern Standard Arabic (MSA) presents unique challenges, including a rich consonantal inventory, emphatic/pharyngeal phonemes, and the functional importance of short vowels and diacritics, requiring high sensitivity at the phoneme level.
Scalability vs. Accuracy: Reliance on human specialists limits scalability, while current automated systems often fail to provide interpretable, clinically grounded feedback at the phoneme level.

2. Methodology: Harf-Speech Framework

Harf-Speech is a modular, open-source framework designed to score Arabic pronunciation at the phoneme level on a clinical scale (0–5). The architecture consists of four main stages:

A. Reference Phoneme Generation

An MSA Phonetizer converts reference text into a canonical phoneme sequence.
The output is normalized into the "Harf" phoneme alphabet by removing positional suffixes, silence markers, and resolving geminates (doubled consonants).
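The normalization step can be pictured with a minimal sketch. The token conventions used here are assumptions for illustration only (positional suffixes such as `_B`, silence markers `sil` and `sp`, geminates written as a doubled symbol, and the rule of collapsing a geminate to its base phoneme); the source does not specify the phonetizer's output format or how geminates are resolved.

```python
import re

# Hypothetical conventions, not taken from the source:
#   - positional suffixes look like "_B", "_I", "_E", "_S"
#   - silence markers are "sil" and "sp"
#   - a geminate is written as a doubled symbol, e.g. "tt"
SILENCE = {"sil", "sp"}
POSITION_SUFFIX = re.compile(r"_(B|I|E|S)$")

def normalize(phonemes):
    """Map raw phonetizer tokens to a simplified phoneme sequence."""
    out = []
    for tok in phonemes:
        if tok in SILENCE:
            continue                          # drop silence markers
        tok = POSITION_SUFFIX.sub("", tok)    # strip the positional suffix
        half = len(tok) // 2
        if len(tok) >= 2 and len(tok) % 2 == 0 and tok[:half] == tok[half:]:
            # Assumed rule: collapse a doubled symbol to its base phoneme.
            # A real inventory would need to exempt long vowels written
            # the same way.
            tok = tok[:half]
        out.append(tok)
    return out

raw = ["sil", "k_B", "a_I", "tt_I", "a_I", "b_E", "sil"]
print(normalize(raw))
```

With these assumed conventions, the example reduces to the five base phonemes `k a t a b`.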

B. Speech-to-Phoneme Prediction

The system converts the participant's audio directly into phoneme labels.
Model Strategy: Instead of relying on zero-shot multimodal models, the authors fine-tuned three state-of-the-art ASR architectures on Arabic phoneme data:
1. Wav2Vec2-LV-60-CV
2. Qwen3-ASR-1.7B
3. OmniASR-CTC-1B-v2 (Selected as the backbone due to superior performance).
Training data included native speech, synthetic mispronunciations (generated via TTS with confusion matrices), and real mispronunciations from human speakers.
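One way to picture confusion-driven synthetic mispronunciations is to substitute phonemes in the reference sequence according to a table of likely confusions before the sequence is passed to a TTS engine. The sketch below covers only that substitution step. The confusion table, its weights, the ASCII phoneme labels, and the function name `corrupt` are all hypothetical; the source states only that TTS with confusion matrices was used.

```python
import random

# Hypothetical confusion table: phoneme -> [(confusable phoneme, weight)].
# "S" stands in for an emphatic s and "H" for a pharyngeal fricative.
CONFUSIONS = {
    "s": [("S", 0.7), ("z", 0.3)],
    "H": [("h", 1.0)],
}

def corrupt(phonemes, rate, rng):
    """Replace each confusable phoneme with probability `rate`."""
    out = []
    for p in phonemes:
        options = CONFUSIONS.get(p)
        if options and rng.random() < rate:
            candidates, weights = zip(*options)
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(p)                     # leave the phoneme untouched
    return out

rng = random.Random(0)                        # seeded for repeatability
reference = ["s", "a", "H", "a", "b"]
print(corrupt(reference, rate=0.5, rng=rng))
```

Keeping the corrupted sequence alongside the original gives a labeled pair: the audio synthesized from the corrupted sequence, and the phoneme-level record of what was changed.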

C. Segmentation and Alignment

Word-Level Segmentation: A Large Language Model (LLM) conditions on text and phoneme sequences to align reference and predicted phonemes into word groups.
Phoneme Alignment: The Levenshtein distance algorithm is used to map substitutions, insertions, and deletions between the ground truth and the predicted sequence.
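The phoneme alignment step can be sketched with a textbook Levenshtein implementation plus a backtrace that labels each position as a match, substitution, deletion, or insertion. This is a generic version of the algorithm, not the authors' code; the function name `align` and the example phoneme labels are illustrative.

```python
def align(ref, hyp):
    """Return the edit operations that turn `ref` into `hyp`."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            kind = "match" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((kind, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

ref = ["k", "a", "t", "a", "b"]
hyp = ["k", "a", "d", "a"]
for op in align(ref, hyp):
    print(op)
```

For this pair the alignment reports one substitution (`t` heard as `d`) and one deletion (the final `b`), which is the kind of per-phoneme detail the scoring stage consumes.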

D. Scoring Algorithm

The system computes a final score (0–5) using a blended metric:

LCS Ratio: Longest Common Subsequence ratio (preserves order, tolerates isolated insertions).
Pronunciation Score: A weighted combination of:
- Accuracy: Based on substitutions, deletions, and insertions.
- Completeness: Based on deletions only.
Final Formula: $\text{Harf-Speech Score} = w_{lcs} \cdot \text{LCS Ratio} + w_{pron} \cdot \text{PronScore}$
- Default weights: $w_{lcs} = 0.6$, $w_{pron} = 0.4$.
- The result is linearly mapped to a 0–5 clinical scale.
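The blended score can be sketched as follows. Only the weights 0.6 and 0.4 and the 0–5 mapping come from the text above. Everything else is an assumption made so the example runs: the LCS ratio is taken relative to the reference length, accuracy is 1 minus the total error rate, completeness is 1 minus the deletion rate, the two are mixed equally through a hypothetical `w_acc` parameter, and the linear mapping is a plain multiplication by 5.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def harf_score(ref, hyp, subs, dels, ins, w_lcs=0.6, w_pron=0.4, w_acc=0.5):
    """Blend LCS ratio and pronunciation score, then map to a 0-5 scale."""
    n = len(ref)
    if n == 0:
        raise ValueError("reference sequence must not be empty")
    lcs_ratio = lcs_length(ref, hyp) / n               # assumed normalization
    accuracy = max(0.0, 1 - (subs + dels + ins) / n)   # assumed definition
    completeness = 1 - dels / n                        # assumed definition
    pron_score = w_acc * accuracy + (1 - w_acc) * completeness
    blended = w_lcs * lcs_ratio + w_pron * pron_score  # formula from the text
    return 5 * blended                                 # assumed linear map to 0-5

ref = ["k", "a", "t", "a", "b"]
hyp = ["k", "a", "d", "a"]
print(round(harf_score(ref, hyp, subs=1, dels=1, ins=0), 2))
```

Under these assumptions the example works out to about 3.2: the LCS ratio is 3/5, accuracy is 0.6, completeness is 0.8, so the blend is 0.6 × 0.6 + 0.4 × 0.7 = 0.64 before scaling.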

3. Key Contributions

Clinically Validated Framework: The first complete, open-source framework for phoneme-level Arabic assessment explicitly optimized for Arabic phonology and validated against certified SLPs.
Model Benchmarking: Demonstrated that fine-tuned ASR models significantly outperform zero-shot multimodal models for Arabic phoneme recognition.
Expert Alignment: Established a rigorous evaluation protocol comparing automated scores against three independent, certified SLPs, providing a reproducible blueprint for clinical AI assessment.

4. Experimental Results

Phoneme Recognition Performance

Best Model: OmniASR-CTC-1B-v2 achieved the lowest Phoneme Error Rate (PER) of 8.92% with an extremely low Real-Time Factor (RTF) of 0.004.
Comparison:
- Zero-shot models (e.g., Gemini-3-pro) had a much higher PER (15.07%) and prohibitive inference times (RTF 10.75).
- Wav2Vec2 achieved 13.58% PER.
- This confirms that task-specific fine-tuning is essential for Arabic.
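For readers unfamiliar with the two metrics, their usual definitions are given here. The source does not restate them, so the notation is an assumption: $S$, $D$, and $I$ are the counts of substituted, deleted, and inserted phonemes, $N$ is the number of reference phonemes, $T_{proc}$ is the processing time, and $T_{audio}$ is the audio duration.

$$\text{PER} = \frac{S + D + I}{N}, \qquad \text{RTF} = \frac{T_{proc}}{T_{audio}}$$

Under these definitions, an RTF of 0.004 corresponds to about 0.04 s of compute for a 10 s recording, while an RTF of 10.75 corresponds to about 107.5 s for the same recording.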

Clinical Alignment & Validation

Three certified SLPs scored 40 diverse utterances on a 0–5 scale.

Inter-Rater Reliability: High agreement among human experts (Pearson Correlation Coefficient [PCC] 0.858–0.927; ICC $\ge$ 0.846).
System vs. Experts:
- Harf-Speech achieved a PCC of 0.791 and ICC(2,1) of 0.659 against the mean expert score.
- It achieved a 76.9% agreement rate within $\pm 1$ point of the expert score.
Comparison with Proprietary Baseline (Azure):
- Harf-Speech outperformed Azure Pronunciation Assessment significantly.
- Improvement: +0.156 in PCC, +0.066 in ICC, and a 16% reduction in Mean Absolute Error (MAE) (0.79 vs. 0.94).
- Azure showed wider scatter and lower correlation with individual SLPs.
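Three of the agreement measures reported above (PCC, MAE, and the share of scores within ±1 point) are simple to compute from paired scores, as the sketch below shows. The score lists are made-up illustration values, not data from the study, and the function names are hypothetical. ICC is left out because it requires a variance-components model beyond a short example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_absolute_error(x, y):
    """Average absolute difference between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def within_tolerance(x, y, tol=1.0):
    """Fraction of pairs whose scores differ by at most `tol` points."""
    return sum(abs(a - b) <= tol for a, b in zip(x, y)) / len(x)

# Made-up scores for illustration only; not data from the study.
expert_mean = [5.0, 4.0, 2.0, 3.0, 1.0]
system = [4.5, 4.0, 3.5, 2.0, 1.0]

print("PCC:", round(pearson(expert_mean, system), 3))
print("MAE:", round(mean_absolute_error(expert_mean, system), 2))
print("within +/-1:", within_tolerance(expert_mean, system))
```

The three numbers answer different questions: PCC measures whether the system ranks utterances as the experts do, MAE measures how far off the scores are on average, and the ±1 rate measures how often the error stays within a tolerance that a clinician might accept.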

5. Significance and Impact

Clinical Relevance: Harf-Speech's agreement with the experts (PCC 0.791, ICC 0.659) approaches, but does not yet match, the agreement among the experts themselves (PCC 0.858–0.927, ICC $\ge$ 0.846), which makes it a viable supporting tool for clinical assessment and progress tracking.
Scalability: By automating the phoneme-level assessment with high accuracy, it reduces the burden on human specialists, enabling scalable speech therapy solutions for the 400+ million Arabic speakers.
Open Science: Unlike proprietary "one-size-fits-all" systems, Harf-Speech is built on open components, allowing for transparency, reproducibility, and adaptation to other languages or dialects.
Future Direction: The work demonstrates that localized, clinically grounded modeling can outperform generic commercial systems, setting a new standard for automated speech assessment in under-resourced languages.