CER-HV: A Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

This paper introduces CER-HV, a human-in-the-loop framework that identifies and cleans label errors in Arabic-script handwritten text recognition datasets. Applying it reveals significant data-quality issues in popular benchmarks and improves recognition performance across multiple languages.

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

Published 2026-02-25

Imagine you are trying to teach a robot to read handwritten letters from history. You want it to read Arabic, Persian, Urdu, and other languages written in the Arabic script. You give the robot a massive stack of handwritten notes and say, "Here, learn from these!"

But there's a problem. The stack of notes you gave the robot is messy. Some pages are upside down. Some have stamps or signatures drawn right over the text. Some lines are cut off in the middle. And worst of all, some of the notes have been transcribed (typed out) with the wrong words.

If you let the robot study this messy stack, it will get confused. It might think that upside-down text is normal, or that a stamp is part of a word. It will learn the mistakes instead of the language.

This paper is about a new method called CER-HV to fix this mess before the robot starts its final exam.

The Problem: The "Garbage In, Garbage Out" Trap

For a long time, researchers thought the reason robots were bad at reading Arabic handwriting was that the language itself is too hard. Arabic letters change shape depending on where they are in a word, and most of them connect to their neighbors like a snake.

The authors of this paper said, "Wait a minute. Maybe the language isn't the problem. Maybe the textbooks we are giving the robots are full of errors."

They looked at six popular datasets (collections of handwritten text used for training) and found they were indeed full of hidden errors:

  • Transcription Errors: The typed text didn't match the handwriting.
  • Segmentation Errors: Two different lines of text were glued together into one image.
  • Orientation Errors: The text was rotated 90 or 180 degrees.
  • Script Mismatch: The text was actually in a different script (such as Latin letters) but labeled as Arabic.

The Solution: The "Smart Librarian" (CER-HV)

The authors built a framework called CER-HV (Character Error Rate-based Ranking with Human Verification). Think of it as a Smart Librarian who helps clean the library before the students (the AI models) start studying.

Here is how the Smart Librarian works in two steps:

Step 1: The Robot's "Stumble Test" (Automated Scoring)

First, the authors train a basic robot (a CRNN, a convolutional recurrent neural network) on the messy data. They don't expect it to be perfect yet. They just want to see where it stumbles.

  • The Analogy: Imagine you give a student a practice test. If they get a question wrong, it could be because the question is too hard, OR it could be because the answer key is wrong.
  • The Trick: The authors realized that if a robot makes a huge mistake on a specific line of text, it's a strong signal that something is wrong with that line. They calculate a "Stumble Score" for every single line: the Character Error Rate (CER), the fraction of characters the robot gets wrong relative to the label.
  • The Filter: They sort the entire library by this score. The lines where the robot stumbled the most go to the top of the pile. These are the "suspects."
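The scoring-and-sorting step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: CER is just character-level edit distance divided by label length, and the sample lines below are made up for demonstration.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution/match
            prev = cur
    return dp[n] / max(m, 1)

# Rank lines by CER so the worst mismatches float to the top of the "suspect" pile.
# These (label, prediction) pairs are illustrative, not from the paper's datasets.
lines = [
    ("line_001", "the quick fox", "the quick fox"),  # perfect match -> CER 0
    ("line_002", "hello world", "hxllo world"),      # one substitution
    ("line_003", "good morning", "gxxd mxrnxng"),    # badly wrong -> top suspect
]
suspects = sorted(lines, key=lambda x: cer(x[1], x[2]), reverse=True)
```

A line where the model's prediction barely overlaps its label gets a CER near (or above) 1.0 and lands at the top of the sorted list, exactly where the human reviewer will look first.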

Step 2: The Human Detective (Human Verification)

Now, the robot is good at finding suspects, but it's not good at judging why they are suspects. Sometimes the text is just really messy handwriting (a "hard" sample), not a mistake.

  • The Analogy: This is where a human detective steps in. The human only looks at the top 10% of the "stumble pile" (the ones with the highest scores).
  • The Decision: The human looks at the image and the label and asks: "Is the label wrong? Is the image upside down? Is there a stamp on it?"
    • If yes: They fix it or throw it out.
    • If no: They keep it, realizing it was just a very difficult piece of handwriting.
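The review step can be sketched the same way. The helper names (`select_suspects`, `apply_verdict`) and the three verdicts are my own illustration of the keep/fix/discard decision described above, not an interface from the paper:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    label: str
    cer: float  # score from the automated pass

def select_suspects(samples, fraction=0.10):
    """Hand the reviewer only the top `fraction` of samples, ranked by CER."""
    ranked = sorted(samples, key=lambda s: s.cer, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

def apply_verdict(dataset, image_id, verdict, corrected_label=None):
    """verdict: 'keep' (hard but valid), 'fix' (relabel), or 'discard' (remove)."""
    if verdict == "discard":
        return [s for s in dataset if s.image_id != image_id]
    if verdict == "fix":
        return [Sample(s.image_id, corrected_label, s.cer)
                if s.image_id == image_id else s
                for s in dataset]
    return dataset  # 'keep': the line was just difficult handwriting
```

The key design point is that the human never scans the whole dataset: the model's CER ranking shrinks the review queue to a small, high-yield fraction.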

The Results: Cleaning Up the Classroom

When the authors used this "Smart Librarian" to clean the datasets, the results were amazing:

  1. The Robot Got Smarter: When they retrained the robot on the clean data, it got significantly better at reading. On some datasets, the error rate dropped by nearly 2%. In the world of AI, that's a huge jump.
  2. The Baselines Were Wrong: They found that previous "best" scores for these languages were actually based on messy data. Once they cleaned the data, the "new best" scores were actually lower (better) than anyone thought possible.
  3. A Simple Model Won: They also showed that you don't need a super-complex, expensive AI model to get great results. A well-tuned, simpler model (CRNN) performed just as well as the fancy, complex ones once the data was clean.

The Big Takeaway

The main lesson of this paper is simple: Don't blame the student if the textbook is broken.

For years, researchers tried to build bigger, smarter, more complex AI models to solve the "hard problem" of Arabic handwriting. This paper shows that the real problem was the data. By using a simple, two-step process (let the robot find the weird stuff, then have a human check it), they cleaned up the datasets and made the AI much smarter.

It's a reminder that in the age of AI, data quality is just as important as the model itself. You can have the best engine in the world, but if you put muddy fuel in it, the car won't run. CER-HV is just the filter that cleans the fuel.
