CER-HV: A Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

This paper introduces CER-HV, a human-in-the-loop framework that identifies and cleans label errors in Arabic-script handwritten text recognition datasets. Applying it reveals significant data-quality issues in popular benchmarks and improves recognition performance across multiple languages.

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

Published 2026-02-25

Imagine you are trying to teach a robot to read handwritten letters from history. You want it to read Arabic, Persian, Urdu, and other languages written in the Arabic script. You give the robot a massive stack of handwritten notes and say, "Here, learn from these!"

But there's a problem. The stack of notes you gave the robot is messy. Some pages are upside down. Some have stamps or signatures drawn right over the text. Some lines are cut off in the middle. And worst of all, some of the notes have been transcribed (typed out) with the wrong words.

If you let the robot study this messy stack, it will get confused. It might think that upside-down text is normal, or that a stamp is part of a word. It will learn the mistakes instead of the language.

This paper is about a new method called CER-HV to fix this mess before the robot starts its final exam.

The Problem: The "Garbage In, Garbage Out" Trap

For a long time, researchers thought the reason robots were bad at reading Arabic handwriting was that the language itself is too hard. Arabic letters change shape depending on where they are in a word, and most of them connect to their neighbors like a snake.

The authors of this paper said, "Wait a minute. Maybe the language isn't the problem. Maybe the textbooks we are giving the robots are full of errors."

They looked at six popular datasets (collections of handwritten text used for training) and found they were indeed full of hidden errors:

  • Transcription Errors: The typed text didn't match the handwriting.
  • Segmentation Errors: Two different lines of text were glued together into one image.
  • Orientation Errors: The text was rotated 90 or 180 degrees.
  • Script Mismatch: The text was actually in a different script (such as Latin letters) but labeled as Arabic.

The Solution: The "Smart Librarian" (CER-HV)

The authors built a framework called CER-HV (Character Error Rate-based Ranking with Human Verification). Think of it as a Smart Librarian who helps clean the library before the students (the AI models) start studying.

Here is how the Smart Librarian works in two steps:

Step 1: The Robot's "Stumble Test" (Automated Scoring)

First, the authors train a basic robot (a CRNN, a convolutional recurrent neural network) on the messy data. They don't expect it to be perfect yet. They just want to see where it stumbles.

  • The Analogy: Imagine you give a student a practice test. If they get a question wrong, it could be because the question is too hard, OR it could be because the answer key is wrong.
  • The Trick: The authors realized that if a robot makes a huge mistake on a specific line of text, it's a strong signal that something is wrong with that line. They calculate a "Stumble Score" for every single line: the Character Error Rate (CER), the fraction of characters the robot gets wrong relative to the label.
  • The Filter: They sort the entire library by this score. The lines where the robot stumbled the most go to the top of the pile. These are the "suspects."
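The scoring-and-sorting step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: CER is just character-level edit distance divided by label length, and the sample lines below are made up for demonstration.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution/match
            prev = cur
    return dp[n] / max(m, 1)

# Rank lines by CER so the worst mismatches float to the top of the "suspect" pile.
# These (label, prediction) pairs are illustrative, not from the paper's datasets.
lines = [
    ("line_001", "the quick fox", "the quick fox"),  # perfect match -> CER 0
    ("line_002", "hello world", "hxllo world"),      # one substitution
    ("line_003", "good morning", "gxxd mxrnxng"),    # badly wrong -> top suspect
]
suspects = sorted(lines, key=lambda x: cer(x[1], x[2]), reverse=True)
```

A line where the model's prediction barely overlaps its label gets a CER near (or above) 1.0 and lands at the top of the sorted list, exactly where the human reviewer will look first.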

Step 2: The Human Detective (Human Verification)

Now, the robot is good at finding suspects, but it's not good at judging why they are suspects. Sometimes the text is just really messy handwriting (a "hard" sample), not a mistake.

  • The Analogy: This is where a human detective steps in. The human only looks at the top 10% of the "stumble pile" (the ones with the highest scores).
  • The Decision: The human looks at the image and the label and asks: "Is the label wrong? Is the image upside down? Is there a stamp on it?"
    • If yes: They fix it or throw it out.
    • If no: They keep it, realizing it was just a very difficult piece of handwriting.
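The review step can be sketched the same way. The helper names (`select_suspects`, `apply_verdict`) and the three verdicts are my own illustration of the keep/fix/discard decision described above, not an interface from the paper:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    label: str
    cer: float  # score from the automated pass

def select_suspects(samples, fraction=0.10):
    """Hand the reviewer only the top `fraction` of samples, ranked by CER."""
    ranked = sorted(samples, key=lambda s: s.cer, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

def apply_verdict(dataset, image_id, verdict, corrected_label=None):
    """verdict: 'keep' (hard but valid), 'fix' (relabel), or 'discard' (remove)."""
    if verdict == "discard":
        return [s for s in dataset if s.image_id != image_id]
    if verdict == "fix":
        return [Sample(s.image_id, corrected_label, s.cer)
                if s.image_id == image_id else s
                for s in dataset]
    return dataset  # 'keep': the line was just difficult handwriting
```

The key design point is that the human never scans the whole dataset: the model's CER ranking shrinks the review queue to a small, high-yield fraction.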

The Results: Cleaning Up the Classroom

When the authors used this "Smart Librarian" to clean the datasets, the results were amazing:

  1. The Robot Got Smarter: When they retrained the robot on the clean data, it got significantly better at reading. On some datasets, the error rate dropped by nearly 2%. In the world of AI, that's a huge jump.
  2. The Baselines Were Wrong: They found that previous "best" scores for these languages were actually based on messy data. Once they cleaned the data, the "new best" scores were actually lower (better) than anyone thought possible.
  3. A Simple Model Won: They also showed that you don't need a super-complex, expensive AI model to get great results. A well-tuned, simpler model (CRNN) performed just as well as the fancy, complex ones once the data was clean.

The Big Takeaway

The main lesson of this paper is simple: Don't blame the student if the textbook is broken.

For years, researchers tried to build bigger, smarter, more complex AI models to solve the "hard problem" of Arabic handwriting. This paper shows that the real problem was the data. By using a simple, two-step process (let the robot find the weird stuff, then have a human check it), they cleaned up the datasets and made the AI much smarter.

It's a reminder that in the age of AI, data quality is just as important as the model itself. You can have the best engine in the world, but if you put muddy fuel in it, the car won't run. CER-HV is just the filter that cleans the fuel.
