ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

This paper introduces ASCAT, a high-quality English-Arabic parallel corpus of scientific abstracts spanning five domains. The corpus is rigorously validated by human experts and used to benchmark state-of-the-art machine translation models, addressing the critical lack of resources for evaluating and improving scientific translation into Arabic.

Serry Sibaee, Khloud Al Jallad, Zineb Yousfi, Israa Elsayed Elhosiny, Yousra El-Ghawi, Batool Balah, Omer Nacar

Published 2026-04-03

Imagine you are trying to teach a robot how to translate a complex science textbook from English into Arabic. You give the robot a few simple sentences like "The cat sat on the mat," and it learns quickly. But then you hand it a full scientific abstract about quantum mechanics or artificial intelligence, and suddenly, the robot starts hallucinating, mixing up terms, or losing the meaning entirely.

That is the problem this paper, ASCAT, is trying to solve.

Here is the story of how they built a new tool to fix this, explained with some everyday analogies.

1. The Problem: The "Short Sentence" Trap

For a long time, the tools used to teach computers to translate Arabic were like training wheels. They used datasets full of very short sentences (like "The sky is blue") or simple titles.

  • The Analogy: Imagine trying to teach someone how to drive a Formula 1 car by only letting them practice in a parking lot at 5 mph. When you finally put them on the race track, they crash because they've never seen a real curve or high speed.
  • The Reality: Scientific papers are long, complex, and full of tricky jargon. Existing Arabic translation datasets were too short and simple to prepare AI for the real "race track" of scientific research.

2. The Solution: Building the "ASCAT" Track

The authors created ASCAT (Arabic Scientific Corpus for Advanced Translation). Think of this not as a giant library of books, but as a highly specialized, ultra-precise test track.

  • What is it? It's a collection of 500 full scientific abstracts (summaries of research papers) covering tough topics like Physics, Math, and AI.
  • The Size: It's not huge (only 500 items), but it's incredibly deep. The authors chose quality over quantity. They wanted a "gold standard" test, not just a pile of data.

3. How They Built It: The "Three-Chef" Kitchen

To make sure the translations were perfect, they didn't just ask one person to do the work. They used a "multi-engine" pipeline, which is like having three different master chefs cook the same dish to see who does it best.

  1. Chef 1 (Generative AI): Used a smart AI (Gemini) that is good at understanding context and nuance.
  2. Chef 2 (Transformer Models): Used a specialized AI model (from Hugging Face) that knows the grammar rules well.
  3. Chef 3 (Commercial APIs): Used big, famous translation tools (Google and DeepL) as a baseline.

The Human Taste-Test:
After the three "chefs" cooked up their translations, the real magic happened. Seven human experts (scientists and linguists) sat down to taste-test every single word.

  • They checked if the scientific terms were correct (e.g., did they translate "quantum entanglement" correctly?).
  • They checked if the grammar sounded natural in Arabic.
  • They fixed any mistakes until the dish was perfect.

This created a Human-Validated Reference, which is the "perfect recipe" that future AI models will be judged against.
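The "three-chef" pipeline above can be sketched in a few lines of Python. This is an illustrative sketch only: the engine names, data classes, and the `pick_and_fix` reviewer callback are hypothetical stand-ins, not the authors' actual code (the real pipeline called Gemini, a Hugging Face model, Google Translate, and DeepL).

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    engine: str   # which "chef" produced this translation
    text: str     # the candidate Arabic translation

@dataclass
class AbstractRecord:
    source_en: str                                   # original English abstract
    candidates: list = field(default_factory=list)   # one output per engine
    reference_ar: str = ""                           # expert-approved gold text

def build_record(source_en, engines):
    """Collect one candidate translation per engine for a single abstract."""
    record = AbstractRecord(source_en=source_en)
    for name, translate in engines.items():
        record.candidates.append(Candidate(engine=name, text=translate(source_en)))
    return record

def human_validate(record, pick_and_fix):
    """Experts compare all candidates and produce the final gold reference."""
    record.reference_ar = pick_and_fix(record.source_en, record.candidates)
    return record

# Toy usage with stand-in engines playing the three "chefs":
engines = {
    "generative":  lambda s: f"[gen] {s}",
    "transformer": lambda s: f"[tr] {s}",
    "commercial":  lambda s: f"[api] {s}",
}
rec = build_record("The cat sat on the mat.", engines)
rec = human_validate(rec, lambda src, cands: cands[0].text)  # expert picks & edits
```

The key design point is that the machine outputs are only candidates; the gold reference exists solely because a human reviewer produced it from them.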

4. Why Arabic is Tricky: The "Morphological Maze"

The paper highlights a fascinating quirk of the Arabic language.

  • The Analogy: English is like building with Lego bricks where you add one brick at a time to make a longer word. Arabic is like a magical clay that can stretch and shrink. One single root word can twist and turn into dozens of different forms depending on who is doing the action, when, and how.
  • The Result: Even though the Arabic translations were shorter in word count than the English ones, they had more unique words. It's like the Arabic language packs more information into fewer words, making it a "morphological maze" that AI finds very hard to navigate.
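The "fewer tokens, more unique words" observation can be illustrated with a quick type-token comparison. This is a rough sketch using whitespace tokenization and a transliterated stand-in phrase; real analysis of Arabic morphology requires proper tokenization tooling.

```python
def token_stats(text):
    """Return (token_count, unique_type_count) for whitespace tokens."""
    tokens = text.split()
    return len(tokens), len(set(tokens))

# English repeats function words as separate tokens; Arabic often fuses
# conjunctions and articles into the word itself, so one token can carry
# what English spreads over three.
english = "the model and the data and the results"
arabic_like = "walnamudhaj walbayanat walnataij"  # transliterated stand-in

en_tokens, en_types = token_stats(english)
ar_tokens, ar_types = token_stats(arabic_like)

print(f"EN: {en_tokens} tokens, {en_types} types")
print(f"AR-like: {ar_tokens} tokens, {ar_types} types")
```

Running this shows the Arabic-like string has fewer tokens but a higher type-token ratio, which is exactly the pattern the paper reports for its Arabic references.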

5. The Big Test: Who Wins the Race?

Once they built this perfect test track (ASCAT), they threw three of the world's most advanced AI models at it to see how they performed:

  1. GPT-4o-mini: The little engine that could. It won the race with the highest score.
  2. Gemini-3.0: Came in second. It got the main ideas right but missed some specific details.
  3. Qwen3: The giant model. Surprisingly, it came in last. Even though it's huge, it struggled with the specific style of scientific Arabic.

The Takeaway: The fact that there was a big gap between the winners and the losers proves that ASCAT is a tough, fair test. It can actually tell the difference between a good translator and a bad one.
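Scoring models against the human-validated reference is done with standard MT metrics. As a hedged illustration of how such a metric separates good from bad candidates, here is a minimal character-bigram F1 score, a simplified cousin of chrF; the paper's actual metric suite may differ.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Multiset of character n-grams, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_f1(hypothesis, reference, n=2):
    """F1 over character n-grams: rewards partial lexical overlap,
    which suits morphologically rich languages like Arabic."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # multiset intersection
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A candidate closer to the gold reference scores higher:
reference = "quantum entanglement links particles"
good = "quantum entanglement links the particles"
bad = "the particles are very strange"
```

Character-level overlap is a common choice for Arabic precisely because of the morphological variation described above: two valid surface forms of the same root still share most of their characters even when whole-word matching fails.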

6. Why This Matters

Currently, there is a "language gap" in science. Over 400 million Arabic speakers are cut off from the latest scientific discoveries because the translation tools aren't good enough.

ASCAT is the first step toward closing that gap. It provides a rigorous, expert-verified benchmark. It's not just a dataset; it's a ruler that scientists can use to measure how well their translation tools are actually working.

In short: The authors built a high-quality, expert-checked "exam" for AI translators. They proved that current AI is getting better, but still has a long way to go before it can perfectly translate complex science into Arabic. This paper gives the world the tools to measure that progress.
