DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

This paper introduces DialectalArabicMMLU, a benchmark of 15,000 manually curated multiple-choice questions spanning five major Arabic dialects and 32 domains, designed to evaluate the dialectal reasoning capabilities of current large language models and to reveal the significant performance gaps that remain.

Malik H. Altakrori, Nizar Habash, Abed Alhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji

Published 2026-03-24

Imagine you have a brilliant student who has spent years studying a very formal, textbook version of a language—let's call it "Standard Arabic." This student can ace any exam written in that formal style. They can solve complex math problems, explain history, and debate philosophy, but only if the questions are written in that perfect, polished textbook language.

Now, imagine you ask this student a simple question: "Hey, what's up?" but you say it in a casual, street-smart slang used by people in Cairo, or maybe a different slang used in Dubai. Suddenly, the student freezes. They might understand the words, but they miss the vibe, the nuance, and the specific cultural context. They might give a robotic answer or get the question completely wrong.

This is exactly the problem the paper "DialectalArabicMMLU" is trying to solve.

The Big Problem: The "Textbook" vs. The "Street"

Arabic is unique because of something called diglossia. This is a fancy word for a situation where a language has two very different forms living side-by-side:

  1. Modern Standard Arabic (MSA): The formal, written language used in news, schools, and official documents.
  2. Dialects: The spoken languages people actually use at home, on the street, and on social media (like Egyptian, Syrian, Saudi, Moroccan, etc.).

Most AI models (large language models, or LLMs) have been trained primarily on the "Textbook" version (MSA). They are great at formal tasks but often stumble when people speak naturally. It's like teaching someone to drive only on a race track, then expecting them to navigate a busy, chaotic city market without crashing.

The Solution: A New "Driver's Test"

The authors created a new benchmark (a test) called DialectalArabicMMLU. Think of this as a massive, multi-city driving test designed specifically to see how well AI can handle real-world Arabic.

Here's how they built it:

  • The Source: They took a famous English test called MMLU, which has thousands of questions on topics like history, science, law, and math.
  • The Translation: Instead of just using a computer to translate these questions (which often sounds robotic), they hired native speakers from five major regions (Syria, Egypt, UAE, Saudi Arabia, and Morocco) to translate the questions into their local dialects.
  • The Result: A massive dataset of 15,000 dialectal questions (plus the parallel English and MSA versions). It's like having the same exam rewritten in five different local accents.
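To make the evaluation concrete, here is a minimal sketch (not the authors' code) of how accuracy on such a multiple-choice benchmark could be broken down per dialect. The row format, field names, and the trivial `baseline` "model" are all hypothetical placeholders.

```python
from collections import defaultdict

def accuracy_by_dialect(rows, pick_answer):
    """Per-dialect accuracy of a model's multiple-choice answers."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["dialect"]] += 1
        hits[row["dialect"]] += (pick_answer(row) == row["answer"])
    return {d: hits[d] / totals[d] for d in totals}

# Toy rows: the same question rendered in two dialect versions.
rows = [
    {"dialect": "Egyptian", "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"dialect": "Gulf",     "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"},
]

# A trivial "model" that always picks the first option.
baseline = lambda row: row["choices"][0]

print(accuracy_by_dialect(rows, baseline))  # → {'Egyptian': 0.0, 'Gulf': 0.0}
```

Because every question exists in multiple dialect versions, comparing these per-dialect numbers against the MSA and English scores is what exposes the gap the paper describes.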

What They Discovered: The "Reality Check"

The researchers tested 19 different AI models (ranging from small to medium-sized) on this new test. The results were eye-opening:

  1. The "Textbook" Gap: Almost every AI model performed significantly worse on the dialects than on the formal MSA or English. Even the smartest models struggled to understand the "street" version of the language.
  2. The "Identity Crisis": They asked the AI: "Do you know what dialect you are reading?" Surprisingly, most models were terrible at identifying the dialect. They often couldn't tell the difference between Egyptian and Saudi Arabic, even though they were failing the questions in those dialects.
  3. The "Hint" Didn't Help: They tried a trick: they told the AI, "This question is in Egyptian Arabic, so please answer in Egyptian style." They hoped this would help the AI focus. Instead, it often made the AI perform worse. It's like telling a nervous driver, "Remember, you're in a city!" and watching them get even more flustered.
  4. Translation isn't a Magic Wand: They tried translating the dialect questions into English first, then letting the AI answer. This helped some models, but translating the dialect into formal Arabic (MSA) first didn't help much at all. It seems the AI loses the "soul" of the dialect when it tries to convert it to the formal version.
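The "hint" experiment above amounts to prepending a dialect label to an otherwise identical prompt. The authors' exact prompt wording is not reproduced here; this function and its phrasing are illustrative assumptions, showing only the mechanical difference between the two setups.

```python
def build_prompt(question, choices, dialect=None):
    """Zero-shot MCQ prompt, optionally prefixed with a dialect hint."""
    hint = f"The following question is written in {dialect} Arabic.\n" if dialect else ""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{hint}Question: {question}\n{options}\nAnswer with A, B, C, or D."

plain  = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
hinted = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"], dialect="Egyptian")
```

The surprising finding is that the `hinted` variant, despite carrying strictly more information, often lowered accuracy rather than raising it.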

Why This Matters

This paper is a wake-up call for the AI world.

  • Inclusivity: If we only test AI on formal Arabic, we are ignoring the reality of how millions of people actually speak. We are building tools that work for the "textbook" but fail the "human."
  • Future Training: The authors argue that we need to stop treating dialects as a secondary afterthought. To build truly helpful AI for the Arab world, we need to train these models on the messy, beautiful, diverse reality of how people actually talk.

The Takeaway

Think of this paper as a report card for AI. It says: "You're an A+ student in the classroom (MSA), but you're failing the real world (Dialects)."

The authors have handed the AI community a new, fairer test that covers the whole spectrum of Arabic life. Their goal is to push developers to build models that don't just speak the language of the government, but the language of the people.
