DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

This paper introduces DialectalArabicMMLU, a benchmark of 15,000 manually curated multiple-choice questions spanning five major Arabic dialects and 32 domains, designed to evaluate the dialectal reasoning capabilities of current large language models and to reveal the significant performance gaps that remain.

Malik H. Altakrori, Nizar Habash, Abed Alhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji

Published 2026-03-24

Imagine you have a brilliant student who has spent years studying a very formal, textbook version of a language—let's call it "Standard Arabic." This student can ace any exam written in that formal style. They can solve complex math problems, explain history, and debate philosophy, but only if the questions are written in that perfect, polished textbook language.

Now, imagine you ask this student a simple question: "Hey, what's up?" but you say it in a casual, street-smart slang used by people in Cairo, or maybe a different slang used in Dubai. Suddenly, the student freezes. They might understand the words, but they miss the vibe, the nuance, and the specific cultural context. They might give a robotic answer or get the question completely wrong.

This is exactly the problem the paper "DialectalArabicMMLU" is trying to solve.

The Big Problem: The "Textbook" vs. The "Street"

Arabic is unique because of something called diglossia. This is a fancy word for a situation where a language has two very different forms living side-by-side:

  1. Modern Standard Arabic (MSA): The formal, written language used in news, schools, and official documents.
  2. Dialects: The spoken languages people actually use at home, on the street, and on social media (like Egyptian, Syrian, Saudi, Moroccan, etc.).

Most AI models (large language models, or LLMs) have been trained primarily on the "Textbook" version (MSA). They are great at formal tasks but often stumble when people speak naturally. It's like teaching someone to drive only on a race track, then expecting them to navigate a busy, chaotic city market without crashing.

The Solution: A New "Driver's Test"

The authors created a new benchmark (a test) called DialectalArabicMMLU. Think of this as a massive, multi-city driving test designed specifically to see how well AI can handle real-world Arabic.

Here's how they built it:

  • The Source: They took a famous English test called MMLU, which has thousands of questions on topics like history, science, law, and math.
  • The Translation: Instead of just using a computer to translate these questions (which often sounds robotic), they hired native speakers from five major regions (Syria, Egypt, UAE, Saudi Arabia, and Morocco) to translate the questions into their local dialects.
  • The Result: A massive dataset of 15,000 dialectal questions (plus the parallel English and MSA versions). It's like having the same exam rewritten in five different local accents.
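To make the evaluation concrete, here is a minimal sketch (not the authors' code) of how accuracy on such a multiple-choice benchmark could be broken down per dialect. The row format, field names, and the trivial `baseline` "model" are all hypothetical placeholders.

```python
from collections import defaultdict

def accuracy_by_dialect(rows, pick_answer):
    """Per-dialect accuracy of a model's multiple-choice answers."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["dialect"]] += 1
        hits[row["dialect"]] += (pick_answer(row) == row["answer"])
    return {d: hits[d] / totals[d] for d in totals}

# Toy rows: the same question rendered in two dialect versions.
rows = [
    {"dialect": "Egyptian", "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"},
    {"dialect": "Gulf",     "question": "...", "choices": ["A", "B", "C", "D"], "answer": "B"},
]

# A trivial "model" that always picks the first option.
baseline = lambda row: row["choices"][0]

print(accuracy_by_dialect(rows, baseline))  # → {'Egyptian': 0.0, 'Gulf': 0.0}
```

Because every question exists in multiple dialect versions, comparing these per-dialect numbers against the MSA and English scores is what exposes the gap the paper describes.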

What They Discovered: The "Reality Check"

The researchers tested 19 different AI models (ranging from small to medium-sized) on this new test. The results were eye-opening:

  1. The "Textbook" Gap: Almost every AI model performed significantly worse on the dialects than on the formal MSA or English. Even the smartest models struggled to understand the "street" version of the language.
  2. The "Identity Crisis": They asked the AI: "Do you know what dialect you are reading?" Surprisingly, most models were terrible at identifying the dialect. They often couldn't tell the difference between Egyptian and Saudi Arabic, even though they were failing the questions in those dialects.
  3. The "Hint" Didn't Help: They tried a trick: they told the AI, "This question is in Egyptian Arabic, so please answer in Egyptian style." They hoped this would help the AI focus. Instead, it often made the AI perform worse. It's like telling a nervous driver, "Remember, you're in a city!" and watching them get even more flustered.
  4. Translation isn't a Magic Wand: They tried translating the dialect questions into English first, then letting the AI answer. This helped some models, but translating the dialect into formal Arabic (MSA) first didn't help much at all. It seems the AI loses the "soul" of the dialect when it tries to convert it to the formal version.
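The "hint" experiment above amounts to prepending a dialect label to an otherwise identical prompt. The authors' exact prompt wording is not reproduced here; this function and its phrasing are illustrative assumptions, showing only the mechanical difference between the two setups.

```python
def build_prompt(question, choices, dialect=None):
    """Zero-shot MCQ prompt, optionally prefixed with a dialect hint."""
    hint = f"The following question is written in {dialect} Arabic.\n" if dialect else ""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{hint}Question: {question}\n{options}\nAnswer with A, B, C, or D."

plain  = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
hinted = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"], dialect="Egyptian")
```

The surprising finding is that the `hinted` variant, despite carrying strictly more information, often lowered accuracy rather than raising it.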

Why This Matters

This paper is a wake-up call for the AI world.

  • Inclusivity: If we only test AI on formal Arabic, we are ignoring the reality of how millions of people actually speak. We are building tools that work for the "textbook" but fail the "human."
  • Future Training: The authors argue that we need to stop treating dialects as a secondary afterthought. To build truly helpful AI for the Arab world, we need to train these models on the messy, beautiful, diverse reality of how people actually talk.

The Takeaway

Think of this paper as a report card for AI. It says: "You're an A+ student in the classroom (MSA), but you're failing the real world (Dialects)."

The authors have handed the AI community a new, fairer test that covers the whole spectrum of Arabic life. Their goal is to push developers to build models that don't just speak the language of the government, but the language of the people.
