Imagine you have a giant, global library called Wikipedia. It has books written in over 300 different languages, from English and Spanish to tiny, rare languages spoken by only a few thousand people.
For a long time, if you wanted to test how well a computer could read and understand these books, you mostly had to test it in English. That's like giving driving tests only on sunny days in California, then assuming those cars can handle snow in Siberia or monsoons in India.
This paper introduces MultiWikiQA, a massive new "driving test" for computers that covers 306 languages. Here's how they built it and what they found, explained simply:
1. The Recipe: How They Made the Test
The researchers didn't write these questions by hand (that would take forever!). Instead, they used a super-smart AI (a Large Language Model) to act like a creative chef. Here are the steps, with a short code sketch after the list:
- The Ingredients: They took a Wikipedia article (the "context").
- The Cooking: They asked the AI to chop up the article and create a list of questions and answers, like a quiz.
- The Safety Check: They made sure the answers were actually in the text, word-for-word.
- The "Anti-Cheat" Sauce: This is the clever part. Sometimes, computers are lazy; if the question says "What is the capital of France?" and the text says "Paris is the capital," the computer just matches the words "Paris" and "capital." To stop this, the researchers asked the AI to rewrite the questions using different words and sentence structures. It's like changing a math problem from "2 + 2" to "What is the sum of two pairs?" so the computer has to actually think rather than just copy-paste.
2. The Taste Test: Did the Questions Sound Natural?
Just because a computer wrote a question doesn't mean it sounds like something a human would ask. It might sound robotic or weird.
To fix this, the researchers hired 156 real humans speaking 30 different languages (both big ones like French and small ones like Icelandic) to taste-test the questions. Each was asked: "Does this sound like a question a real person would ask?"
The Result: The questions scored very high. Even for the smallest languages, they sounded "mostly natural." This showed that the AI chef did a good job cooking up human-like quizzes.
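To see what such a taste test produces in practice, here is a toy sketch of averaging human ratings per language. The 1-to-5 scale, the language names, and the numbers are invented for illustration; only the "average the ratings per language" idea comes from the section above.

```python
from collections import defaultdict
from statistics import mean

# Each entry: (language, rating on a hypothetical 1-5 naturalness scale).
ratings = [
    ("French", 5), ("French", 4), ("French", 5),
    ("Icelandic", 4), ("Icelandic", 5), ("Icelandic", 4),
]

by_language = defaultdict(list)
for language, score in ratings:
    by_language[language].append(score)

# Report the average naturalness score per language.
for language, scores in sorted(by_language.items()):
    print(f"{language}: mean naturalness {mean(scores):.2f} (n={len(scores)})")
```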
3. The Race: Who Won the Driving Test?
Once the test was ready, they ran 6 different computer models through it. These models ranged from small, basic ones to huge, advanced ones. (A scoring sketch follows the results below.)
The Big Discovery:
The test was hard. Even the smartest computers struggled.
- The Gap: There was a huge difference in performance between languages. The computers were like marathon runners who had trained for years on English tracks and were suddenly dropped into a jungle when faced with languages they barely knew.
- High-Resource Languages: For languages like English, German, or Spanish, the computers did pretty well (scoring around 70-80% accuracy).
- Low-Resource Languages: For languages with fewer digital books (like some African or indigenous languages), the computers often scored very low, sometimes barely getting any answers right.
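Here is a runnable sketch of the scoring idea: exact-match accuracy per language, computed from model answers. The model outputs, gold answers, and resulting percentages below are invented placeholders, not the paper's numbers.

```python
from collections import defaultdict

# Each record: (language, gold answer, model's answer). Toy data only.
records = [
    ("English", "Paris", "Paris"),
    ("English", "1969", "1969"),
    ("Zulu", "eGoli", "Johannesburg"),
]

correct: dict[str, int] = defaultdict(int)
total: dict[str, int] = defaultdict(int)
for language, gold, predicted in records:
    total[language] += 1
    # Exact match after light normalisation (an illustrative choice;
    # real benchmarks often report token-level F1 as well).
    if predicted.strip().lower() == gold.strip().lower():
        correct[language] += 1

for language in sorted(total):
    accuracy = 100 * correct[language] / total[language]
    print(f"{language}: {accuracy:.0f}% exact match on {total[language]} question(s)")
```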
Why Does This Matter?
Think of AI as a new student in a global school.
- Before this paper: The school only had exams in English. We could see the student was great at English, but we had no fair way to check whether it could read the other 300+ languages.
- After this paper: The school finally has exams in 306 languages. We now have a fair way to see which students (AI models) are actually good at reading the whole world, and which ones are still struggling with the smaller languages.
The Bottom Line:
This paper gives us a giant, fair playground to test how well AI understands the world's languages. It shows us that while AI is getting smarter, it still has a long way to go to truly understand everyone, everywhere, equally. It's a wake-up call that we need to build better "brains" for the languages that have been left behind.