Imagine you are a teacher trying to grade a new student's essay. But instead of reading the essay yourself, you ask a robot to translate the essay into your language, grade it, and then tell you how the student did.
This paper is essentially a report card on how we are currently testing AI (Large Language Models) in Icelandic, a language spoken by a relatively small number of people. The authors, who are experts in Icelandic and AI, are raising a huge red flag: many of the tests we are using are broken, and those broken tests are giving us a false picture of how smart these AIs really are.
Here is the breakdown of their findings using some everyday analogies:
1. The "Copy-Paste" Problem (Machine Translation)
Currently, because creating tests from scratch is hard and expensive, many researchers take famous English tests (like a 5th-grade science quiz) and simply run them through a machine translator to turn them into Icelandic (a rough sketch of this pipeline follows the list below).
- The Analogy: Imagine you want to test a student's knowledge of Icelandic history. Instead of writing a test in Icelandic, you take a test about American history, run it through Google Translate, and hand it to the student.
- The Result: The translation might be full of weird errors. A "turkey" (the bird) might get translated as "Turkey" (the country). A famous scientist's name might get changed to a random local name.
- The Paper's Finding: The authors found that many of these "translated" tests are so full of errors that they are useless. It's like giving a math test where the numbers are written in a language the student doesn't understand. If the AI gets the answer right, it might just be guessing because the question was nonsense, not because it actually understands Icelandic.
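To make this concrete, here is a minimal sketch of that pipeline in Python. Everything in it is invented for illustration: translate_to_icelandic stands in for whatever machine-translation system a researcher might call, and the sample quiz item is made up. This is not code from the paper, just the shape of the workflow it criticizes.

```python
# A rough sketch of the "copy-paste" benchmark pipeline.
# The helper and data below are hypothetical stand-ins.

def translate_to_icelandic(text: str) -> str:
    # Placeholder: a real pipeline would call a machine-translation API.
    return f"[MT] {text}"

english_benchmark = [
    {
        "question": "Which bird is traditionally eaten at Thanksgiving?",
        "choices": ["turkey", "eagle", "goose", "duck"],
        "answer": "turkey",
    },
]

icelandic_benchmark = []
for item in english_benchmark:
    # Every field is translated blindly, with no human review step.
    # If the MT system renders "turkey" (the bird) as "Tyrkland"
    # (the country), the error ships straight into the test.
    icelandic_benchmark.append({
        "question": translate_to_icelandic(item["question"]),
        "choices": [translate_to_icelandic(c) for c in item["choices"]],
        "answer": translate_to_icelandic(item["answer"]),
    })

print(icelandic_benchmark[0]["question"])
```

Notice what the loop never does: check whether the translated question still means the same thing, or means anything at all.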
2. The "Fake News" Factory (Synthetic Data)
Some tests aren't just translated; they are generated entirely by other AIs. An AI is asked, in effect, to "make up 1,000 questions about Iceland" (a sketch of this loop follows the list below).
- The Analogy: This is like asking a student to write their own history textbook, and then using that textbook to grade them. The student might make up facts that sound plausible but are completely wrong.
- The Paper's Finding: When AIs generate these questions, they often hallucinate (make things up). They might create a question about a person who never existed or a fact that is scientifically impossible. The authors found that some of these AI-generated tests were almost 100% flawed.
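Here is an equally rough sketch of that generation loop. ask_llm is a hypothetical stand-in for any chat-model API call, and the prompt and sample reply are invented. The point is what's missing: there is no step that checks the generated "facts" against reality.

```python
# Sketch of fully synthetic test generation. `ask_llm` is a
# hypothetical placeholder for a language-model API call.
import json

def ask_llm(prompt: str) -> str:
    # Placeholder: would send the prompt to a model and return its reply.
    # (Sample reply is Icelandic for "Who was the first president of Iceland?")
    return '{"question": "Hver var fyrsti forseti Íslands?", "answer": "..."}'

PROMPT = (
    "Write one quiz question about Iceland, in Icelandic, "
    "as JSON with 'question' and 'answer' fields."
)

synthetic_test = []
for _ in range(1000):
    item = json.loads(ask_llm(PROMPT))
    # Nothing here verifies the model's claims. A question about a
    # person who never existed, or a physically impossible "fact",
    # passes straight into the test -- which is how a benchmark can
    # end up almost entirely flawed.
    synthetic_test.append(item)
```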
3. The "Native Speaker" Gap
The most successful tests in the paper were the ones where actual humans who speak Icelandic wrote or checked the questions.
- The Analogy: This is like having a real Icelandic teacher write the test. They know the local culture, the slang, the correct grammar, and the history. They don't rely on a robot to guess what a sentence means.
- The Paper's Finding: Tests made by humans were much cleaner and more accurate. Tests made by machines (or translated by them) were full of "translationese" (awkward phrasing that no real human would ever say).
4. Why Does This Matter? (The "Garbage In, Garbage Out" Rule)
The authors argue that if we keep using these broken tests, we are lying to ourselves about how good AI is.
- The Analogy: Imagine you are training a dog, but the clicker is broken: it clicks at random, so the dog hears the reward signal even when it's just standing still. The dog will think it's doing something right, get confused, and never actually learn to sit.
- The Real Consequence: If AI models are trained or evaluated on these broken tests, they might learn to "game the system": spotting the weird translation errors and answering based on those artifacts rather than actually understanding the language. This makes the AI look smarter than it is, but in the real world it will fail at actual tasks (the sketch below shows how a broken test can be aced without any understanding).
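One way to see the danger: suppose sloppy translation leaves a tell-tale artifact, such as the correct choice always being the wordiest one. Then a trivial heuristic with zero language understanding aces the test. The toy items below are invented to illustrate the pattern; they are not data from the paper.

```python
# A toy demonstration of "gaming" a broken benchmark: if a translation
# artifact makes the correct choice systematically wordier, picking the
# longest option scores perfectly without any understanding.
# The two items are invented; the choices are Icelandic filler
# ("yes" vs. a long-winded "no"; "mountain" vs. a long volcano phrase).

flawed_items = [
    {"choices": ["já", "nei, alls ekki undir neinum kringumstæðum"], "answer_idx": 1},
    {"choices": ["fjall", "mjög stórt og frægt eldfjall á Íslandi"], "answer_idx": 1},
]

def pick_longest(choices: list[str]) -> int:
    # A dumb heuristic, not language understanding.
    return max(range(len(choices)), key=lambda i: len(choices[i]))

score = sum(pick_longest(it["choices"]) == it["answer_idx"] for it in flawed_items)
print(f"accuracy: {score / len(flawed_items):.0%}")  # 100%, misleadingly
```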
The Main Takeaway
The paper asks a simple question: "Who is checking the people who make the tests?"
Right now, the answer is often "no one" or "a robot." The authors are calling for a change:
- Stop blindly translating tests. If you translate a test, a human native speaker must check it to make sure it makes sense (a rough sketch of such a review gate follows this list).
- Involve real humans. Native speakers need to be part of the process to ensure the tests reflect real culture and language, not just robotic translations.
- Quality over Quantity. It's better to have a small, perfect test than a huge, broken one.
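As a final sketch, here is what those recommendations might look like as a simple review gate: nothing reaches the published test until a native speaker has approved it. The function name and workflow are my own illustration, not tooling from the paper.

```python
# Hypothetical review gate: machine output is treated as a draft,
# and only items approved by a native speaker get published.

def native_speaker_review(item: dict) -> bool:
    # Placeholder: in practice, a human fluent in Icelandic reads the
    # item and rejects translation errors, cultural mismatches, and
    # awkward "translationese".
    return input(f"Is this item correct and natural? {item} [y/n] ") == "y"

draft_items = [{"question": "[MT] Which bird ...?", "answer": "[MT] turkey"}]

published_test = [item for item in draft_items if native_speaker_review(item)]
# Quality over quantity: the published test may be smaller,
# but every item in it has been vetted by a human.
```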
In short: We can't trust the scoreboard if the referees are robots that don't speak the language. We need human experts to ensure the tests are fair and accurate before we declare an AI a genius.