Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

This paper introduces the first NLP dataset for the endangered Meenzerisch dialect of Mainz. It demonstrates that current large language models struggle to generate or define the dialect's words, achieving accuracy below 10% even with few-shot learning and rule extraction, and it highlights an urgent need for further research and resources to preserve German dialects.

Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense

Published 2026-03-05

Here is an explanation of the paper, "Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect," translated into simple, everyday language with some creative analogies.

The Big Idea: The "Lost in Translation" Problem

Imagine you have a very old, local recipe book written in a secret code that only your great-grandparents in a specific German town (Mainz) understand. This code is called Meenzerisch. It's the language of their carnival, their jokes, and their history.

Now, imagine you ask a super-smart, world-famous robot chef (a Large Language Model, or LLM) to translate that recipe book into the common language everyone understands, which in this analogy stands for Standard German. You expect the robot to be amazing because it knows everything about the internet.

The bad news: The robot chef is completely lost. It doesn't just make a few mistakes; it barely understands the code at all.

This paper is the first time researchers tried to teach these super-smart robots about the Mainz dialect, and the results were a harsh reality check: The robots are failing miserably.


Step 1: Building the Dictionary (The "Digital Archaeology")

Before they could test the robots, the researchers had to build a dictionary. Since Meenzerisch is mostly spoken and rarely written down in digital formats, they had to do some "digital archaeology."

  • The Source: They found a physical book from 1966 called the Mainzer Wörterbuch.
  • The Process:
    1. They took photos of the pages (Scanning).
    2. They used software to turn the pictures into text (OCR), like a scanner trying to read handwriting.
    3. Humans had to fix the scanner's mistakes (Manual Correction).
    4. They used a smart AI to pull out the words and their meanings (Extraction).
  • The Result: They created a digital list of 2,351 words. Each entry has a word in the Mainz dialect and its meaning in standard German. Think of it as a Rosetta Stone for Mainz.
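The extraction step above boils down to parsing OCR'd dictionary lines into word–meaning pairs. Here is a minimal sketch assuming each entry comes out of OCR as "dialect word – gloss" on its own line; the entry format, the `parse_entries` helper, and the sample words are illustrative assumptions, not the paper's actual pipeline or data:

```python
import re

def parse_entries(ocr_text: str) -> dict[str, str]:
    """Parse OCR'd dictionary lines of the form 'dialect word - gloss'
    into a {dialect_word: standard_german_gloss} mapping."""
    entries = {}
    for line in ocr_text.splitlines():
        # Tolerate a hyphen or en dash as the separator the OCR produced
        m = re.match(r"^\s*(\S[^-\u2013]*?)\s*[-\u2013]\s*(.+?)\s*$", line)
        if m:
            entries[m.group(1)] = m.group(2)
    return entries

# Illustrative entries only, not from the Mainzer Wörterbuch itself
sample = "Gickel – Hahn\nBabbelwasser – Redseligkeit"
print(parse_entries(sample))
```

A real pipeline would also need the manual-correction pass the authors describe, since OCR errors (misread letters, merged lines) would otherwise end up in the dictionary.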

Step 2: The Two Big Tests

With their new dictionary, the researchers put the robots to work with two specific challenges.

Challenge A: The Translator (Definition Generation)

The Task: The robot is given a weird Mainz word (e.g., "Schwollescheer") and asked, "What does this mean in normal German?"
The Analogy: Imagine showing a robot a picture of a "Glocke" (a bell) but calling it a "Schwollescheer." You ask, "What is this?"
The Result: The robots were terrible.

  • The best robot (Llama-3.3) got it right only 6.27% of the time.
  • The average robot got it right less than 5% of the time.
  • Comparison: When asked to translate normal English words, these same robots got it right 86% of the time. They aren't "dumb"; they just don't know this specific dialect.

Challenge B: The Word Generator (Word Generation)

The Task: The robot is given a description in normal German (e.g., "A person who rides horses in a military unit") and asked, "What is the Mainz word for this?"
The Analogy: You tell the robot, "I need the word for 'hunger' in the Mainz dialect."
The Result: This was even worse.

  • The best robot got it right only 1.51% of the time.
  • Most robots were essentially guessing randomly, getting it right less than 1% of the time.
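Accuracy numbers like the ones above are typically exact-match scores: the model's answer counts as correct only if it equals the gold answer after light normalization. A minimal sketch of such scoring, where the normalization rules and the sample predictions are assumptions rather than the paper's actual evaluation script:

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference after lowercasing
    and stripping surrounding whitespace and punctuation."""
    def norm(s: str) -> str:
        return s.strip().strip(".,!?\"'").lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

# Hypothetical model outputs vs. gold Standard German glosses
preds = ["Hahn", "ein Vogel", "Redseligkeit."]
golds = ["Hahn", "Hunger", "Redseligkeit"]
print(f"{exact_match_accuracy(preds, golds):.2%}")  # → 66.67%
```

Exact match is a strict bar: a near-miss paraphrase scores zero, which is one reason dialect scores this low signal a genuine knowledge gap rather than a formatting quirk.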

Step 3: Trying to Help the Robots (The "Cheat Sheets")

The researchers didn't give up. They tried two tricks to help the robots learn faster, like giving a student a cheat sheet before a test.

  1. Few-Shot Learning (The "Example" Trick):

    • They showed the robot a few examples first: "Here is a word and its meaning. Here is another. Now, guess this one."
    • Result: It helped a tiny bit, but accuracy only went up to about 9%. It's like giving a student a few practice questions; they still don't understand the subject.
  2. Rule Extraction (The "Grammar Book" Trick):

    • They asked a super-smart AI to write down the rules of the dialect (e.g., "In Mainz, they often change 'en' to 'ele'"). They fed these rules to the other robots.
    • Result: This helped slightly for translating words to meanings, but actually made things slightly worse for guessing the words from meanings. The robots still couldn't apply the rules correctly.
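The few-shot trick above amounts to prepending a handful of solved word–meaning pairs to the prompt before asking about the new word. A minimal sketch of how such a prompt might be assembled; the instruction wording and the example pairs are illustrative, not the paper's actual prompt:

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query_word: str) -> str:
    """Assemble a prompt showing a few dialect-to-Standard-German pairs
    before asking for the query word's meaning."""
    lines = ["Translate the Mainz dialect (Meenzerisch) word into Standard German."]
    for dialect, standard in examples:
        lines.append(f"Meenzerisch: {dialect}\nStandard German: {standard}")
    # Leave the final answer slot empty for the model to fill in
    lines.append(f"Meenzerisch: {query_word}\nStandard German:")
    return "\n\n".join(lines)

# Illustrative example pairs, not entries from the paper's dataset
demo = [("Gickel", "Hahn"), ("Babbele", "Reden")]
print(build_few_shot_prompt(demo, "Schwollescheer"))
```

For the rule-extraction trick, the extracted rules would simply be inserted as extra text at the top of the same prompt; the paper's finding is that neither addition moves accuracy much.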

Why Does This Matter?

You might ask, "So what? Who cares if a robot knows a German dialect?"

  1. Cultural Preservation: Dialects are dying out. If we want to save them, we need technology that understands them. If robots can't understand them, they can't help preserve them.
  2. Bias: If a robot only understands "Standard" German, it treats people who speak dialects like they are "broken" or "wrong." This paper shows that the robots are currently biased against these speakers.
  3. The "Low-Resource" Problem: The internet is full of data for English and Standard German. It is almost empty for dialects. The robots are like chefs who have read every cookbook in the world, except for the one from your hometown.

The Conclusion

The paper ends with a clear message: Current AI is not ready for local dialects.

Even the smartest, most expensive robots on the planet are failing to understand the language of Mainz. They need more data, more research, and a lot more help from the people who actually speak the language.

The Takeaway: Just because a robot is "smart" doesn't mean it knows your language. For the dialect of Mainz, the robot is currently as lost as a tourist without a map. As the title says: "Meenz bleibt Meenz" (Mainz remains Mainz), but the robots haven't figured out how to speak its dialect yet.