MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

The paper introduces MENLO, a framework and dataset for evaluating native-like LLM response quality across 47 languages, demonstrating that while zero-shot judges fall short of human annotators, reinforcement learning and reward shaping can significantly improve both evaluation accuracy and multilingual model proficiency.

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán

Published 2026-03-03

Imagine you are a global restaurant chain. You have a master chef (the AI) who can cook amazing dishes in English. But now, you want to open branches in 47 different countries. You don't just want the food to be edible; you want it to taste exactly like a dish made by a local grandmother in that specific village. It needs to use the right local spices, respect local customs, and feel like it was cooked by someone who grew up there.

This paper, titled "From Preferences to Proficiency," is about building a system to teach AI how to cook like a local in 47 different languages, not just translate English recipes.

Here is the breakdown of their journey, using simple analogies:

1. The Problem: The "Tourist" AI

Currently, most AI models are like tourists. They can speak the language, but they sound like outsiders. They might use the wrong slang, miss cultural nuances, or give advice that doesn't fit the local context.

  • Example: If you ask an AI in Mexico how to politely ask for seconds at a family dinner, a "tourist" AI might give a generic answer. A "native" AI would know exactly how to say it with the right warmth and respect for that specific region.

2. The Solution: The "MENLO" Framework

The researchers built a new system called MENLO (Multilingual Evaluation of Native-Like Output). Think of MENLO as a super-tasting panel and a training manual rolled into one.

They broke down "native-like quality" into four specific flavors:

  1. Fluency: Is the sentence smooth, or does it sound like a robot stuttering?
  2. Tone: Is the voice friendly, professional, or funny? Does it fit the mood?
  3. Localized Tone: Does it sound like it belongs in this specific neighborhood? (e.g., using the right slang for Buenos Aires vs. Madrid).
  4. Localized Factuality: Does it know the local facts? (e.g., knowing which holidays are celebrated in that specific country).

3. The Dataset: The "Taste Test"

To train the AI, they didn't just ask it to guess. They hired native speakers from 47 different language varieties (like Brazilian Portuguese vs. European Portuguese, or Indian English vs. US English).

  • The Process: They created 6,423 scenarios (prompts) and had the AI generate two different answers for each.
  • The Judges: Real humans tasted both answers and rated them on a scale of 1 to 5.
  • The Result: A massive "Taste Test" dataset where humans decided which AI response sounded more "native."
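The summary doesn't show the dataset's actual schema, but the process above can be sketched as a minimal data structure: one prompt, two candidate responses, and a native speaker's 1-to-5 scores on the four MENLO dimensions (field names here are illustrative, not the paper's):

```python
from dataclasses import dataclass

# The four quality dimensions described in section 2.
DIMENSIONS = ("fluency", "tone", "localized_tone", "localized_factuality")

@dataclass
class PreferencePair:
    """One annotated example: a prompt, a language variety, and a native
    speaker's 1-5 scores for each of two responses per dimension."""
    prompt: str
    locale: str               # e.g. "pt-BR" vs. "pt-PT"
    scores_a: dict            # dimension -> 1..5 for response A
    scores_b: dict            # dimension -> 1..5 for response B

    def preferred(self) -> str:
        """Aggregate per-dimension scores into an overall preference."""
        total_a = sum(self.scores_a[d] for d in DIMENSIONS)
        total_b = sum(self.scores_b[d] for d in DIMENSIONS)
        if total_a == total_b:
            return "tie"
        return "A" if total_a > total_b else "B"

pair = PreferencePair(
    prompt="How do I politely ask for seconds at a family dinner?",
    locale="es-MX",
    scores_a={"fluency": 5, "tone": 4, "localized_tone": 5, "localized_factuality": 4},
    scores_b={"fluency": 5, "tone": 3, "localized_tone": 2, "localized_factuality": 4},
)
print(pair.preferred())  # -> A
```

Summing the dimensions is just one simple way to aggregate; the key idea is that each example carries human judgments along all four axes, not a single score.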

4. The Discovery: The "Side-by-Side" Trick

The researchers asked: Can we replace the expensive human judges with an AI judge?

They found a clever trick. If you ask an AI to grade one answer in isolation, it often struggles (like rating a single song with nothing to compare it to). But if you put two answers side by side and ask, "Which one is better?", the AI becomes a much sharper judge.

  • Analogy: It's like trying to guess the weight of a single apple in the dark. It's hard. But if you hold two apples and ask, "Which is heavier?", your brain instantly knows. The AI works the same way.
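The pointwise-versus-pairwise contrast boils down to two different prompt shapes for the judge. The templates below are illustrative, not the paper's actual prompts:

```python
def pointwise_prompt(prompt: str, response: str) -> str:
    """Pointwise judging: score one response in isolation (harder, noisier)."""
    return (
        "Rate the following response for native-likeness on a 1-5 scale.\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Score:"
    )

def pairwise_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Pairwise judging: show both candidates side by side (sharper signal)."""
    return (
        "Which response sounds more like a native speaker?\n"
        f"Prompt: {prompt}\n"
        f"[A] {response_a}\n"
        f"[B] {response_b}\n"
        "Answer with A, B, or tie:"
    )
```

The model behind the judge is the same in both cases; only the framing changes, which is what makes the "which is heavier?" effect so cheap to exploit.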

5. The Training: From "Student" to "Master Chef"

Even with the side-by-side trick, the AI judges weren't perfect. So, the researchers used a technique called Reinforcement Learning (RL).

  • The Analogy: Imagine a cooking student (the AI) trying to learn.
    • Old Way (SFT): The teacher just shows the student the right recipe once.
    • New Way (RL): The student tries to cook, the teacher tastes it, gives feedback, and the student tries again, adjusting the spices slightly each time until it's perfect.
  • The Result: By using this "taste-and-adjust" method, they trained an AI judge that was almost as good as a human native speaker.
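The "taste-and-adjust" signal can be sketched as a reward function: the judge earns reward when its verdict matches the human annotators, with an optional shaped bonus for scores close to the human 1-to-5 ratings. This is a minimal illustration of reward shaping, not the paper's exact reward design:

```python
def judge_reward(judge_verdict: str, human_verdict: str,
                 judge_scores=None, human_scores=None) -> float:
    """Base reward: 1.0 if the judge's verdict ("A", "B", or "tie") matches
    the human label, else 0.0. Optional shaping: a bonus of up to 0.5 for
    per-dimension scores close to the human 1-5 ratings."""
    reward = 1.0 if judge_verdict == human_verdict else 0.0
    if judge_scores and human_scores:
        # Mean absolute gap to the human scores; the largest possible gap
        # on a 1-5 scale is 4, so the bonus scales linearly from 0.5 to 0.
        gap = sum(abs(j - h) for j, h in zip(judge_scores, human_scores)) / len(human_scores)
        reward += 0.5 * (1 - gap / 4)
    return reward

print(judge_reward("A", "A"))                          # -> 1.0
print(judge_reward("A", "A", [5, 4, 5, 4], [5, 4, 5, 4]))  # -> 1.5
```

In an RL loop, this scalar is what nudges the judge's "spices" after every attempt: verdicts that agree with native speakers get reinforced, and shaped bonuses reward being close even when not exact.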

6. The Final Twist: The "Self-Improving" Loop

Here is the coolest part. They took their newly trained "Master AI Judge" and used it to teach the original AI chef how to cook better.

  • The AI Judge acts as a personal trainer. It watches the AI chef cook, says, "No, that sounds too formal for a Mexican family dinner," and the chef tries again.
  • The Catch: The AI Judge is a bit overconfident. It thinks the chef improved by 30%, but the human tasters only noticed a 10% improvement. The AI is like a hype-man who loves the performance a little too much!
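One simple way to picture the judge-as-trainer loop is best-of-n selection: sample several candidate responses and keep whichever one the trained judge prefers in pairwise comparisons. This is an illustrative stand-in (the paper trains the generator with RL, not selection), assuming a `judge(prompt, a, b)` that returns "A" or "B":

```python
def improve_with_judge(prompt, generate, judge, n=4):
    """Best-of-n sketch: sample n candidates and keep the one the trained
    pairwise judge prefers, comparing each challenger against the current best."""
    candidates = [generate(prompt) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # judge(prompt, a, b) returns "A" if a is preferred, else "B".
        if judge(prompt, best, challenger) == "B":
            best = challenger
    return best

# Toy stand-ins: a canned generator and a judge that prefers longer answers.
responses = iter(["ok", "muy bien", "con mucho gusto, claro que si"])
pick = improve_with_judge(
    "How do I politely ask for seconds?",
    generate=lambda p: next(responses),
    judge=lambda p, a, b: "A" if len(a) >= len(b) else "B",
    n=3,
)
print(pick)
```

The overconfidence problem noted above lives entirely inside `judge`: if the judge systematically overrates certain styles, this loop will amplify them, which is why the human-measured gains (10%) lag the judge-measured ones (30%).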

Summary

This paper is a roadmap for making AI feel human in every corner of the world.

  1. They built a menu of what "native" sounds like (MENLO).
  2. They proved that comparing two answers is the best way to judge quality.
  3. They trained an AI to be a super-judge using a "taste-and-adjust" method.
  4. They used that judge to teach the AI to speak like a local, not a tourist.

The ultimate goal? To ensure that when you talk to an AI in any of these 47 languages, it feels like you're chatting with a neighbor, not a machine reading a dictionary.
