MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

The paper introduces MENLO, a framework and dataset for evaluating native-like LLM response quality across 47 languages, demonstrating that while zero-shot judges fall short of human annotators, reinforcement learning and reward shaping can significantly improve both evaluation accuracy and multilingual model proficiency.

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán

Published 2026-03-03

Imagine you are a global restaurant chain. You have a master chef (the AI) who can cook amazing dishes in English. But now, you want to open branches in 47 different countries. You don't just want the food to be edible; you want it to taste exactly like a dish made by a local grandmother in that specific village. It needs to use the right local spices, respect local customs, and feel like it was cooked by someone who grew up there.

This paper, titled "From Preferences to Proficiency," is about building a system to teach AI how to cook like a local in 47 different languages, not just translate English recipes.

Here is the breakdown of their journey, using simple analogies:

1. The Problem: The "Tourist" AI

Currently, most AI models are like tourists. They can speak the language, but they sound like outsiders. They might use the wrong slang, miss cultural nuances, or give advice that doesn't fit the local context.

  • Example: If you ask an AI in Mexico how to politely ask for seconds at a family dinner, a "tourist" AI might give a generic answer. A "native" AI would know exactly how to say it with the right warmth and respect for that specific region.

2. The Solution: The "MENLO" Framework

The researchers built a new system called MENLO (Multilingual Evaluation of Native-Like Output). Think of MENLO as a super-tasting panel and a training manual rolled into one.

They broke down "native-like quality" into four specific flavors:

  1. Fluency: Is the sentence smooth, or does it sound like a robot stuttering?
  2. Tone: Is the voice friendly, professional, or funny? Does it fit the mood?
  3. Localized Tone: Does it sound like it belongs in this specific neighborhood? (e.g., using the right slang for Buenos Aires vs. Madrid).
  4. Localized Factuality: Does it know the local facts? (e.g., knowing which holidays are celebrated in that specific country).

3. The Dataset: The "Taste Test"

To train the AI, they didn't just ask it to guess. They hired native speakers from 47 different language varieties (like Brazilian Portuguese vs. European Portuguese, or Indian English vs. US English).

  • The Process: They created 6,423 scenarios (prompts) and had the AI generate two different answers for each.
  • The Judges: Real humans tasted both answers and rated them on a scale of 1 to 5.
  • The Result: A massive "Taste Test" dataset where humans decided which AI response sounded more "native."
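The summary doesn't show the dataset's actual schema, but the process above can be sketched as a minimal data structure: one prompt, two candidate responses, and a native speaker's 1-to-5 scores on the four MENLO dimensions (field names here are illustrative, not the paper's):

```python
from dataclasses import dataclass

# The four quality dimensions described in section 2.
DIMENSIONS = ("fluency", "tone", "localized_tone", "localized_factuality")

@dataclass
class PreferencePair:
    """One annotated example: a prompt, a language variety, and a native
    speaker's 1-5 scores for each of two responses per dimension."""
    prompt: str
    locale: str               # e.g. "pt-BR" vs. "pt-PT"
    scores_a: dict            # dimension -> 1..5 for response A
    scores_b: dict            # dimension -> 1..5 for response B

    def preferred(self) -> str:
        """Aggregate per-dimension scores into an overall preference."""
        total_a = sum(self.scores_a[d] for d in DIMENSIONS)
        total_b = sum(self.scores_b[d] for d in DIMENSIONS)
        if total_a == total_b:
            return "tie"
        return "A" if total_a > total_b else "B"

pair = PreferencePair(
    prompt="How do I politely ask for seconds at a family dinner?",
    locale="es-MX",
    scores_a={"fluency": 5, "tone": 4, "localized_tone": 5, "localized_factuality": 4},
    scores_b={"fluency": 5, "tone": 3, "localized_tone": 2, "localized_factuality": 4},
)
print(pair.preferred())  # -> A
```

Summing the dimensions is just one simple way to aggregate; the key idea is that each example carries human judgments along all four axes, not a single score.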

4. The Discovery: The "Side-by-Side" Trick

The researchers asked: Can we replace the expensive human judges with an AI judge?

They found a clever trick. If you ask an AI to grade one answer in isolation, it often struggles (like rating a single song with nothing to compare it to). But if you put two answers side by side and ask, "Which one is better?", the AI becomes a much sharper judge.

  • Analogy: It's like trying to guess the weight of a single apple in the dark. It's hard. But if you hold two apples and ask, "Which is heavier?", your brain instantly knows. The AI works the same way.
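The pointwise-versus-pairwise contrast boils down to two different prompt shapes for the judge. The templates below are illustrative, not the paper's actual prompts:

```python
def pointwise_prompt(prompt: str, response: str) -> str:
    """Pointwise judging: score one response in isolation (harder, noisier)."""
    return (
        "Rate the following response for native-likeness on a 1-5 scale.\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Score:"
    )

def pairwise_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Pairwise judging: show both candidates side by side (sharper signal)."""
    return (
        "Which response sounds more like a native speaker?\n"
        f"Prompt: {prompt}\n"
        f"[A] {response_a}\n"
        f"[B] {response_b}\n"
        "Answer with A, B, or tie:"
    )
```

The model behind the judge is the same in both cases; only the framing changes, which is what makes the "which is heavier?" effect so cheap to exploit.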

5. The Training: From "Student" to "Master Chef"

Even with the side-by-side trick, the AI judges weren't perfect. So, the researchers used a technique called Reinforcement Learning (RL).

  • The Analogy: Imagine a cooking student (the AI) trying to learn.
    • Old Way (SFT): The teacher just shows the student the right recipe once.
    • New Way (RL): The student tries to cook, the teacher tastes it, gives feedback, and the student tries again, adjusting the spices slightly each time until it's perfect.
  • The Result: By using this "taste-and-adjust" method, they trained an AI judge that was almost as good as a human native speaker.
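The "taste-and-adjust" signal can be sketched as a reward function: the judge earns reward when its verdict matches the human annotators, with an optional shaped bonus for scores close to the human 1-to-5 ratings. This is a minimal illustration of reward shaping, not the paper's exact reward design:

```python
def judge_reward(judge_verdict: str, human_verdict: str,
                 judge_scores=None, human_scores=None) -> float:
    """Base reward: 1.0 if the judge's verdict ("A", "B", or "tie") matches
    the human label, else 0.0. Optional shaping: a bonus of up to 0.5 for
    per-dimension scores close to the human 1-5 ratings."""
    reward = 1.0 if judge_verdict == human_verdict else 0.0
    if judge_scores and human_scores:
        # Mean absolute gap to the human scores; the largest possible gap
        # on a 1-5 scale is 4, so the bonus scales linearly from 0.5 to 0.
        gap = sum(abs(j - h) for j, h in zip(judge_scores, human_scores)) / len(human_scores)
        reward += 0.5 * (1 - gap / 4)
    return reward

print(judge_reward("A", "A"))                          # -> 1.0
print(judge_reward("A", "A", [5, 4, 5, 4], [5, 4, 5, 4]))  # -> 1.5
```

In an RL loop, this scalar is what nudges the judge's "spices" after every attempt: verdicts that agree with native speakers get reinforced, and shaped bonuses reward being close even when not exact.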

6. The Final Twist: The "Self-Improving" Loop

Here is the coolest part. They took their newly trained "Master AI Judge" and used it to teach the original AI chef how to cook better.

  • The AI Judge acts as a personal trainer. It watches the AI chef cook, says, "No, that sounds too formal for a Mexican family dinner," and the chef tries again.
  • The Catch: The AI Judge is a bit overconfident. It thinks the chef improved by 30%, but the human tasters only noticed a 10% improvement. The AI is like a hype-man who loves the performance a little too much!
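One simple way to picture the judge-as-trainer loop is best-of-n selection: sample several candidate responses and keep whichever one the trained judge prefers in pairwise comparisons. This is an illustrative stand-in (the paper trains the generator with RL, not selection), assuming a `judge(prompt, a, b)` that returns "A" or "B":

```python
def improve_with_judge(prompt, generate, judge, n=4):
    """Best-of-n sketch: sample n candidates and keep the one the trained
    pairwise judge prefers, comparing each challenger against the current best."""
    candidates = [generate(prompt) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # judge(prompt, a, b) returns "A" if a is preferred, else "B".
        if judge(prompt, best, challenger) == "B":
            best = challenger
    return best

# Toy stand-ins: a canned generator and a judge that prefers longer answers.
responses = iter(["ok", "muy bien", "con mucho gusto, claro que si"])
pick = improve_with_judge(
    "How do I politely ask for seconds?",
    generate=lambda p: next(responses),
    judge=lambda p, a, b: "A" if len(a) >= len(b) else "B",
    n=3,
)
print(pick)
```

The overconfidence problem noted above lives entirely inside `judge`: if the judge systematically overrates certain styles, this loop will amplify them, which is why the human-measured gains (10%) lag the judge-measured ones (30%).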

Summary

This paper is a roadmap for making AI feel human in every corner of the world.

  1. They built a menu of what "native" sounds like (MENLO).
  2. They proved that comparing two answers is the best way to judge quality.
  3. They trained an AI to be a super-judge using a "taste-and-adjust" method.
  4. They used that judge to teach the AI to speak like a local, not a tourist.

The ultimate goal? To ensure that when you talk to an AI in any of these 47 languages, it feels like you're chatting with a neighbor, not a machine reading a dictionary.
