Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

This paper introduces a novel 21-way multiparallel EuroParl dataset for assessing political bias in multilingual Large Language Models. Using it, the authors show that translation quality systematically favors mainstream parties over outsider groups, offering a fairness-based alternative to traditional English-language survey methods.

Paul Lerner, François Yvon

Published Thu, 12 Ma

Imagine you have a giant, super-smart robot translator that speaks 21 different European languages. You ask it to translate a speech from a politician, and it does a great job. But what if the robot secretly has a favorite team? What if it translates speeches from "Team Left" or "Team Right" with extra care and accuracy, while translating speeches from "Team Outsider" with more mistakes and clumsiness?

That is exactly what this paper investigates. The authors, Paul Lerner and François Yvon, wanted to see if modern AI translators are fair to all political groups, or if they have a hidden bias.

Here is the breakdown of their study, explained with some everyday analogies.

1. The Problem: The "English-Only" Survey

Usually, when people check if AI is biased, they ask the AI a bunch of questions in English, like a multiple-choice test. "Do you agree that X is bad?"

  • The Flaw: This is like judging a chef only by how they cook a burger. It doesn't tell you if they can cook sushi, pasta, or curry. Also, the questions themselves might be tricky or confusing.

2. The Solution: The "21-Lane Highway"

Instead of asking questions, the authors looked at real speeches from the European Parliament.

  • The Dataset: They built a massive new dataset called 21-EuroParl. Imagine a highway with 21 lanes. In the middle lane, there is a speech in German. In the other 20 lanes, there are perfect translations of that exact same speech in 20 other languages (French, Polish, Spanish, etc.).
  • The Metadata: Crucially, they tagged every speech with the speaker's political party. It's like knowing exactly which team the player is on before the game starts.
  • The Scale: This isn't a small test. It's 1.5 million sentences, covering 1,000+ speakers from 27 countries. It's a massive library of political debate.
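To make the "21-lane highway" concrete, here is a minimal, hypothetical sketch of how one aligned record in a multiparallel corpus like this could be represented. The field names, speaker, and sentences are illustrative, not the actual 21-EuroParl schema:

```python
# Illustrative record shape for a multiparallel corpus with party metadata.
# The real 21-EuroParl format may differ; this only shows the idea:
# one speech, many aligned translations, plus speaker/party tags.
record = {
    "speech_id": "ep-2019-07-15-042",   # hypothetical identifier
    "speaker": "Jane Doe",              # hypothetical speaker
    "party": "EPP",                     # the crucial political metadata
    "country": "DE",
    "text": {                           # the same sentence in 21 languages
        "de": "Wir müssen das Klima schützen.",
        "fr": "Nous devons protéger le climat.",
        "en": "We must protect the climate.",
        # ... 18 more languages
    },
}

# Any of the 21 languages can serve as source or target,
# which yields 21 * 20 = 420 directed translation pairs.
n_pairs = 21 * 20
```

Because every lane carries the same content, any difference in translation quality between parties cannot be blamed on the text itself.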

3. The Experiment: The "Fairness Race"

The researchers took this massive library and fed it into several popular AI models (like Llama, Qwen, and Gemma). They asked the AI to translate the speeches.

Then, they used a clever scoring system called the Borda Count.

  • The Analogy: Imagine a race with 8 runners (the 8 major political parties). In every single translation task (e.g., German to French), the AI ranks them based on how good the translation was.
    • The party with the best translation gets 7 points.
    • The party with the worst translation gets 0 points.
  • They did this for 420 different language combinations (every source-target pairing among the 21 languages). If the AI were perfectly fair, every party would end up with roughly the same average score, about 3.5 points, which is the mean you would expect if the rankings were random.
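The scoring scheme above can be sketched in a few lines. This is a simplified illustration of a Borda count over translation-quality scores, not the authors' actual code; the party labels and the random quality values are placeholders:

```python
# Sketch of Borda-count fairness scoring: for each translation task,
# rank the 8 parties by quality; best rank earns 7 points, worst earns 0.
from collections import defaultdict
import random

PARTIES = ["EPP", "S&D", "Renew", "Greens", "ECR", "ID", "Left", "NI"]

def borda_points(quality_by_party):
    """Assign Borda points for one task: worst quality -> 0, best -> 7."""
    ranked = sorted(quality_by_party, key=quality_by_party.get)
    return {party: points for points, party in enumerate(ranked)}

def average_borda(tasks):
    """Average each party's Borda points over all translation tasks."""
    totals = defaultdict(float)
    for quality_by_party in tasks:
        for party, pts in borda_points(quality_by_party).items():
            totals[party] += pts
    return {party: totals[party] / len(tasks) for party in totals}

# Simulate 420 language pairs with random (i.e. perfectly fair) quality:
random.seed(0)
tasks = [{p: random.random() for p in PARTIES} for _ in range(420)]
avg = average_borda(tasks)
# Under fairness, every party's average hovers near (8 - 1) / 2 = 3.5;
# a party sitting well above or below that is getting biased treatment.
```

The bias finding is simply that real model outputs do not look like this simulation: some parties' averages sit persistently above 3.5 and others persistently below.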

4. The Findings: The "VIP Treatment"

The results were shocking. The AI was not a neutral referee. It was a biased fan.

  • The Favorites: The "Establishment" parties (like the European People's Party and the Socialists) got the VIP treatment. Their speeches were translated with high accuracy and smoothness.
  • The Underdogs: The "Outsider" parties (like the far-left or non-aligned members) got the rough treatment. Their speeches were translated with more errors and lower quality.
  • The Pattern: This happened across all the different AI models they tested, regardless of how big or smart the model was. It wasn't just one bad robot; it was a systemic issue.

Think of it like this: a waiter who serves the regular customers flawlessly but keeps botching the orders of new customers. The kitchen is the same, but the quality of service depends on who you are.

5. Why Does This Happen?

The authors suggest a few possible reasons, like a recipe skewed by too much of one ingredient:

  1. Training Data: The AI learned from the internet. Maybe there are more high-quality translations of mainstream party speeches online than there are for smaller, fringe parties.
  2. Topic Differences: Mainstream parties might talk about standard topics (economy, health) that are easier for AI to translate, while outsiders might use more complex or niche language.

6. The Big Warning

The paper ends with a serious note of caution.

  • The Stakes: We are starting to use AI to help translate laws, debates, and news in democracies.
  • The Risk: If an AI translates a speech from a mainstream politician perfectly but garbles the speech of a minority politician, it distorts democracy. It makes the mainstream look smarter and the outsiders look confused or incoherent.

The Takeaway

This paper is a wake-up call. It shows that AI isn't just a neutral tool; it carries the biases of the data it was trained on. Just because a robot speaks 21 languages doesn't mean it treats all speakers equally. If we want fair democracies, we need to make sure our translators are fair referees, not biased fans.