Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

This paper introduces two new benchmarks, WinoMTeus and FLORES+Gender, to evaluate gender bias in machine translation involving Basque, revealing that current large language models and MT systems exhibit a systematic preference for masculine forms when translating between genderless and gendered languages.

Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez

Published Tue, 10 Ma

Imagine you have a group of very smart, but slightly biased, robots. These robots are trained by reading almost everything ever written on the internet. Because the internet has a lot of old-fashioned ideas about men and women, these robots often accidentally learn those stereotypes too.

For example, if you ask a robot to translate a sentence about a "nurse" from a language that doesn't care about gender (like Basque) into a language that does (like Spanish or French), the robot might guess, "Oh, nurses are usually women, so I'll use the female word." But if you ask about a "mechanic," it might guess, "Mechanics are usually men, so I'll use the male word."

The problem is, the robots often get this wrong based on reality. In the real world, there are plenty of male nurses and female mechanics, but the robots stick to their old stereotypes.

This paper is like a detective report from a team of researchers in the Basque Country. They wanted to see if these robots were being fair when dealing with the Basque language, which is unique because it doesn't have "male" or "female" words for jobs or people.

Here is how they investigated, using two creative "tests":

Test 1: The "Job Swap" (WinoMTeus)

The Setup: Imagine you have a list of jobs in Basque where the word is neutral (it doesn't say "male nurse" or "female nurse"). It just says "nurse."
The Experiment: The researchers asked the robots to translate these neutral jobs into Spanish and French.
The Trap: Since Spanish and French must pick a gender (you can't say "the nurse" without saying "the male nurse" or "the female nurse"), the robot has to make a guess.
The Reality Check: The researchers compared the robots' guesses against real-life statistics from the Basque Country. They asked: "Did the robot guess that 90% of nurses are men, even though in real life, 96% are women?"

The Verdict: The robots were guilty! They had a strong habit of defaulting to the "male" version, even for jobs that are mostly done by women in real life. It's like a robot that thinks every doctor is a man and every secretary is a woman, just because it read too many old books.
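To make the "reality check" concrete, here is a minimal sketch of the kind of accounting Test 1 implies: tally how often a system picks the masculine form for a gender-neutral Basque profession word, and flag professions where that pick contradicts the real-world majority. The profession names, percentages, and system choices below are invented placeholders, not the paper's actual data or code.

```python
# Hypothetical WinoMTeus-style check. All numbers are illustrative.

# Fraction of workers who are women, per profession (invented figures).
real_world_pct_women = {"nurse": 0.96, "mechanic": 0.05, "teacher": 0.75}

# Gender the MT system chose when forced to pick one in Spanish/French.
system_choice = {"nurse": "masculine", "mechanic": "masculine", "teacher": "masculine"}

def masculine_default_rate(choices):
    """Share of professions the system translated with the masculine form."""
    masc = sum(1 for g in choices.values() if g == "masculine")
    return masc / len(choices)

def mismatches(choices, pct_women, threshold=0.5):
    """Professions where the system's pick contradicts the majority gender."""
    out = []
    for job, gender in choices.items():
        majority = "feminine" if pct_women[job] > threshold else "masculine"
        if gender != majority:
            out.append(job)
    return out

print(masculine_default_rate(system_choice))            # 1.0 — always masculine
print(mismatches(system_choice, real_world_pct_women))  # ['nurse', 'teacher']
```

A real evaluation would run this over the full benchmark and a detector for grammatical gender in the output, but the comparison logic is the same: system choices on one side, demographic statistics on the other.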

Test 2: The "Mirror Test" (FLORES+Gender)

The Setup: This time, they did the reverse. They took sentences from English and Spanish where the gender was clearly marked (e.g., "The male driver" vs. "The female driver").
The Experiment: They asked the robots to translate these into neutral Basque.
The Question: Does the robot translate the sentence better if the person in the story is a man? Does it stumble more if the person is a woman?
The Analogy: Imagine a translator who is slightly more confident and fluent when talking about men, but gets a little nervous and makes more mistakes when talking about women.

The Verdict: The results were a bit mixed, but there was a hint of bias. In some cases, the robots translated sentences about men slightly better than sentences about women. It's as if the robot's "muscle memory" is stronger for male stories because it has seen them more often in its training data.
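The "mirror test" boils down to splitting a test set by the gender marked in the source and scoring each half separately. The sketch below assumes that setup with invented Basque sentences and a toy token-overlap score standing in for a real metric like chrF or BLEU; none of it is the paper's actual data or tooling.

```python
# Hypothetical FLORES+Gender-style comparison with invented examples.

def token_f1(hypothesis, reference):
    """Toy quality proxy: F1 over shared tokens (real work would use chrF/BLEU)."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    shared = len(hyp & ref)
    if shared == 0:
        return 0.0
    precision, recall = shared / len(hyp), shared / len(ref)
    return 2 * precision * recall / (precision + recall)

# (source_gender, system_output, reference) triples — invented sentences.
samples = [
    ("masculine", "gidaria garaiz iritsi zen",   "gidaria garaiz iritsi zen"),
    ("masculine", "medikua lanean ari da",       "medikua lanean ari da gaur"),
    ("feminine",  "gidaria berandu iritsi zen",  "gidaria garaiz iritsi zen"),
    ("feminine",  "medikua atseden hartzen",     "medikua lanean ari da gaur"),
]

def mean_score(gender):
    """Average translation quality over sentences with the given source gender."""
    scores = [token_f1(hyp, ref) for g, hyp, ref in samples if g == gender]
    return sum(scores) / len(scores)

gap = mean_score("masculine") - mean_score("feminine")
print(f"masc avg={mean_score('masculine'):.2f}  "
      f"fem avg={mean_score('feminine'):.2f}  gap={gap:.2f}")
```

A positive gap on a large, matched test set is the "stronger muscle memory for male stories" the verdict describes; on matched sentence pairs, the gap isolates the effect of gender from sentence difficulty.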

The Big Picture

The researchers found that even though Basque is a language that naturally treats men and women equally, the robots translating into or out of Basque are bringing their own baggage with them. They are acting like a broken mirror that distorts reality to fit an old stereotype.

Why does this matter?
If we use these robots to translate job ads, news, or medical advice, they might accidentally tell a woman she can't be a mechanic or tell a man he can't be a nurse. This paper is a wake-up call: we need to build better "glasses" for these robots so they can see the real world, not just the biased world they were trained on.

In short: The robots are smart, but they are also a bit sexist. The researchers built new tools to catch them in the act, proving that we need to teach them to be fairer, especially for languages like Basque that deserve to be treated with respect.