Here is an explanation of the paper "Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck" using simple language and creative analogies.
The Problem: The "Foreign Accent" Bias
Imagine you are a famous food critic (the LLM Judge) hired to taste dishes from all over the world. Your job is to decide which dish is the most delicious.
However, there's a glitch in your brain. You have a secret bias: you prefer dishes whose descriptions read like they were translated from English.
- The Scenario: You are tasting a traditional dish from a small village, described in a language with very little text online (a "low-resource" language).
- Dish A (Human): A local chef cooked it. It tastes authentic, but the description is written in a natural, local style.
- Dish B (Machine): A robot translated the recipe from English into the local language. It sounds a bit stiff and "off," like a foreigner trying to speak the language.
- The Flaw: Even though Dish A is better, your brain prefers Dish B. Why? Because Dish B's sentence structure accidentally reminds you of English, which is the language your brain was trained on most heavily. You think, "This sounds smart and structured," when really, it just sounds like a bad translation.
This is called "Translationese Bias." The AI judges prefer machine-translated text over human-written text, especially in languages with little text on the internet. This makes the AI a terrible judge for those languages.
The Cause: Two "Bad Habits"
The researchers found that the AI has two specific bad habits causing this:
- The "English Echo" (Latent Manifold Alignment): The AI's internal brain is shaped like an English speaker's brain. When it sees text that looks like English (even if it's in Swahili or Pashto), it feels comfortable and gives it a high score.
- The "Predictability Trap" (Cross-lingual Predictability): Machine translations are often very predictable and follow strict statistical patterns. The AI loves predictability because it's easy to guess what comes next. It mistakes "easy to guess" for "good quality."
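The "Predictability Trap" can be made concrete with a toy sketch. This is not the paper's measure; it's a minimal illustration using perplexity under a unigram model, where repetitive, formulaic text (like typical machine-translation output) scores as more "predictable" (lower perplexity) than varied, natural text:

```python
import math
from collections import Counter

def unigram_perplexity(text):
    # Perplexity under a unigram model fit to the text itself:
    # lower perplexity = more repetitive/predictable wording.
    tokens = text.split()
    counts = Counter(tokens)
    n = len(tokens)
    log_prob = sum(math.log(counts[t] / n) for t in tokens)
    return math.exp(-log_prob / n)

repetitive = "the cat sat on the mat the cat sat on the mat"
varied = "a quick brown fox jumped over one lazy sleeping dog"

# The repetitive sentence is easier to guess, so its perplexity is lower.
assert unigram_perplexity(repetitive) < unigram_perplexity(varied)
```

A biased judge that rewards low perplexity would rank the repetitive text higher, mistaking "easy to guess" for "good quality."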
The Solution: The "Disentangled Information Bottleneck" (DIBJUDGE)
To fix this, the researchers built a new training system called DIBJUDGE. Think of this system as a strict bouncer at a club who forces the AI to separate its thoughts into two different rooms.
1. The Two Rooms (Disentanglement)
Instead of letting the AI mix all its thoughts together, DIBJUDGE forces it to split its brain into two distinct channels:
- Room A: The "Truth Room" (Robust Representation). This room is for the actual meaning of the text. Is the answer correct? Is the story logical? Does it make sense?
- Room B: The "Noise Room" (Bias Representation). This room is for the bad habits. Does this sound like English? Is it too predictable? Is it a machine translation?
The goal is to make sure the "Truth Room" never sees the "Noise." The AI must learn to judge the food based only on the taste (meaning), ignoring the fact that the menu was printed in a font that looks like English.
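The two-room split can be sketched in a few lines. This is a hypothetical toy (the names `W_truth`, `W_noise`, and the dimensions are illustrative, not from the paper): one pooled hidden vector is projected into two separate channels, and only the "Truth Room" channel feeds the final quality score:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = rng.normal(size=16)          # the judge model's pooled hidden state
W_truth = rng.normal(size=(8, 16))    # projection into the "Truth Room" (content)
W_noise = rng.normal(size=(8, 16))    # projection into the "Noise Room" (bias)

z_truth = W_truth @ hidden            # semantic representation, used for judging
z_noise = W_noise @ hidden            # absorbs translation artifacts

# Only the Truth Room is connected to the scoring head.
w_score = rng.normal(size=8)
score = float(w_score @ z_truth)
```

In a real model these projections would be learned, and the training losses below are what force the two channels to carry different information.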
2. The "Compression" (Information Bottleneck)
Imagine you are trying to describe a complex painting to a friend over a phone call with a bad connection. You can't send the whole picture; you have to compress it.
- Old Way: You try to send everything, including the frame, the dust on the canvas, and the artist's signature. The friend gets confused by the extra noise.
- DIBJUDGE Way: The system forces the AI to compress the message down to the bare minimum needed to make a good judgment. It throws away the "dust" (the translation artifacts) and keeps only the "painting" (the semantic meaning).
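The "compression" idea is usually implemented as a variational information-bottleneck penalty. As a hedged sketch (standard VIB math, not necessarily the paper's exact loss): the encoder outputs a Gaussian code N(mu, sigma^2), and training penalizes its KL divergence from a standard normal prior, which pressures the code to carry as few bits as possible:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over code dimensions.
    # This is the "compression cost" of the representation.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.3, 0.0])
log_var = np.array([0.0, -1.0, 0.2])
penalty = kl_to_standard_normal(mu, log_var)

# A code identical to the prior carries no information and costs nothing:
assert kl_to_standard_normal(np.zeros(3), np.zeros(3)) == 0.0
```

Anything the code keeps (like "this sounds English-like") costs penalty, so the model learns to spend its limited budget only on what actually predicts judgment quality.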
3. The "Anti-Correlation" Penalty
During training, the system adds a rule: "If the 'Truth Room' and the 'Noise Room' start talking to each other, you get a penalty."
This forces the two rooms to stay completely separate. The AI learns that to get a good score, it must ignore the "English-like" patterns and focus purely on the content.
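One common way to implement such a penalty (a sketch, not necessarily the paper's exact formulation) is to penalize the cross-covariance between the two channels over a batch: if the Truth and Noise representations co-vary, the penalty is large; if they are statistically independent, it is near zero:

```python
import numpy as np

def decorrelation_penalty(z_truth, z_noise):
    # Centre each channel over the batch, then penalise the squared
    # entries of the cross-covariance matrix between the two channels.
    zt = z_truth - z_truth.mean(axis=0)
    zn = z_noise - z_noise.mean(axis=0)
    cross_cov = zt.T @ zn / len(zt)
    return float(np.sum(cross_cov**2))

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 4))
independent = rng.normal(size=(256, 4))   # unrelated channel: small penalty
copied = a.copy()                         # identical channel: large penalty

assert decorrelation_penalty(a, copied) > decorrelation_penalty(a, independent)
```

Adding this term to the training loss is the "rooms talking to each other" fine: gradient descent drives the cross-covariance toward zero, separating content from translationese signals.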
The Results: A Fairer Judge
The researchers tested this new system on many different languages, from major ones like Spanish to rare ones like Yoruba.
- Before: The AI was a snob. It loved English-like translations and hated authentic local writing, especially in rare languages.
- After: The AI became a fair critic. It stopped caring about whether the text sounded like a translation. It started judging based on actual quality.
- It reduced the bias by 50% to 80% depending on the language.
- It didn't lose its ability to judge; in fact, it got better at judging because it wasn't distracted by the "foreign accent."
The Takeaway
This paper is about teaching AI to stop being a "copycat" that prefers things that sound like English. By forcing the AI to separate "what the text means" from "how the text sounds," they created a judge that is fair to everyone, regardless of which language they speak. It's like teaching a food critic to ignore the font on the menu and actually taste the food.