Optimizing Language Models for Crosslingual Knowledge Consistency

This paper introduces Direct Consistency Optimization (DCO), a reinforcement-learning-inspired method that significantly improves crosslingual knowledge consistency in large language models. DCO derives a structured reward function directly from the model itself, eliminating the need for an explicit reward model while outperforming existing approaches.

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

Published 2026-03-06

This post explains the paper "Optimizing Language Models for Crosslingual Knowledge Consistency" in simple language, with creative analogies.

The Problem: The Contradictory Translator

Imagine you have a brilliant, multilingual assistant named "AI." You ask them a simple question in English: "What is the capital of the Netherlands?" They confidently reply, "Amsterdam."

But then, you ask the exact same question in Dutch: "Wat is de hoofdstad van Nederland?" Suddenly, the AI gets confused and answers, "Rotterdam."

This is the problem the paper tackles. Large Language Models (LLMs) are like overworked students who studied for a test in many different languages without realizing it was the same test. They often give inconsistent answers depending on the language of the question, which makes them unreliable: if they're right in English but wrong in Spanish, you can't fully trust them in any language.

The Old Way: The "Majority Vote" (CALM)

Previous attempts to fix this were a bit clumsy. One method (called CALM) was like asking a group of 10 people the same question in 10 different languages, then taking a "majority vote" to decide the right answer.

  • The Flaw: If you only have two languages (like English and Swahili), you can't take a majority vote. Also, if you include languages the AI is bad at, the "vote" gets noisy and wrong. It's like trying to find the truth by asking a room full of people where 90% are guessing randomly.
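The majority-vote idea can be sketched in a few lines: collect the model's answer to the same question in each language, then keep the most common one. A minimal illustration (the per-language answer strings are hypothetical stand-ins for actual model outputs, not results from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer most languages agree on, plus its vote count."""
    counts = Counter(answers.values())
    answer, votes = counts.most_common(1)[0]
    return answer, votes

# Hypothetical per-language answers to "What is the capital of the Netherlands?"
answers = {
    "en": "Amsterdam",
    "nl": "Amsterdam",
    "fr": "Paris",       # a noisy guess from a weaker language
    "sw": "Rotterdam",   # another noisy guess
}
best, votes = majority_vote(answers)  # → ("Amsterdam", 2)
```

Note the flaw the text describes: with only two languages that disagree, every answer gets one vote and there is no real majority, and adding more weak languages just adds more noise to the count.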

The New Solution: DCO (Direct Consistency Optimization)

The authors propose a new method called DCO. Think of DCO not as a teacher grading a test, but as a conductor tuning an orchestra.

1. The Core Idea: "The Echo Chamber"

In the old days, the AI learned by being told, "This answer is right, that one is wrong." DCO is different. It doesn't need a human to say "Right" or "Wrong."

Instead, it asks the AI: "If you answered this question in English, what would you say? Now, if you answered it in French, what would you say? Do those answers match?"

If the AI says "Amsterdam" in English but "Rotterdam" in French, DCO says, "Hey, you're contradicting yourself! Let's adjust your internal settings so that your 'English brain' and your 'French brain' agree on the same fact."
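One way to picture this self-derived reward: score each candidate answer by how much probability the model's *other* languages assign to the same fact, so consistent answers are rewarded and contradictions are penalized. This is only a rough sketch of the idea; the function and scoring rule below are illustrative, not the paper's exact objective:

```python
import math

def consistency_reward(logprobs_by_lang, answer):
    """Score an answer by the average log-probability that each
    language's distribution assigns to it (illustrative sketch)."""
    scores = [lp[answer] for lp in logprobs_by_lang.values() if answer in lp]
    return sum(scores) / len(scores)

# Toy per-language log-probabilities over candidate answers
logprobs = {
    "en": {"Amsterdam": math.log(0.8), "Rotterdam": math.log(0.2)},
    "nl": {"Amsterdam": math.log(0.7), "Rotterdam": math.log(0.3)},
    "fr": {"Amsterdam": math.log(0.3), "Rotterdam": math.log(0.7)},  # confused
}
r_ams = consistency_reward(logprobs, "Amsterdam")
r_rot = consistency_reward(logprobs, "Rotterdam")
# The answer the languages agree on ("Amsterdam") earns the higher reward,
# so training nudges the French distribution toward it.
```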

2. The Magic Formula: The "Product of Experts"

The paper describes the AI's new brain as a "Product of Experts."

Imagine the AI has a different "expert" for every language it knows.

  • The English Expert says: "I think the answer is Amsterdam."
  • The French Expert says: "I think the answer is Paris." (Wait, that's wrong).
  • The Dutch Expert says: "I think the answer is Amsterdam."

DCO forces these experts to talk to each other. It creates a rule: "Your final answer must be a blend of what all the language experts say." If the English and Dutch experts agree, but the French expert is confused, the system nudges the French expert to listen to the others.

This creates a self-correcting loop. The AI uses its own knowledge in one language to fix its knowledge in another, without needing a human to step in and say "No, that's wrong."
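The "blend of what all the experts say" has a standard form: a Product of Experts multiplies the per-language probabilities together and renormalizes, so an answer only scores high if every expert gives it some support. A toy sketch with made-up numbers (the combination rule is the generic PoE formula, not the paper's exact parameterization):

```python
import math

def product_of_experts(dists):
    """Combine per-language answer distributions by multiplying
    probabilities and renormalizing (a standard Product of Experts)."""
    candidates = set().union(*[d.keys() for d in dists])
    joint = {c: math.prod(d.get(c, 1e-9) for d in dists) for c in candidates}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

# Toy "experts": English and Dutch agree, French is confused
experts = [
    {"Amsterdam": 0.8, "Paris": 0.1, "Rotterdam": 0.1},  # English
    {"Amsterdam": 0.1, "Paris": 0.8, "Rotterdam": 0.1},  # French
    {"Amsterdam": 0.8, "Paris": 0.1, "Rotterdam": 0.1},  # Dutch
]
combined = product_of_experts(experts)
# Agreement dominates: Amsterdam's product (0.8*0.1*0.8 = 0.064)
# beats Paris's (0.1*0.8*0.1 = 0.008), so the blend picks Amsterdam.
```

This is why the confused French expert gets outvoted: multiplying probabilities punishes any answer that even one confident expert rejects, while answers with broad support survive.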

3. The "Direction Dial" (Controlling the Flow)

One of the coolest features of DCO is that you can control who learns from whom.

Imagine a dial with two settings: English and Swahili.

  • Default Setting: Both learn from each other equally.
  • English-Stable Mode: You turn the dial so English stays exactly as it is (because it's already very good), and Swahili is forced to change to match English.
  • Swahili-Stable Mode: You do the reverse (rarely done, but possible).

This is like a mentorship program. If you have a master chef (English) and a trainee (Swahili), you can tell the trainee: "Copy the master's recipe exactly." You don't want the trainee to accidentally change the master's recipe. DCO lets you set this "mentorship" direction easily.
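The "dial" can be cartooned as an interpolation weight: in English-stable mode, only the trainee's distribution moves toward the mentor's, and the mentor is left untouched. This is a simplification of the idea only; the paper controls direction through weights in its training objective, not a literal post-hoc interpolation:

```python
def nudge(dist, target, alpha):
    """Move one language's answer distribution toward another's.
    alpha=0 leaves it unchanged; alpha=1 copies the target outright.
    (A cartoon of DCO's direction control, not its actual update rule.)"""
    keys = dist.keys() | target.keys()
    return {k: (1 - alpha) * dist.get(k, 0.0) + alpha * target.get(k, 0.0)
            for k in keys}

english = {"Amsterdam": 0.9, "Rotterdam": 0.1}   # the "master chef"
swahili = {"Amsterdam": 0.3, "Rotterdam": 0.7}   # the "trainee"

# English-stable mode: Swahili moves toward English; English stays fixed.
swahili_after = nudge(swahili, english, alpha=0.5)
# Swahili's mass on the correct answer rises from 0.3 to 0.6,
# while the English distribution is never touched.
```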

Why This Matters

  1. No Human Needed: Unlike other methods that need humans to grade answers, DCO teaches the AI to be consistent with itself. It's like a student studying their own notes to find contradictions.
  2. Works Everywhere: It works whether you are teaching the AI 2 languages or 20.
  3. Better Accuracy: Surprisingly, by forcing the AI to be consistent, it actually gets smarter. When the "English brain" and "French brain" agree, they reinforce each other, leading to fewer mistakes.

The Bottom Line

This paper introduces a way to make AI models honest across languages. Instead of acting like a confused tourist who speaks different languages but tells different stories in each, DCO helps the AI become a unified, reliable expert who tells the same truth, no matter which language you ask in.

It's like giving the AI a single, global memory bank, ensuring that "Amsterdam" is "Amsterdam" whether you ask in English, Japanese, or Swahili.