Optimizing Language Models for Crosslingual Knowledge Consistency

This paper introduces Direct Consistency Optimization (DCO), a reinforcement-learning-inspired method that significantly improves crosslingual knowledge consistency in large language models. DCO derives a structured reward function directly from the model itself, eliminating the need for an explicit reward model while outperforming existing approaches.

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

Published 2026-03-06

This post explains the paper "Optimizing Language Models for Crosslingual Knowledge Consistency" in simple language, with creative analogies.

The Problem: The Contradictory Translator

Imagine you have a brilliant, multilingual assistant named "AI." You ask them a simple question in English: "What is the capital of the Netherlands?" They confidently reply, "Amsterdam."

But then, you ask the exact same question in Dutch: "Wat is de hoofdstad van Nederland?" Suddenly, the AI gets confused and answers, "Rotterdam."

This is the problem the paper tackles. Large Language Models (LLMs) are like overworked students who studied for a test in many different languages without realizing it was the same test. They often give inconsistent answers depending on the language of the question, which makes them unreliable: if they're right in English but wrong in Spanish, you can't fully trust them in any language.

The Old Way: The "Majority Vote" (CALM)

Previous attempts to fix this were a bit clumsy. One method (called CALM) was like asking a group of 10 people the same question in 10 different languages, then taking a "majority vote" to decide the right answer.

  • The Flaw: If you only have two languages (like English and Swahili), you can't take a majority vote. Also, if you include languages the AI is bad at, the "vote" gets noisy and wrong. It's like trying to find the truth by asking a room full of people where 90% are guessing randomly.
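The majority-vote idea can be sketched in a few lines: collect the model's answer to the same question in each language, then keep the most common one. A minimal illustration (the per-language answer strings are hypothetical stand-ins for actual model outputs, not results from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer most languages agree on, plus its vote count."""
    counts = Counter(answers.values())
    answer, votes = counts.most_common(1)[0]
    return answer, votes

# Hypothetical per-language answers to "What is the capital of the Netherlands?"
answers = {
    "en": "Amsterdam",
    "nl": "Amsterdam",
    "fr": "Paris",       # a noisy guess from a weaker language
    "sw": "Rotterdam",   # another noisy guess
}
best, votes = majority_vote(answers)  # → ("Amsterdam", 2)
```

Note the flaw the text describes: with only two languages that disagree, every answer gets one vote and there is no real majority, and adding more weak languages just adds more noise to the count.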

The New Solution: DCO (Direct Consistency Optimization)

The authors propose a new method called DCO. Think of DCO not as a teacher grading a test, but as a conductor tuning an orchestra.

1. The Core Idea: "The Echo Chamber"

In the old days, the AI learned by being told, "This answer is right, that one is wrong." DCO is different. It doesn't need a human to say "Right" or "Wrong."

Instead, it asks the AI: "If you answered this question in English, what would you say? Now, if you answered it in French, what would you say? Do those answers match?"

If the AI says "Amsterdam" in English but "Rotterdam" in French, DCO says, "Hey, you're contradicting yourself! Let's adjust your internal settings so that your 'English brain' and your 'French brain' agree on the same fact."
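One way to picture this self-derived reward: score each candidate answer by how much probability the model's *other* languages assign to the same fact, so consistent answers are rewarded and contradictions are penalized. This is only a rough sketch of the idea; the function and scoring rule below are illustrative, not the paper's exact objective:

```python
import math

def consistency_reward(logprobs_by_lang, answer):
    """Score an answer by the average log-probability that each
    language's distribution assigns to it (illustrative sketch)."""
    scores = [lp[answer] for lp in logprobs_by_lang.values() if answer in lp]
    return sum(scores) / len(scores)

# Toy per-language log-probabilities over candidate answers
logprobs = {
    "en": {"Amsterdam": math.log(0.8), "Rotterdam": math.log(0.2)},
    "nl": {"Amsterdam": math.log(0.7), "Rotterdam": math.log(0.3)},
    "fr": {"Amsterdam": math.log(0.3), "Rotterdam": math.log(0.7)},  # confused
}
r_ams = consistency_reward(logprobs, "Amsterdam")
r_rot = consistency_reward(logprobs, "Rotterdam")
# The answer the languages agree on ("Amsterdam") earns the higher reward,
# so training nudges the French distribution toward it.
```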

2. The Magic Formula: The "Product of Experts"

The paper describes the AI's new brain as a "Product of Experts."

Imagine the AI has a different "expert" for every language it knows.

  • The English Expert says: "I think the answer is Amsterdam."
  • The French Expert says: "I think the answer is Paris." (Wait, that's wrong).
  • The Dutch Expert says: "I think the answer is Amsterdam."

DCO forces these experts to talk to each other. It creates a rule: "Your final answer must be a blend of what all the language experts say." If the English and Dutch experts agree, but the French expert is confused, the system nudges the French expert to listen to the others.

This creates a self-correcting loop. The AI uses its own knowledge in one language to fix its knowledge in another, without needing a human to step in and say "No, that's wrong."
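The "blend of what all the experts say" has a standard form: a Product of Experts multiplies the per-language probabilities together and renormalizes, so an answer only scores high if every expert gives it some support. A toy sketch with made-up numbers (the combination rule is the generic PoE formula, not the paper's exact parameterization):

```python
import math

def product_of_experts(dists):
    """Combine per-language answer distributions by multiplying
    probabilities and renormalizing (a standard Product of Experts)."""
    candidates = set().union(*[d.keys() for d in dists])
    joint = {c: math.prod(d.get(c, 1e-9) for d in dists) for c in candidates}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

# Toy "experts": English and Dutch agree, French is confused
experts = [
    {"Amsterdam": 0.8, "Paris": 0.1, "Rotterdam": 0.1},  # English
    {"Amsterdam": 0.1, "Paris": 0.8, "Rotterdam": 0.1},  # French
    {"Amsterdam": 0.8, "Paris": 0.1, "Rotterdam": 0.1},  # Dutch
]
combined = product_of_experts(experts)
# Agreement dominates: Amsterdam's product (0.8*0.1*0.8 = 0.064)
# beats Paris's (0.1*0.8*0.1 = 0.008), so the blend picks Amsterdam.
```

This is why the confused French expert gets outvoted: multiplying probabilities punishes any answer that even one confident expert rejects, while answers with broad support survive.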

3. The "Direction Dial" (Controlling the Flow)

One of the coolest features of DCO is that you can control who learns from whom.

Imagine a dial with two settings: English and Swahili.

  • Default Setting: Both learn from each other equally.
  • English-Stable Mode: You turn the dial so English stays exactly as it is (because it's already very good), and Swahili is forced to change to match English.
  • Swahili-Stable Mode: You do the reverse (rarely done, but possible).

This is like a mentorship program. If you have a master chef (English) and a trainee (Swahili), you can tell the trainee: "Copy the master's recipe exactly." You don't want the trainee to accidentally change the master's recipe. DCO lets you set this "mentorship" direction easily.
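The "dial" can be cartooned as an interpolation weight: in English-stable mode, only the trainee's distribution moves toward the mentor's, and the mentor is left untouched. This is a simplification of the idea only; the paper controls direction through weights in its training objective, not a literal post-hoc interpolation:

```python
def nudge(dist, target, alpha):
    """Move one language's answer distribution toward another's.
    alpha=0 leaves it unchanged; alpha=1 copies the target outright.
    (A cartoon of DCO's direction control, not its actual update rule.)"""
    keys = dist.keys() | target.keys()
    return {k: (1 - alpha) * dist.get(k, 0.0) + alpha * target.get(k, 0.0)
            for k in keys}

english = {"Amsterdam": 0.9, "Rotterdam": 0.1}   # the "master chef"
swahili = {"Amsterdam": 0.3, "Rotterdam": 0.7}   # the "trainee"

# English-stable mode: Swahili moves toward English; English stays fixed.
swahili_after = nudge(swahili, english, alpha=0.5)
# Swahili's mass on the correct answer rises from 0.3 to 0.6,
# while the English distribution is never touched.
```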

Why This Matters

  1. No Human Needed: Unlike other methods that need humans to grade answers, DCO teaches the AI to be consistent with itself. It's like a student studying their own notes to find contradictions.
  2. Works Everywhere: It works whether you are teaching the AI 2 languages or 20.
  3. Better Accuracy: Surprisingly, by forcing the AI to be consistent, it actually gets smarter. When the "English brain" and "French brain" agree, they reinforce each other, leading to fewer mistakes.

The Bottom Line

This paper introduces a way to make AI models honest across languages. Instead of acting like a confused tourist who speaks different languages but tells different stories in each, DCO helps the AI become a unified, reliable expert who tells the same truth, no matter which language you ask in.

It's like giving the AI a single, global memory bank, ensuring that "Amsterdam" is "Amsterdam" whether you ask in English, Japanese, or Swahili.