Is Continuous CoT Better Suited for Multilingual Reasoning?

Imagine you are trying to teach a brilliant but very literal student how to solve complex puzzles. This student is an AI, and the puzzles are math problems and common-sense questions.

The big problem the researchers found is that this student is amazing at solving puzzles written in English (a "high-resource" language with plenty of training data), but gets confused and makes more mistakes when the same puzzles are written in a "low-resource" language such as Urdu.

This paper asks a simple question: Is there a way to teach the student to "think" in a way that doesn't depend on the specific language they are speaking?

Here is the breakdown of their solution, using some everyday analogies.

The Old Way: The "Translator" Problem

Traditionally, when an AI tries to solve a problem in a foreign language, it often does one of two things:

Translate first: It translates the foreign question into English, thinks in English, and then translates the answer back.
- The Flaw: This is like trying to explain a joke in a different language by translating it word-for-word. You often lose the nuance, the humor, or the specific cultural context. The "translation step" acts as a bottleneck where information gets lost.
Think out loud in the target language: The AI tries to write out its reasoning step-by-step in Urdu or Chinese.
- The Flaw: If the AI hasn't seen enough examples of "thinking out loud" in Urdu during its training, it gets stuck. It's like asking someone to write a complex essay in a language they only know a few words of.

The New Way: The "Silent Sketch" (Continuous CoT)

The researchers tried a different approach called Continuous Chain-of-Thought (CODI).

Instead of forcing the AI to write out every single thought in words (tokens), they taught it to "think" in a silent, internal sketch.

The Analogy: Imagine you are solving a math problem.
- Standard AI (CoT-SFT): It writes out every step on a piece of paper: "First, I add 5 and 3. That makes 8. Then I multiply by 2..." This takes up a lot of space and relies heavily on knowing the exact words for "add," "multiply," etc.
- The New AI (CODI): It doesn't write the words. Instead, it draws a quick, abstract mental map or a "feeling" of the solution in its mind. It's like a chef who doesn't need to read a recipe book to know how to make a soup; they just know the flow of flavors.

Why is this better for different languages?

The researchers tested this on five very different languages: English, Chinese, German, French, and Urdu.

Language Invariance (The Universal Shape):
Think of the "meaning" of a math problem as a shape. In English, the shape is drawn with English letters. In Urdu, it's drawn with Urdu letters. But the shape itself (the logic) is the same.
The "Silent Sketch" method teaches the AI to recognize the shape of the logic, rather than the letters used to describe it. Because the "sketch" is abstract, it works almost the same whether the problem is in English or Urdu. It's like recognizing a friend's face whether they are wearing a hat, sunglasses, or a scarf.
The "Zero-Shot" Miracle:
The most impressive result happened with Urdu. The researchers trained the AI on English, German, French, and Chinese, but never showed it Urdu during training.
- Standard AI: When asked an Urdu question, it struggled, because its written-out reasoning steps are tied to the words of the languages it was trained on.
- Silent Sketch AI: Even though it had never seen Urdu, it could still solve the puzzle! It figured out that the "shape" of the logic in Urdu was similar enough to the shapes it learned in other languages. It generalized the skill.
Extreme Efficiency (The Compression):
Writing out thoughts takes a lot of space.
- Standard AI: To solve a problem, it might write 300 words of reasoning.
- Silent Sketch AI: It compresses those 300 words into a tiny, dense "thought packet" that is only about 6 units long.
- The Result: The new method uses roughly 29 to 50 times fewer reasoning steps. It's like handing over a tiny data chip instead of mailing a 50-page letter.

The Verdict

The paper concludes that teaching AI to "think" in a continuous, abstract space (like a mental sketch) carries over between languages better than forcing it to "talk" its way through the problem.

For high-resource languages (like English): Both methods work okay, though the old way sometimes wins slightly.
For low-resource languages (like Urdu): The "Silent Sketch" method wins by a landslide. It bridges the gap between languages, making the AI fairer and more capable for everyone, regardless of what language they speak.

In short: Instead of teaching the AI to speak every language perfectly, they taught it to think in a way that transcends language entirely.

Here is a detailed technical summary of the paper "Is Continuous CoT Better Suited for Multilingual Reasoning?"

1. Problem Statement

Large Language Models (LLMs) exhibit significant performance disparities across languages. While they perform well on high-resource languages (e.g., English), their reasoning capabilities degrade substantially in low-resource languages. Existing solutions face critical limitations:

Translation-based approaches: Translating prompts to a pivot language (like English) before reasoning introduces bottlenecks and loses linguistic nuances.
Multilingual Fine-tuning: Directly fine-tuning on multilingual Chain-of-Thought (CoT) data scales poorly and risks "catastrophic forgetting" as more languages are added.
Token Inefficiency: Standard CoT requires generating verbose natural language tokens for every reasoning step, which is computationally expensive.

The authors investigate whether performing reasoning in a continuous latent space (rather than explicit token space) can yield more robust, language-agnostic representations that generalize better to low-resource languages.

2. Methodology

The study compares two fine-tuning strategies on the LLaMA3.2-1B-Instruct base model across five typologically diverse languages: English, Chinese, German, French, and Urdu.

Datasets

GSM8k-Aug-NL: A math reasoning dataset (385k training samples) with CoT traces.
CommonsenseQA-CoT: A commonsense reasoning dataset (~8.1k training samples) with CoT annotations.
Multilingual Construction: Questions were translated into target languages using high-capacity models (Llama-3.3-70B, Qwen2.5-72B, GPT-5-mini) while preserving mathematical structures and numerical values. Crucially, the datasets were constructed to ensure zero overlap between languages to prevent data leakage.
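The summary does not say how the zero-overlap constraint was enforced. One possible reading is that each source question is assigned to exactly one language, so no problem appears in two languages. The sketch below illustrates that reading only; the round-robin assignment, the language codes, and the function names are assumptions, not the authors' pipeline.

```python
from itertools import combinations

# Assumed language codes for the five languages named in the text.
LANGUAGES = ["en", "zh", "de", "fr", "ur"]


def split_without_overlap(question_ids, languages):
    """Assign each source question to exactly one language (round-robin)."""
    buckets = {lang: [] for lang in languages}
    for i, qid in enumerate(question_ids):
        buckets[languages[i % len(languages)]].append(qid)
    return buckets


def has_overlap(buckets):
    """Return True if any question id appears under two different languages."""
    return any(
        set(buckets[a]) & set(buckets[b]) for a, b in combinations(buckets, 2)
    )


buckets = split_without_overlap(range(10), LANGUAGES)
print(buckets["en"])         # [0, 5]
print(has_overlap(buckets))  # False
```

A check like `has_overlap` is cheap to run before training and makes the leakage guarantee explicit rather than implied.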

Approaches Compared

CoT-SFT (Baseline): Standard Supervised Fine-Tuning where the model learns to generate explicit reasoning tokens followed by the answer.
Continuous CoT (CODI): Utilizes the CODI framework (Shen et al., 2025), which employs a self-distillation setup:
- Teacher Task: Learns standard explicit CoT generation (token-based).
- Student Task: Generates reasoning in a continuous latent space using hidden states ($Z$) propagated autoregressively between <bot> and <eot> tokens, without outputting intermediate text tokens.
- Knowledge Distillation: The student's hidden states are aligned with the teacher's hidden states immediately preceding the answer token using an L1 loss. This anchors the latent reasoning to the explicit trace, preventing drift.
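The student and distillation steps above can be illustrated with a toy sketch in plain Python. It is not the CODI implementation: `toy_step` stands in for a transformer forward pass, hidden states are short lists of floats, and averaging the L1 differences over dimensions is an assumption. Only the two ideas from the text are kept: hidden states are fed back as the next input without emitting tokens, and the student's final state is pulled toward the teacher's state just before the answer.

```python
def latent_rollout(step_fn, start_state, num_latent_steps=6):
    """Feed each hidden state back in as the next input, emitting no text tokens."""
    state = start_state
    trace = []
    for _ in range(num_latent_steps):
        state = step_fn(state)
        trace.append(state)
    return trace


def l1_distillation_loss(student_hidden, teacher_hidden):
    """Mean absolute difference between two hidden-state vectors."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("hidden states must have the same dimension")
    total = sum(abs(s - t) for s, t in zip(student_hidden, teacher_hidden))
    return total / len(student_hidden)


def toy_step(hidden):
    """Hypothetical stand-in for one forward pass of the student model."""
    return [0.5 * x + 1.0 for x in hidden]


# Six latent steps, matching the six latent tokens reported in the summary.
trace = latent_rollout(toy_step, [0.0, 2.0], num_latent_steps=6)
teacher_state = [2.0, 2.0]  # assumed teacher hidden state before the answer token
loss = l1_distillation_loss(trace[-1], teacher_state)
print(trace[-1])  # [1.96875, 2.0]
print(loss)       # 0.015625
```

In a real training loop the teacher state would be treated as a fixed target, so that only the student is updated by this loss term.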

Experimental Configurations

English-only Training: To establish baselines.
Multilingual Training (Mixed):
- Setup A: Trained on English, German, French, and Chinese (Urdu excluded) to test zero-shot generalization.
- Setup B: Trained on all five languages including Urdu.
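The three configurations can be written down as a small lookup, which makes it easy to see which languages are evaluated zero-shot in each one. The dictionary keys and the helper name are illustrative, not taken from the paper.

```python
ALL_LANGUAGES = ("English", "Chinese", "German", "French", "Urdu")

# Training languages per configuration, as described in the text.
SETUPS = {
    "english_only": ("English",),
    "setup_a": ("English", "German", "French", "Chinese"),  # Urdu held out
    "setup_b": ALL_LANGUAGES,
}


def zero_shot_languages(train_languages):
    """Languages that are evaluated but never seen during fine-tuning."""
    return [lang for lang in ALL_LANGUAGES if lang not in train_languages]


for name, train_languages in SETUPS.items():
    print(name, zero_shot_languages(train_languages))
# english_only ['Chinese', 'German', 'French', 'Urdu']
# setup_a ['Urdu']
# setup_b []
```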

3. Key Results

Performance on Low-Resource Languages

Zero-Shot Generalization: When trained without Urdu, the CODI model significantly outperformed CoT-SFT on Urdu test sets.
- Example (CommonsenseQA): CODI achieved 35.95% accuracy on Urdu zero-shot, whereas CoT-SFT reached only 34.73% even with Urdu in its training data, a gap of 1.22 percentage points.
- This suggests that continuous reasoning learns more language-invariant representations, which lets it generalize to unseen languages better than explicit token-based reasoning.
Low-Resource vs. High-Resource:
- On GSM8k, CODI performed slightly worse than CoT-SFT on high-resource languages (English, German, French) but significantly better on low-resource languages (Chinese, Urdu).
- On CommonsenseQA, CODI outperformed CoT-SFT across all languages.

Efficiency and Compression

The continuous approach offers massive efficiency gains by compressing reasoning traces:

GSM8k: Achieved a compression ratio of ~29× (6 latent tokens vs. ~176 explicit tokens).
CommonsenseQA: Achieved a compression ratio of ~50× (6 latent tokens vs. ~299 explicit tokens).
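The reported ratios follow directly from the token counts. A quick check, treating the approximate counts from the text as exact (the function name is illustrative):

```python
def compression_ratio(explicit_tokens, latent_tokens):
    """How many explicit reasoning tokens each latent token replaces."""
    return explicit_tokens / latent_tokens


gsm8k_ratio = compression_ratio(176, 6)
csqa_ratio = compression_ratio(299, 6)
print(round(gsm8k_ratio, 1))  # 29.3
print(round(csqa_ratio, 1))   # 49.8
```

Because the latent budget is fixed at 6 in both cases, the larger ratio on CommonsenseQA simply reflects its longer explicit traces.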

Comparison with Baselines

Both CODI and CoT-SFT outperformed the base model without fine-tuning across all languages. However, CODI's ability to maintain high performance in zero-shot low-resource scenarios while drastically reducing token count is the primary differentiator.

4. Key Contributions

Empirical Evidence for Language Invariance: The paper provides the first empirical evidence that continuous latent reasoning naturally exhibits greater language invariance compared to explicit token-based reasoning, leading to superior zero-shot performance in low-resource languages.
Scalability: Demonstrates a scalable solution for cross-lingual reasoning that avoids the pitfalls of translation bottlenecks and catastrophic forgetting associated with adding more languages to the training set.
Extreme Efficiency: Validates that reasoning can be compressed by 29× to 50× with little loss in reasoning quality on high-resource languages and frequent gains elsewhere, particularly in multilingual contexts.
Rigorous Evaluation: Conducted a comprehensive study across five typologically diverse languages with strict zero-overlap data construction to ensure valid cross-lingual generalization metrics.

5. Significance

This work suggests a paradigm shift in how LLMs handle multilingual reasoning. Instead of relying on explicit, verbose natural language chains that are heavily dependent on specific linguistic training data, moving reasoning into a continuous latent space allows models to abstract the logic of reasoning from the language of expression.

This approach offers a promising path toward:

Democratizing AI: Enabling high-quality reasoning capabilities in low-resource languages where data is scarce.
Cost Reduction: Drastically reducing inference costs and latency by eliminating the need to generate long chains of thought tokens.
Robustness: Creating models that are less sensitive to the specific language of the prompt, focusing instead on the underlying semantic and logical structure of the problem.

The authors conclude that continuous CoT is a superior strategy for multilingual reasoning, particularly for low-resource settings, and plan to scale this investigation to larger models and broader domains in future work.