Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?

This paper investigates whether linguistically related pivot languages and few-shot demonstrations can guide Large Language Models in low-resource machine translation without parameter updates. It finds that while such inference-time prompting offers modest improvements for poorly represented languages, its effectiveness is often inconsistent and highly sensitive to how the examples are constructed.

Aishwarya Ramasethu, Niyathi Allu, Rohin Garg, Harshwardhan Fartale, Dun Li Chan

Published 2026-03-18

Imagine you are trying to teach a brilliant, well-traveled chef (the Large Language Model or LLM) how to cook a very specific, rare dish from a remote village. The chef has never been to this village, doesn't speak the local dialect, and has no recipe book for it.

This paper asks a simple question: Can we help this chef cook the dish by showing them a recipe from a neighboring village that speaks a similar language, along with a few sample dishes?

Here is the breakdown of the study using everyday analogies:

1. The Problem: The "Language Gap"

Big AI models are like super-chefs who know how to cook French, Italian, and Chinese cuisine perfectly. But for languages that are rare or "low-resource" (like Konkani in India or Tunisian Arabic), the chef has never seen the ingredients. If you just ask the chef to "Cook this in Konkani," they might guess and serve you Italian food instead because that's what they know best.

Usually, to fix this, you'd need to hire a team of linguists to write thousands of new recipes (training data) and retrain the chef. But that takes too much time and money.

2. The Solution: The "Pivot" and the "Cheat Sheet"

The researchers tried a lighter approach. Instead of retraining the chef, they used two tricks during the cooking process:

  • The Pivot Language (The Bridge): They picked a language the chef does know that is similar to the target.

    • For Konkani, they used Marathi (a neighboring language).
    • For Tunisian Arabic, they used Modern Standard Arabic (the formal version).
    • Analogy: It's like telling the chef, "First, translate this English order into Marathi (which you know), and then use that Marathi version as a bridge to figure out the Konkani."
  • Few-Shot Examples (The Cheat Sheet): They showed the chef 3 to 5 examples of "English → Marathi → Konkani" right before asking them to cook the new dish.

    • Analogy: It's like handing the chef a small notepad that says, "See how we turned 'Hello' into Konkani? Do it like that."
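To make the "bridge plus cheat sheet" idea concrete, here is a minimal sketch of how such a pivot-augmented few-shot prompt might be assembled. The `build_pivot_prompt` helper and the demonstration sentences are illustrative assumptions, not code or data from the paper; a real setup would draw the demonstration triples from a parallel corpus.

```python
# Minimal sketch of assembling a few-shot, pivot-augmented translation prompt.
# The demonstration triples below are hypothetical placeholders, not real
# translations from the paper.

def build_pivot_prompt(demos, source_sentence,
                       src="English", pivot="Marathi", tgt="Konkani"):
    """Build a prompt showing English -> pivot -> target demonstrations,
    then ask the model to translate a new sentence the same way."""
    lines = [f"Translate from {src} to {tgt}, using {pivot} as a bridge.\n"]
    for src_text, pivot_text, tgt_text in demos:
        lines.append(f"{src}: {src_text}")
        lines.append(f"{pivot}: {pivot_text}")
        lines.append(f"{tgt}: {tgt_text}\n")
    # End with the new sentence and an open pivot slot for the model to fill.
    lines.append(f"{src}: {source_sentence}")
    lines.append(f"{pivot}:")
    return "\n".join(lines)

# Hypothetical demonstration triples (placeholders, not real translations).
demos = [
    ("Hello.", "<Marathi translation>", "<Konkani translation>"),
    ("How are you?", "<Marathi translation>", "<Konkani translation>"),
]
prompt = build_pivot_prompt(demos, "Where is the market?")
print(prompt)
```

Note that the paper's later finding (more than 3–4 examples hurts quality) would correspond here to simply keeping the `demos` list short.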

3. The Experiment: Two Different Chefs

They tested this method on two different "chefs" (AI models):

  1. Hermes: A general-purpose chef who is good at many things but not specialized in translation.
  2. Tower: A chef who was specifically trained to be a translator.

They tested this on two languages:

  • Konkani: A language with very few digital resources (like a remote village with no internet).
  • Tunisian Arabic: A dialect that is slightly better represented in the chef's memory because it shares a script with the formal Arabic they know.

4. The Results: It Depends on the Village

The findings were mixed; whether the trick helped depended on the village:

  • For the "Remote Village" (Konkani): The method worked! The "Pivot" (Marathi) and the "Cheat Sheet" (examples) helped the chef stop guessing and actually produce Konkani. The translation quality improved, though it wasn't perfect.

    • Takeaway: When the chef knows nothing about the target, a linguistic cousin (pivot) acts as a helpful crutch.
  • For the "Semi-Known Village" (Tunisian Arabic): The method didn't help much. The chef was already pretty good at this because the script and words were similar to what they already knew. Adding the extra steps (the pivot) didn't make the dish taste better; sometimes it even confused the chef.

    • Takeaway: If the chef already has a decent idea of the language, adding a bridge might just be extra noise.
  • The "Too Much Info" Problem: They found that giving the chef more examples (more than 3 or 4) actually made the results worse.

    • Analogy: It's like giving a chef a 50-page manual when they only needed a 3-step sticky note. The chef got overwhelmed and started making mistakes.

5. The Big Conclusion

This paper shows that you don't always need to rebuild the whole kitchen (retrain the AI) to cook a new dish.

  • When it works: If the target language is totally new to the AI, using a "cousin" language as a bridge and showing a few examples can guide the AI to the right answer without needing expensive training.
  • When it fails: If the AI already knows the language well, or if you give it too many confusing examples, this trick doesn't help.

In short: Think of this as a "lightweight" translation hack. It's not a magic wand that solves everything, but for languages that are currently ignored by big tech, it's a clever, low-cost way to get a decent meal on the table.
