Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

This paper presents a case study on developing a Multimodal Large Language Model for the low-resource Basque language, demonstrating that strong performance can be achieved with approximately 20% Basque multimodal data and that a Basque-adapted language model backbone is not strictly necessary.

Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

Published 2026-03-05

Imagine you have a brilliant, world-traveled chef (a Large Language Model) who can cook amazing meals in English. This chef knows everything about the world, can describe pictures, and answer questions about them. However, if you ask them to cook a traditional Basque dish or describe a picture using the Basque language, they stumble. They might understand the ingredients, but they don't know the local recipes or the specific words to describe the flavors.

This paper is about teaching that world-famous chef how to cook delicious Basque meals, without having to train a brand-new Basque chef from scratch.

Here is the story of how the researchers did it, broken down into simple concepts:

1. The Problem: The "Language Gap"

Right now, the smartest AI models are like chefs trained mostly on English recipes. If you ask them about low-resource languages (languages with very little data on the internet, like Basque), they perform poorly. It's like asking a French chef to cook a traditional Scottish stew; they might guess, but the result won't be authentic or accurate.

2. The Solution: Building a New Kitchen

Since there were no existing "Basque recipe books" (datasets) for teaching AI about images and text, the researchers had to create them from scratch.

  • The Translation Factory: They took huge libraries of English image descriptions and questions (like "What is in this picture?") and translated them into Basque.
  • The Result: They built a massive new library containing over 3 million image-text pairs in Basque. Think of this as creating a massive, high-quality cookbook specifically for Basque cuisine.
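The "translation factory" idea above can be sketched in a few lines. This is a toy illustration, not the paper's actual pipeline: the `translate_en_to_eu` function below is a hypothetical stand-in (a two-entry dictionary), where the researchers would have used a real machine-translation system.

```python
# Toy sketch of the "translation factory": take English image-text pairs
# and translate the text side into Basque, keeping the images untouched.

def translate_en_to_eu(text: str) -> str:
    """Hypothetical English->Basque translator (toy dictionary for illustration)."""
    toy_dict = {
        "What is in this picture?": "Zer dago irudi honetan?",
        "A dog on a beach.": "Txakur bat hondartzan.",
    }
    return toy_dict.get(text, text)

def build_basque_dataset(english_pairs):
    """Translate the text of each (image_id, text) pair; images stay as-is."""
    return [(image_id, translate_en_to_eu(text)) for image_id, text in english_pairs]

english_pairs = [
    ("img_001", "What is in this picture?"),
    ("img_002", "A dog on a beach."),
]
basque_pairs = build_basque_dataset(english_pairs)
for image_id, text in basque_pairs:
    print(image_id, text)
```

Scaled up to millions of pairs, this is how an existing English "library" becomes a Basque one without collecting new images.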

3. The Experiment: Two Different Chefs

The researchers tested two different "chefs" (AI backbones) to see who could learn Basque best:

  1. The English Specialist (Llama): A chef who only speaks English and knows the world, but has never heard of Basque.
  2. The Basque Native (Latxa): A chef who already speaks Basque fluently and knows the local culture.

They trained both chefs using a mix of English and Basque "recipes" (data) to see who would become the better Basque cook.

4. The Big Surprises (The Findings)

Surprise #1: You Don't Need a Full Basque Library
The researchers thought they needed a library that was 100% Basque to get good results. Instead, they found that just 20% Basque data mixed with 80% English data was enough to create a top-tier Basque chef.

  • The Analogy: It's like learning to cook a specific regional dish. You don't need to live in that region your whole life. If you have a great base of general cooking skills (English) and just a few specific local recipes (20% Basque), you can still make an amazing meal.
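The 20/80 mixture can be sketched as a simple sampling rule: each training example is drawn from the Basque pool with probability 0.2 and from the English pool otherwise. The pool contents and function names here are illustrative, not the paper's actual data loader.

```python
import random

def mix_datasets(basque_pool, english_pool, basque_ratio=0.2,
                 n_samples=10_000, seed=0):
    """Sample a training stream with roughly `basque_ratio` Basque examples."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        pool = basque_pool if rng.random() < basque_ratio else english_pool
        mixed.append(rng.choice(pool))
    return mixed

# Illustrative pools: (language tag, example id)
basque_pool = [("eu", f"eu_example_{i}") for i in range(100)]
english_pool = [("en", f"en_example_{i}") for i in range(100)]

mixed = mix_datasets(basque_pool, english_pool)
share_eu = sum(1 for lang, _ in mixed if lang == "eu") / len(mixed)
print(f"Basque share: {share_eu:.1%}")  # hovers around 20%
```

The point of the finding is that this small Basque share, riding on a large English majority, was enough for top-tier Basque performance.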

Surprise #2: The English Chef is Just as Good
They expected the "Basque Native" chef (Latxa) to be much better because they already knew the language. But the "English Specialist" (Llama) performed almost exactly the same!

  • The Analogy: It turns out that if you teach a generalist chef a few specific local recipes, they can cook the local dish just as well as a local chef who was born there. You don't need a native speaker to build a strong Basque AI; you just need a smart generalist with a few Basque instructions.

Surprise #3: Text-Only Practice Helps
They also found that if the chef practiced only writing in Basque (without looking at pictures), it actually helped them get better at looking at pictures and describing them in Basque.

  • The Analogy: It's like practicing your vocabulary by reading a book in a foreign language. Even if you aren't looking at pictures, reading the words helps your brain understand how to describe those pictures later.

5. The Conclusion: A Blueprint for the World

The main takeaway is that we don't need to build a massive, expensive, native-language AI from the ground up for every small language in the world.

Instead, we can take a powerful, general AI (like the English chef), give it a small dose of local data (20% Basque), and maybe some text-only practice, and it will become a strong, capable AI for that language.

Why does this matter?
This is a "recipe" that can be used for hundreds of other low-resource languages (like Welsh, Catalan, or indigenous languages) that currently get ignored by big tech. It opens the door for these languages to join the AI revolution without needing millions of dollars in data collection.

In short: You don't need a native speaker to teach an AI a new language; you just need a smart teacher and a few good textbooks.