MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

The paper introduces MedInjection-FR, a large-scale French biomedical instruction dataset combining native, synthetic, and translated sources. Through controlled experiments, it demonstrates that while native data yields the best performance, strategically mixing these sources effectively mitigates the scarcity of high-quality French medical instruction data for fine-tuning large language models.

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

Published Tue, 10 Ma

Imagine you are trying to teach a brilliant but inexperienced medical student (a Large Language Model) how to speak French and diagnose patients. The problem? There are very few textbooks written in French for this specific student. Most medical knowledge is in English, or it's locked away in private hospital records that no one can use.

This paper, MedInjection-FR, is like a master teacher who decides to build a custom curriculum using three different types of study materials to get the job done. They want to know: Which mix of materials makes the student the best doctor?

Here is the breakdown of their experiment using simple analogies:

1. The Three "Study Materials" (Data Sources)

The researchers created a massive library of 571,000 practice questions and answers. They divided these into three buckets:

  • The "Native" Bucket (The Real Deal):

    • What it is: Questions written by real French doctors and educators, taken from actual French medical exams and textbooks.
    • The Analogy: This is like studying with a local French professor. The language is perfect, the cultural context is right, and the medical facts are authentic.
    • Result: This was the most effective material. When the student studied only this, they performed the best.
  • The "Translated" Bucket (The Foreign Textbook):

    • What it is: Famous English medical questions that were automatically translated into French using advanced AI.
    • The Analogy: This is like taking a great American medical textbook and translating it. The medical facts are still good, but the phrasing might sound a little "off" or stiff, like someone speaking with a heavy accent.
    • Result: It was helpful, but not quite as good as the native material. However, when mixed with the native material, it helped the student see things from different angles.
  • The "Synthetic" Bucket (The AI-Generated Notes):

    • What it is: Questions and answers created entirely by AI, based on real medical case reports.
    • The Analogy: This is like having a very smart robot tutor that reads the textbooks and writes its own practice quizzes. It's creative and covers many topics, but sometimes the robot might hallucinate (make up facts) or use weird phrasing.
    • Result: On its own, this was the weakest material. The student got confused by the robot's occasional errors. But, when used alongside the real professor's notes, it added variety and helped the student handle tricky questions.

2. The Experiment: Mixing the Ingredients

The researchers didn't just pick one bucket; they tried mixing them in different recipes to see what happened. They tested seven different "diets" for the AI student:

  • Only Native
  • Only Translated
  • Only Synthetic
  • Native + Translated
  • Native + Synthetic
  • Translated + Synthetic
  • All three mixed together
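These seven "diets" are simply every non-empty combination of the three buckets. As a rough illustrative sketch (not the authors' actual code; the bucket names are placeholders), they can be enumerated like this:

```python
from itertools import combinations

# The three data buckets from the paper (names are illustrative).
buckets = ["native", "translated", "synthetic"]

# Every non-empty combination gives one training "diet":
# three single-bucket runs, three pairs, and one with all three.
diets = [
    list(combo)
    for size in range(1, len(buckets) + 1)
    for combo in combinations(buckets, size)
]

for diet in diets:
    print(" + ".join(diet))
```

Running this prints seven mixtures, matching the seven experimental conditions above.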

The Big Discovery:
The best recipe wasn't just the "Native" bucket, even though that was the strongest ingredient. The best performance came from mixing the Native material with the Translated material.

Think of it like cooking a stew:

  • Native data is the high-quality, fresh meat (the foundation).
  • Translated data is a rich broth that adds depth and volume.
  • Synthetic data is a spice. Too much spice ruins the dish, but a little bit adds complexity.

When they mixed the fresh meat (Native) with the rich broth (Translated), the stew tasted better than using just the meat alone. It made the AI more robust and able to handle a wider variety of questions.

3. The Grading System (Evaluation)

How did they know if the student was actually learning? They used three ways to grade:

  1. The Calculator (Automatic Metrics): Checking if the words match exactly. Problem: In medicine, you can say the same thing in different ways. A calculator might think two correct answers are different just because the words aren't identical.
  2. The Robot Judge (LLM-as-a-Judge): Using a super-smart AI to grade the answers. Problem: This robot sometimes liked long, wordy answers even if they weren't very accurate. It was easily fooled by "verbosity" (talking too much).
  3. The Human Doctor (Human Review): A real doctor read the answers. This was the gold standard.
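The "Calculator" problem can be seen in a toy example. The sketch below (illustrative only; the paper uses standard automatic metrics, not this function) shows how a strict word-matching score treats a correct paraphrase as a wrong answer:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Score a hit only if the normalized strings are identical."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(prediction) == normalize(reference)

# Hypothetical answers: both convey the same prescription.
reference = "Administer 500 mg of amoxicillin three times daily."
paraphrase = "Give amoxicillin 500 mg, 3x per day."

print(exact_match(reference, reference))   # identical wording matches
print(exact_match(paraphrase, reference))  # a correct paraphrase fails
```

This is why surface-level metrics alone are unreliable for grading medical answers, and why the researchers added an AI judge and human review.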

The Lesson: Of the automated options, the Robot Judge came closest to matching the Human Doctor's grades, but it carried a bias toward long answers. This taught the researchers that we need to be careful when using AI to grade medical AI, because "longer" doesn't always mean "better."

The Takeaway for the Real World

This paper solves a major problem: What do you do when you don't have enough data in your own language?

The answer is: Don't panic. You don't need millions of perfect, native-language examples to build a great medical AI.

  • You need a core of high-quality, native data (the foundation).
  • You can boost it with translated data from other languages (to add volume and diversity).
  • You can sprinkle in some AI-generated data (to fill gaps), as long as you keep the native data as the anchor.

In short, MedInjection-FR proves that you can build a world-class French medical AI even if French medical data is scarce, as long as you know how to mix your ingredients correctly.