MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

The paper introduces MedInjection-FR, a large-scale French biomedical instruction dataset combining native, synthetic, and translated sources. Through controlled experiments, it demonstrates that while native data yields the best performance, strategically mixing these sources effectively mitigates the scarcity of high-quality French medical instruction data for fine-tuning large language models.

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

Published Tue, 10 Ma

Imagine you are trying to teach a brilliant but inexperienced medical student (a Large Language Model) how to speak French and diagnose patients. The problem? There are very few textbooks written in French for this specific student. Most medical knowledge is in English, or it's locked away in private hospital records that no one can use.

This paper, MedInjection-FR, is like a master teacher who decides to build a custom curriculum using three different types of study materials to get the job done. They want to know: Which mix of materials makes the student the best doctor?

Here is the breakdown of their experiment using simple analogies:

1. The Three "Study Materials" (Data Sources)

The researchers created a massive library of 571,000 practice questions and answers. They divided these into three buckets:

  • The "Native" Bucket (The Real Deal):

    • What it is: Questions written by real French doctors and educators, taken from actual French medical exams and textbooks.
    • The Analogy: This is like studying with a local French professor. The language is perfect, the cultural context is right, and the medical facts are authentic.
    • Result: This was the most effective material. When the student studied only this, they performed the best.
  • The "Translated" Bucket (The Foreign Textbook):

    • What it is: Famous English medical questions that were automatically translated into French using advanced AI.
    • The Analogy: This is like taking a great American medical textbook and translating it. The medical facts are still good, but the phrasing might sound a little "off" or stiff, like someone speaking with a heavy accent.
    • Result: It was helpful, but not quite as good as the native material. However, when mixed with the native material, it helped the student see things from different angles.
  • The "Synthetic" Bucket (The AI-Generated Notes):

    • What it is: Questions and answers created entirely by AI, based on real medical case reports.
    • The Analogy: This is like having a very smart robot tutor that reads the textbooks and writes its own practice quizzes. It's creative and covers many topics, but sometimes the robot might hallucinate (make up facts) or use weird phrasing.
    • Result: On its own, this was the weakest material. The student got confused by the robot's occasional errors. But, when used alongside the real professor's notes, it added variety and helped the student handle tricky questions.

2. The Experiment: Mixing the Ingredients

The researchers didn't just pick one bucket; they tried mixing them in different recipes to see what happened. They tested seven different "diets" for the AI student:

  • Only Native
  • Only Translated
  • Only Synthetic
  • Native + Translated
  • Native + Synthetic
  • Translated + Synthetic
  • All three mixed together
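These seven "diets" are simply every non-empty combination of the three buckets. As a rough illustrative sketch (not the authors' actual code; the bucket names are placeholders), they can be enumerated like this:

```python
from itertools import combinations

# The three data buckets from the paper (names are illustrative).
buckets = ["native", "translated", "synthetic"]

# Every non-empty combination gives one training "diet":
# three single-bucket runs, three pairs, and one with all three.
diets = [
    list(combo)
    for size in range(1, len(buckets) + 1)
    for combo in combinations(buckets, size)
]

for diet in diets:
    print(" + ".join(diet))
```

Running this prints seven mixtures, matching the seven experimental conditions above.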

The Big Discovery:
The best recipe wasn't just the "Native" bucket, even though that was the strongest ingredient. The best performance came from mixing the Native material with the Translated material.

Think of it like cooking a stew:

  • Native data is the high-quality, fresh meat (the foundation).
  • Translated data is a rich broth that adds depth and volume.
  • Synthetic data is a spice. Too much spice ruins the dish, but a little bit adds complexity.

When they mixed the fresh meat (Native) with the rich broth (Translated), the stew tasted better than using just the meat alone. It made the AI more robust and able to handle a wider variety of questions.

3. The Grading System (Evaluation)

How did they know if the student was actually learning? They used three ways to grade:

  1. The Calculator (Automatic Metrics): Checking if the words match exactly. Problem: In medicine, you can say the same thing in different ways. A calculator might think two correct answers are different just because the words aren't identical.
  2. The Robot Judge (LLM-as-a-Judge): Using a super-smart AI to grade the answers. Problem: This robot sometimes liked long, wordy answers even if they weren't very accurate. It was easily fooled by "verbosity" (talking too much).
  3. The Human Doctor (Human Review): A real doctor read the answers. This was the gold standard.
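The "Calculator" problem can be seen in a toy example. The sketch below (illustrative only; the paper uses standard automatic metrics, not this function) shows how a strict word-matching score treats a correct paraphrase as a wrong answer:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Score a hit only if the normalized strings are identical."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(prediction) == normalize(reference)

# Hypothetical answers: both convey the same prescription.
reference = "Administer 500 mg of amoxicillin three times daily."
paraphrase = "Give amoxicillin 500 mg, 3x per day."

print(exact_match(reference, reference))   # identical wording matches
print(exact_match(paraphrase, reference))  # a correct paraphrase fails
```

This is why surface-level metrics alone are unreliable for grading medical answers, and why the researchers added an AI judge and human review.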

The Lesson: Of the automated options, the Robot Judge came closest to matching the Human Doctor's grades, but it carried a bias toward long answers. This taught the researchers that we need to be careful when using AI to grade medical AI, because "longer" doesn't always mean "better."

The Takeaway for the Real World

This paper solves a major problem: What do you do when you don't have enough data in your own language?

The answer is: Don't panic. You don't need millions of perfect, native-language examples to build a great medical AI.

  • You need a core of high-quality, native data (the foundation).
  • You can boost it with translated data from other languages (to add volume and diversity).
  • You can sprinkle in some AI-generated data (to fill gaps), as long as you keep the native data as the anchor.

In short, MedInjection-FR proves that you can build a world-class French medical AI even if French medical data is scarce, as long as you know how to mix your ingredients correctly.