Generating High Quality Synthetic Data for Dutch Medical Conversations

This paper presents a pipeline for generating synthetic Dutch medical dialogues with a fine-tuned Large Language Model. The authors find the approach feasible for expanding clinical NLP resources, but it currently struggles to produce natural conversational flow, and overcoming its limited lexical variety and expression requires careful balancing of domain knowledge and prompting.

Cecilia Kuan, Aditya Kamlesh Parikh, Henk van den Heuvel

Published 2026-04-14

Imagine you are trying to teach a robot how to be a doctor. To do this well, the robot needs to listen to thousands of real conversations between actual doctors and patients. It needs to learn how they talk, what words they use, and how they handle sensitive topics like kidney disease.

The Problem:
Real medical conversations are like gold dust. They are incredibly valuable, but they are locked away in a vault because of privacy laws (like GDPR). You can't just walk into a hospital and record patients; it's illegal and unethical. Without enough "gold dust" (real data), the robot stays clumsy and doesn't understand the nuances of human care.

The Solution:
The researchers in this paper decided to build a factory that makes fake gold. They created a system to generate "synthetic" (fake but realistic) medical conversations in Dutch. They wanted to see if they could trick the robot into thinking these fake conversations were real, so the robot could learn from them without ever seeing a real patient.

Here is how they did it, broken down into simple steps:

1. The Recipe (The Model)

They used a very smart AI brain (called an LLM) that had already been taught to speak Dutch. Think of this AI as a student who has read every Dutch book in the library but hasn't studied medicine yet.

  • The Teacher: They gave the AI a few real examples of doctor-patient chats (like showing a student a few sample essays before asking them to write one).
  • The Assignment: They told the AI, "Write a conversation about kidney disease. Make sure the doctor sounds professional and the patient sounds worried. Cover these four topics: symptoms, medicine, lifestyle, and lab results."
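The few-shot setup above can be sketched as a simple prompt builder. This is a hypothetical illustration, not the authors' actual code: the function name, the example dialogue, the topic list, and the instruction wording are all assumptions about what such a prompt might look like.

```python
def build_prompt(example_dialogues, topics):
    """Assemble a few-shot prompt for generating a synthetic
    doctor-patient dialogue about kidney disease (illustrative only)."""
    parts = ["Here are sample doctor-patient conversations:\n"]
    for i, dialogue in enumerate(example_dialogues, start=1):
        parts.append(f"Example {i}:\n{dialogue}\n")
    # The "assignment": constraints on tone, language, and topic coverage.
    parts.append(
        "Now write a new conversation in Dutch about kidney disease. "
        "The doctor should sound professional and the patient worried. "
        "Cover these topics: " + ", ".join(topics) + "."
    )
    return "\n".join(parts)

examples = [
    "Dokter: Goedemorgen. Hoe voelt u zich?\n"
    "Patiënt: Ik ben erg moe de laatste tijd.",
]
topics = ["symptoms", "medicine", "lifestyle", "lab results"]
prompt = build_prompt(examples, topics)
```

The resulting string would then be sent to the Dutch-capable LLM, which completes it with a new dialogue.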

2. The Factory Floor (The Process)

The AI started churning out conversations. They asked it to write nine different "fake" dialogues.

  • The Goal: To create a dataset that is so realistic it could be used to train other AI tools, all while keeping real patient secrets safe.

3. The Taste Test (The Evaluation)

Once the factory produced the fake conversations, the researchers had to check if they were any good. They used two different methods:

Method A: The Robot Judge (Quantitative)
They used computer programs to count things.

  • Did the speakers take turns evenly? Yes, almost perfectly.
  • Did they use medical words? Yes, they used a lot of them.
  • Verdict: The computer gave high scores. It looked like the data was perfect.
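Automatic checks like these boil down to counting. A minimal sketch of what such a "robot judge" might compute, assuming each dialogue line is prefixed with a speaker label and using a made-up medical term list (the paper's actual metrics and term lists are not shown here):

```python
import re
from collections import Counter

def dialogue_stats(dialogue, medical_terms):
    """Count turns per speaker and occurrences of known medical terms."""
    turns = Counter()
    term_hits = 0
    for line in dialogue.strip().splitlines():
        speaker, _, utterance = line.partition(":")
        turns[speaker.strip()] += 1
        words = re.findall(r"\w+", utterance.lower())
        term_hits += sum(1 for w in words if w in medical_terms)
    return turns, term_hits

sample = """Dokter: Uw nierfunctie is iets verslechterd.
Patiënt: Is dat ernstig, dokter?
Dokter: We houden uw creatinine in de gaten.
Patiënt: Goed, dank u."""

terms = {"nierfunctie", "creatinine", "dialyse"}
turns, hits = dialogue_stats(sample, terms)
# turns are perfectly balanced (2 each) and two medical terms appear,
# so a purely count-based judge would score this dialogue well.
```

Metrics like these reward balance and vocabulary density, which is exactly why they can give high scores to dialogues that still feel scripted to humans.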

Method B: The Human Judges (Qualitative)
They hired real Dutch speakers, including actual doctors, to read the conversations.

  • The Reality Check: The humans were not impressed. They said the conversations felt "scripted" and stiff.
  • The "Uncanny Valley" Effect: Imagine a wax figure that looks almost human but moves slightly wrong. That's what the AI conversations felt like. The doctors said, "This doesn't sound like a real person talking. It sounds like a robot reading a script."
  • Specific Complaints: The AI kept saying "Hello" and "Goodbye" too many times (like restarting the conversation every time a new topic started). The sentences were too long and perfect, lacking the natural "ums," "ahs," and interruptions of real life.

The Big Lesson

The most interesting part of this paper is the mismatch between the two judges.

  • The Robot Judge said: "Great job! 10/10!"
  • The Human Judge said: "Meh. 2/5. It feels fake."

This teaches us a vital lesson: You can't just count words to measure quality. Just because a conversation has the right number of words and medical terms doesn't mean it feels human.

The Takeaway

The researchers concluded that while they can build this factory to make fake medical data, the current version is a bit too robotic. It's like a mannequin that has the right clothes but no soul.

To make it truly useful, they need to:

  1. Tweak the instructions (Prompt Engineering): Tell the AI to be messier, more natural, and less perfect.
  2. Train it better: Give the AI more specific medical training so it doesn't sound like it's translating from English.

In a nutshell: They successfully built a machine to print fake medical conversations, but the conversations still sound a bit like a robot trying to act human. It's a promising start, but the robot still has a lot of homework to do before it can truly replace real data for training medical AI.
