Generating High Quality Synthetic Data for Dutch Medical Conversations

This paper presents a pipeline for generating synthetic Dutch medical dialogues with a fine-tuned Large Language Model. The authors find the approach feasible for expanding clinical NLP resources, but it currently struggles to produce natural conversational flow, and overcoming its limited lexical variety and expression requires careful balancing of domain knowledge and prompting.

Cecilia Kuan, Aditya Kamlesh Parikh, Henk van den Heuvel

Published 2026-04-14

Imagine you are trying to teach a robot how to be a doctor. To do this well, the robot needs to listen to thousands of real conversations between actual doctors and patients. It needs to learn how they talk, what words they use, and how they handle sensitive topics like kidney disease.

The Problem:
Real medical conversations are like gold dust. They are incredibly valuable, but they are locked away in a vault because of privacy laws (like GDPR). You can't just walk into a hospital and record patients; it's illegal and unethical. Without enough "gold dust" (real data), the robot stays clumsy and doesn't understand the nuances of human care.

The Solution:
The researchers in this paper decided to build a factory that makes fake gold. They created a system to generate "synthetic" (fake but realistic) medical conversations in Dutch. They wanted to see if they could trick the robot into thinking these fake conversations were real, so the robot could learn from them without ever seeing a real patient.

Here is how they did it, broken down into simple steps:

1. The Recipe (The Model)

They used a very smart AI brain (called an LLM) that had already been taught to speak Dutch. Think of this AI as a student who has read every Dutch book in the library but hasn't studied medicine yet.

  • The Teacher: They gave the AI a few real examples of doctor-patient chats (like showing a student a few sample essays before asking them to write one).
  • The Assignment: They told the AI, "Write a conversation about kidney disease. Make sure the doctor sounds professional and the patient sounds worried. Cover these four topics: symptoms, medicine, lifestyle, and lab results."
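The few-shot setup above can be sketched as a simple prompt builder. This is a hypothetical illustration, not the authors' actual code: the function name, the example dialogue, the topic list, and the instruction wording are all assumptions about what such a prompt might look like.

```python
def build_prompt(example_dialogues, topics):
    """Assemble a few-shot prompt for generating a synthetic
    doctor-patient dialogue about kidney disease (illustrative only)."""
    parts = ["Here are sample doctor-patient conversations:\n"]
    for i, dialogue in enumerate(example_dialogues, start=1):
        parts.append(f"Example {i}:\n{dialogue}\n")
    # The "assignment": constraints on tone, language, and topic coverage.
    parts.append(
        "Now write a new conversation in Dutch about kidney disease. "
        "The doctor should sound professional and the patient worried. "
        "Cover these topics: " + ", ".join(topics) + "."
    )
    return "\n".join(parts)

examples = [
    "Dokter: Goedemorgen. Hoe voelt u zich?\n"
    "Patiënt: Ik ben erg moe de laatste tijd.",
]
topics = ["symptoms", "medicine", "lifestyle", "lab results"]
prompt = build_prompt(examples, topics)
```

The resulting string would then be sent to the Dutch-capable LLM, which completes it with a new dialogue.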

2. The Factory Floor (The Process)

The AI started churning out conversations. They asked it to write nine different "fake" dialogues.

  • The Goal: To create a dataset that is so realistic it could be used to train other AI tools, all while keeping real patient secrets safe.

3. The Taste Test (The Evaluation)

Once the factory produced the fake conversations, the researchers had to check if they were any good. They used two different methods:

Method A: The Robot Judge (Quantitative)
They used computer programs to count things.

  • Did the speakers take turns evenly? Yes, almost perfectly.
  • Did they use medical words? Yes, they used a lot of them.
  • Verdict: The computer gave high scores. It looked like the data was perfect.
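Automatic checks like these boil down to counting. A minimal sketch of what such a "robot judge" might compute, assuming each dialogue line is prefixed with a speaker label and using a made-up medical term list (the paper's actual metrics and term lists are not shown here):

```python
import re
from collections import Counter

def dialogue_stats(dialogue, medical_terms):
    """Count turns per speaker and occurrences of known medical terms."""
    turns = Counter()
    term_hits = 0
    for line in dialogue.strip().splitlines():
        speaker, _, utterance = line.partition(":")
        turns[speaker.strip()] += 1
        words = re.findall(r"\w+", utterance.lower())
        term_hits += sum(1 for w in words if w in medical_terms)
    return turns, term_hits

sample = """Dokter: Uw nierfunctie is iets verslechterd.
Patiënt: Is dat ernstig, dokter?
Dokter: We houden uw creatinine in de gaten.
Patiënt: Goed, dank u."""

terms = {"nierfunctie", "creatinine", "dialyse"}
turns, hits = dialogue_stats(sample, terms)
# turns are perfectly balanced (2 each) and two medical terms appear,
# so a purely count-based judge would score this dialogue well.
```

Metrics like these reward balance and vocabulary density, which is exactly why they can give high scores to dialogues that still feel scripted to humans.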

Method B: The Human Judges (Qualitative)
They hired real Dutch speakers, including actual doctors, to read the conversations.

  • The Reality Check: The humans were not impressed. They said the conversations felt "scripted" and stiff.
  • The "Uncanny Valley" Effect: Imagine a wax figure that looks almost human but moves slightly wrong. That's what the AI conversations felt like. The doctors said, "This doesn't sound like a real person talking. It sounds like a robot reading a script."
  • Specific Complaints: The AI kept saying "Hello" and "Goodbye" too many times (like restarting the conversation every time a new topic started). The sentences were too long and perfect, lacking the natural "ums," "ahs," and interruptions of real life.

The Big Lesson

The most interesting part of this paper is the mismatch between the two judges.

  • The Robot Judge said: "Great job! 10/10!"
  • The Human Judge said: "Meh. 2/5. It feels fake."

This teaches us a vital lesson: You can't just count words to measure quality. Just because a conversation has the right number of words and medical terms doesn't mean it feels human.

The Takeaway

The researchers concluded that while they can build this factory to make fake medical data, the current version is a bit too robotic. It's like a mannequin that has the right clothes but no soul.

To make it truly useful, they need to:

  1. Tweak the instructions (Prompt Engineering): Tell the AI to be messier, more natural, and less perfect.
  2. Train it better: Give the AI more specific medical training so it doesn't sound like it's translating from English.

In a nutshell: They successfully built a machine to print fake medical conversations, but the conversations still sound a bit like a robot trying to act human. It's a promising start, but the robot still has a lot of homework to do before it can truly replace real data for training medical AI.
