SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

SynDocDis is a novel, privacy-preserving framework that leverages large language models and structured metadata to generate high-quality, clinically accurate synthetic dialogues between physicians, effectively addressing data access limitations while demonstrating strong performance in medical education and decision support applications.

Beny Rubinstein, Sérgio Matos

Published 2026-04-13

Imagine you are trying to teach a robot how to be a doctor. You want the robot to learn how real doctors talk to each other when they are confused about a patient's case, debating treatments, and sharing their expertise.

The problem? Real doctor conversations are like top-secret vaults. They contain sensitive patient information, and strict privacy laws (like HIPAA in the US or GDPR in Europe) lock those vaults tight. Doctors are also often afraid to share their raw thoughts because they worry about being judged or sued if they make a mistake.

So, how do you train the robot without breaking the law or hurting anyone's feelings?

Enter SynDocDis, a new "recipe" created by researchers Beny Rubinstein and Sérgio Matos. Think of it as a high-tech ghostwriter that writes fake doctor conversations that sound 100% real but contain zero real secrets.

Here is how it works, broken down into simple analogies:

1. The "Skeleton" vs. The "Flesh"

Usually, to write a clinical story, you need the whole body: skeleton and flesh. But for privacy reasons, you can't use the whole body.

  • The Old Way: Trying to use real patient records is like trying to build a house using the actual bricks from someone else's home. It's dangerous and illegal.
  • The SynDocDis Way: Instead of the whole house, the researchers take just the blueprint (the metadata). They strip away the patient's name, address, and face, leaving only the "bones" of the case: Patient is a 69-year-old male with a specific type of cancer. He had surgery. Now we need to decide on medication.
  • The Magic: They feed these "bones" into a powerful AI (a Large Language Model). The AI acts like a master sculptor, using the blueprint to grow new "flesh" (the conversation) around it. The result is a brand-new, synthetic conversation that never happened, but feels exactly like it did.
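The "bones" step above can be sketched in a few lines of code. This is a hypothetical illustration, not the paper's actual implementation: the field names, the example diagnosis, and the `build_case_summary` helper are all invented to show what a de-identified metadata skeleton might look like before it is handed to an LLM.

```python
# Hypothetical sketch of the metadata "skeleton" described above.
# All field names and values are illustrative, not from the paper.

case_metadata = {
    "age": 69,
    "sex": "male",
    "diagnosis": "prostate cancer",          # example condition only
    "history": ["radical surgery"],
    "open_question": "choice of adjuvant medication",
}

def build_case_summary(meta):
    """Turn the de-identified skeleton into a short text seed for the LLM."""
    history = ", ".join(meta["history"])
    return (
        f"Patient is a {meta['age']}-year-old {meta['sex']} with "
        f"{meta['diagnosis']}. Prior care: {history}. "
        f"Open question: {meta['open_question']}."
    )

print(build_case_summary(case_metadata))
```

Note that nothing identifying survives this step: no name, no dates, no location, just the clinical facts needed to seed a realistic discussion.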

2. The "Director's Script" (CIDI Framework)

You can't just tell the AI, "Write a chat between doctors." It might sound robotic or make things up.
The researchers gave the AI a very specific script called CIDI (Context, Instructions, Details, Input).

  • The Analogy: Imagine a movie director talking to actors. Instead of saying, "Just act natural," the director says: "You are Dr. Smith, a grumpy but brilliant oncologist. You need to challenge Dr. Jones's idea about the chemotherapy. Use big medical words, but keep it clear. Oh, and make sure you cite a study from 2023."
  • The AI follows these strict instructions to ensure the fake conversation sounds like a real, high-level medical debate, complete with disagreements, clarifications, and expert advice.
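A CIDI-style prompt can be pictured as four labeled sections stacked on top of each other. The sketch below is an assumption about how such a prompt might be assembled; the section wording, the `build_cidi_prompt` helper, and the example case are invented for illustration, and the paper's exact templates may differ.

```python
# Illustrative assembly of a CIDI (Context, Instructions, Details, Input)
# prompt, as described above. Section text is invented for illustration.

def build_cidi_prompt(context, instructions, details, case_input):
    """Stack the four CIDI sections into a single prompt string."""
    sections = [
        ("Context", context),
        ("Instructions", instructions),
        ("Details", details),
        ("Input", case_input),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)

prompt = build_cidi_prompt(
    context="You are simulating a case discussion between two oncologists.",
    instructions="Dr. A proposes a treatment; Dr. B challenges it with evidence.",
    details="Use precise clinical terminology and keep the exchange collegial.",
    case_input="69-year-old male, post-surgical, adjuvant therapy undecided.",
)
# `prompt` would then be sent to an LLM chat endpoint.
print(prompt)
```

The point of the structure is exactly the director analogy: each section constrains a different failure mode (who is talking, what they must do, how they must say it, and which case they are discussing).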

3. The "Taste Test"

After the AI wrote 9 different fake medical debates (mostly about cancer and liver issues), the researchers didn't just trust the computer. They called in five real doctors to taste-test the food.

  • The Review: These doctors read the fake chats and rated them on a scale of 1 to 5.
  • The Verdict: The results were delicious!
    • Communication: The fake doctors sounded incredibly natural (4.4 out of 5). They used the right jargon, listened to each other, and argued politely.
    • Medical Accuracy: The advice given was mostly correct and relevant (4.1 out of 5).
    • Privacy: Not a single real patient was mentioned. The "ghost" conversations were safe to share.
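The scoring above is just a per-dimension average of five reviewers' 1-to-5 ratings. The snippet below shows that arithmetic with invented scores; the individual numbers are not from the paper (only the published means, 4.4 and 4.1, are), so the printed values are illustrative.

```python
# Averaging five reviewers' 1-to-5 scores per dimension.
# Individual scores below are invented for illustration; the paper
# reports only the resulting means.
from statistics import mean

ratings = {
    "communication": [5, 4, 5, 4, 4],
    "medical_accuracy": [4, 4, 5, 4, 4],
}

for dimension, scores in ratings.items():
    print(f"{dimension}: {mean(scores):.1f} / 5")
```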

Why Does This Matter?

Think of this framework as a safe training ground.

  • For AI: It gives robots a massive library of "practice games" to learn how to be better medical assistants without ever seeing a real patient's private data.
  • For Doctors: It creates a way to share knowledge and debate tricky cases without fear of legal trouble.
  • For Students: It provides realistic scenarios for medical students to learn how to think like a specialist.

The One Catch

The researchers admitted that sometimes the AI's "references" (the studies it cites) were a little old, and sometimes the fake doctors didn't argue with each other as much as real ones do. It's like a new actor who knows the lines perfectly but hasn't quite mastered the art of improvising a wild argument yet. But with more practice and better "scripts," this is expected to get even better.

In a nutshell: SynDocDis is a privacy-safe magic trick. It takes the essence of real medical problems and uses AI to conjure up realistic, safe, and educational conversations that help us build better medical tools for the future.
