Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes

This study demonstrates that supervised continued pretraining of a 4B-parameter LLM on 500,000 de-identified clinical notes significantly improves its performance on real-world medical decision-making tasks and specific clinical benchmarks, surpassing much larger models that lack such training while largely retaining general-domain knowledge.

Weissenbacher, D., Shabbir, M., Campbell, I. M., Berdahl, C. T., Gonzalez-Hernandez, G.

Published 2026-04-04

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a brilliant, well-read student named Qwen. This student has read almost every book, news article, and Wikipedia page on the internet. They are incredibly smart at writing essays, solving riddles, and chatting about history. However, if you ask them to diagnose a patient or write a doctor's note, they sound like a very polite, very confused librarian who has never set foot in a hospital. They know the words, but they don't know the context or the urgency of real-life medicine.

This paper is about a team of researchers who decided to give Qwen a "medical residency."

The Problem: The "Book Smarts" Gap

The researchers started with a big problem: Large Language Models (LLMs) like Qwen are great at general knowledge, but they lack professional medical know-how. Why? Because the best medical data—real patient notes from hospitals—is locked behind privacy doors. You can't just download it from the internet.

So, the doctors at Cedars-Sinai decided to open their own "library" of 500,000 de-identified (anonymous) patient notes and let Qwen study them.

The Training: "The Shadow Residency"

Think of the training process like a shadow residency.

  1. The Setup: The researchers took a real patient's story (symptoms, test results, physical exam) and gave it to Qwen.
  2. The Task: They asked Qwen to write the "Medical Decision Making" (MDM) section. This is the part of the doctor's note where they explain their thinking: "The patient has chest pain, the EKG shows X, so I think it's Y, and here is my plan."
  3. The Correction: Qwen wrote its version. The researchers compared it to what a real, board-certified emergency doctor had actually written. If Qwen sounded too robotic, too vague, or made up facts, the model got a "red pen" correction (mathematically speaking, a loss function).
  4. The Repetition: They did this 500,000 times.

The Analogy: Imagine Qwen is a chef who has read every cookbook in the world but has never cooked a meal. The researchers gave them 500,000 recipes written by master chefs, but only showed them the ingredients list. Qwen had to guess the final dish. Every time Qwen guessed wrong, the master chef (the real doctor's note) showed them the correct dish. Eventually, Qwen learned not just the ingredients, but the style and logic of a master chef.
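The training recipe described in steps 1-3 is, at its core, next-token prediction where only the MDM section is scored. A minimal pure-Python sketch of that masking idea follows; the helper names and the toy cross-entropy are illustrative, not taken from the paper, and a real run would use a deep-learning framework:

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label value

def build_training_example(context_ids, mdm_ids):
    """The model reads the whole note (patient context + MDM section),
    but only the MDM tokens are supervised: context positions get
    IGNORE_INDEX labels so they contribute nothing to the loss."""
    input_ids = list(context_ids) + list(mdm_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(mdm_ids)
    return input_ids, labels

def masked_next_token_loss(logits, labels):
    """Average cross-entropy over supervised positions only.
    logits[t] holds the model's scores for the token at position t+1
    (standard next-token prediction)."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # context token: the "red pen" never marks it
        scores = logits[t]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / max(count, 1)
```

With uniform scores over a 3-token vocabulary, every supervised position costs exactly log(3), which is a handy sanity check that the masking is working.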

The Results: Did the Student Pass?

The researchers tested Qwen in three ways:

1. The Style Test (Human Review)
Two real doctors read the notes Qwen wrote.

  • The Result: They loved them! The notes sounded professional, concise, and "human." They were actually better than the notes written by the base model (the untrained Qwen), which tended to ramble on like a nervous student listing every possible disease.
  • The Catch: The trained Qwen sometimes got too brief, mimicking the real doctors' habit of skipping details to save time. It also occasionally made up small facts (hallucinations), just like a tired human doctor might.

2. The "Diagnosis" Test
They asked Qwen to look at a patient's story and guess the diagnosis.

  • The Result: Qwen got much better at this. It beat not only its own untrained self but also a much larger, more powerful model (Llama-3.1-405B) that hadn't seen any real patient notes. It proved that specialized training beats raw size in this specific context.

3. The "Cardiac Arrest" Test
They asked Qwen to find notes describing cardiac arrest in a pile of documents. This is a very different task from writing a full note.

  • The Result: At first, Qwen got confused and started answering "cardiac arrest" for everything (a failure mode called "label collapse"). But after a quick, targeted "refresher course" on just this task, it became the best at it, beating even the giant models.
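Label collapse is easy to detect before deciding whether a "refresher course" is needed: if one label dominates the model's predictions on a held-out set, something is wrong. A small illustrative check (the function name and 95% threshold are assumptions, not from the paper):

```python
from collections import Counter

def check_label_collapse(predictions, threshold=0.95):
    """Flag when a single label dominates the model's outputs, e.g. the
    model answering "cardiac arrest" for nearly every note.

    predictions: list of labels the model emitted on a held-out set.
    Returns (collapsed?, dominant_label, fraction)."""
    counts = Counter(predictions)
    label, n = counts.most_common(1)[0]
    frac = n / len(predictions)
    return frac >= threshold, label, frac
```

If the check fires, the fix described above is a short, targeted fine-tune on a balanced mix of positive and negative examples for that one task.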

The Side Effects: What Did Qwen Forget?

When you teach a student a new skill, sometimes they forget an old one. The researchers worried Qwen might forget how to do math or answer general questions.

  • The Good News: Qwen kept most of its general knowledge. It didn't turn into a "medical robot" that couldn't talk about the weather or write a poem.
  • The Bad News: Qwen got a bit "lazy" at thinking. Before training, Qwen would show its work step-by-step (like a math student showing their calculations). After training, it started giving answers without showing its work. It became faster but less transparent. It also started repeating itself more often, like a broken record.
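The "broken record" effect can be quantified with a simple repeated n-gram rate over generated tokens; a higher score means more repetition. This metric is a common illustration, not the paper's actual evaluation:

```python
def repetition_rate(tokens, n=3):
    """Fraction of n-grams in the output that are repeats of an earlier
    n-gram. 0.0 means no repeated n-grams; values near 1.0 indicate the
    model is looping on the same phrases."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1 - len(set(ngrams)) / len(ngrams)
```

Tracking a metric like this on sampled generations, alongside general-knowledge benchmarks, is one way to catch these side effects during training rather than after.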

The Big Takeaway

This paper proves that you can take a smart, general-purpose AI and turn it into a medical specialist by feeding it real-world hospital notes.

  • The Win: The AI learned to think and write like a doctor, improving its ability to diagnose and make decisions.
  • The Warning: If you aren't careful, the AI might start "cheating" by skipping its reasoning steps or repeating the same wrong answer.

In a nutshell: The researchers built a "medical school" for an AI. The AI graduated with honors in clinical reasoning, but it needs to be reminded to show its homework and not get too repetitive. This is a huge step toward having AI that can actually help doctors in the real world, rather than just reciting medical textbooks.
