Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes

This study demonstrates that supervised continued pretraining of a 4B-parameter LLM on 500,000 de-identified clinical notes significantly improves its performance on real-world medical decision-making tasks and specific clinical benchmarks, surpassing much larger models that lack such training while largely retaining general-domain knowledge.

Weissenbacher, D., Shabbir, M., Campbell, I. M., Berdahl, C. T., Gonzalez-Hernandez, G.

Published 2026-04-04

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a brilliant, well-read student named Qwen. This student has read almost every book, news article, and Wikipedia page on the internet. They are incredibly smart at writing essays, solving riddles, and chatting about history. However, if you ask them to diagnose a patient or write a doctor's note, they sound like a very polite, very confused librarian who has never set foot in a hospital. They know the words, but they don't know the context or the urgency of real-life medicine.

This paper is about a team of researchers who decided to give Qwen a "medical residency."

The Problem: The "Book Smarts" Gap

The researchers started with a big problem: Large Language Models (LLMs) like Qwen are great at general knowledge, but they lack professional medical know-how. Why? Because the best medical data—real patient notes from hospitals—is locked behind privacy doors. You can't just download it from the internet.

So, the doctors at Cedars-Sinai decided to open their own "library" of 500,000 de-identified (anonymous) patient notes and let Qwen study them.

The Training: "The Shadow Residency"

Think of the training process like a shadow residency.

  1. The Setup: The researchers took a real patient's story (symptoms, test results, physical exam) and gave it to Qwen.
  2. The Task: They asked Qwen to write the "Medical Decision Making" (MDM) section. This is the part of the doctor's note where they explain their thinking: "The patient has chest pain, the EKG shows X, so I think it's Y, and here is my plan."
  3. The Correction: Qwen wrote its version. The researchers compared it to what a real, board-certified emergency doctor had actually written. If Qwen sounded too robotic, too vague, or made up facts, the model got a "red pen" correction (mathematically speaking, a loss function).
  4. The Repetition: They did this 500,000 times.

The Analogy: Imagine Qwen is a chef who has read every cookbook in the world but has never cooked a meal. The researchers gave them 500,000 recipes written by master chefs, but only showed them the ingredients list. Qwen had to guess the final dish. Every time Qwen guessed wrong, the master chef (the real doctor's note) showed them the correct dish. Eventually, Qwen learned not just the ingredients, but the style and logic of a master chef.
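The training recipe described in steps 1-3 is, at its core, next-token prediction where only the MDM section is scored. A minimal pure-Python sketch of that masking idea follows; the helper names and the toy cross-entropy are illustrative, not taken from the paper, and a real run would use a deep-learning framework:

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label value

def build_training_example(context_ids, mdm_ids):
    """The model reads the whole note (patient context + MDM section),
    but only the MDM tokens are supervised: context positions get
    IGNORE_INDEX labels so they contribute nothing to the loss."""
    input_ids = list(context_ids) + list(mdm_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(mdm_ids)
    return input_ids, labels

def masked_next_token_loss(logits, labels):
    """Average cross-entropy over supervised positions only.
    logits[t] holds the model's scores for the token at position t+1
    (standard next-token prediction)."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # context token: the "red pen" never marks it
        scores = logits[t]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / max(count, 1)
```

With uniform scores over a 3-token vocabulary, every supervised position costs exactly log(3), which is a handy sanity check that the masking is working.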

The Results: Did the Student Pass?

The researchers tested Qwen in three ways:

1. The Style Test (Human Review)
Two real doctors read the notes Qwen wrote.

  • The Result: They loved them! The notes sounded professional, concise, and "human." They were actually better than the notes written by the base model (the untrained Qwen), which tended to ramble on like a nervous student listing every possible disease.
  • The Catch: The trained Qwen sometimes got too brief, mimicking the real doctors' habit of skipping details to save time. It also occasionally made up small facts (hallucinations), just like a tired human doctor might.

2. The "Diagnosis" Test
They asked Qwen to look at a patient's story and guess the diagnosis.

  • The Result: Qwen got much better at this. It beat not only its own untrained self but also a much larger, more powerful model (Llama-3.1-405B) that hadn't seen any real patient notes. It proved that specialized training beats raw size in this specific context.

3. The "Cardiac Arrest" Test
They asked Qwen to find notes describing cardiac arrest in a pile of documents. This is a very different task from writing a full note.

  • The Result: At first, Qwen got confused and started answering "cardiac arrest" for everything (a failure mode called "label collapse"). But after a quick, targeted "refresher course" on just this task, it became the best at it, beating even the giant models.
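Label collapse is easy to detect before deciding whether a "refresher course" is needed: if one label dominates the model's predictions on a held-out set, something is wrong. A small illustrative check (the function name and 95% threshold are assumptions, not from the paper):

```python
from collections import Counter

def check_label_collapse(predictions, threshold=0.95):
    """Flag when a single label dominates the model's outputs, e.g. the
    model answering "cardiac arrest" for nearly every note.

    predictions: list of labels the model emitted on a held-out set.
    Returns (collapsed?, dominant_label, fraction)."""
    counts = Counter(predictions)
    label, n = counts.most_common(1)[0]
    frac = n / len(predictions)
    return frac >= threshold, label, frac
```

If the check fires, the fix described above is a short, targeted fine-tune on a balanced mix of positive and negative examples for that one task.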

The Side Effects: What Did Qwen Forget?

When you teach a student a new skill, sometimes they forget an old one. The researchers worried Qwen might forget how to do math or answer general questions.

  • The Good News: Qwen kept most of its general knowledge. It didn't turn into a "medical robot" that couldn't talk about the weather or write a poem.
  • The Bad News: Qwen got a bit "lazy" at thinking. Before training, Qwen would show its work step-by-step (like a math student showing their calculations). After training, it started giving answers without showing its work. It became faster but less transparent. It also started repeating itself more often, like a broken record.
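The "broken record" effect can be quantified with a simple repeated n-gram rate over generated tokens; a higher score means more repetition. This metric is a common illustration, not the paper's actual evaluation:

```python
def repetition_rate(tokens, n=3):
    """Fraction of n-grams in the output that are repeats of an earlier
    n-gram. 0.0 means no repeated n-grams; values near 1.0 indicate the
    model is looping on the same phrases."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1 - len(set(ngrams)) / len(ngrams)
```

Tracking a metric like this on sampled generations, alongside general-knowledge benchmarks, is one way to catch these side effects during training rather than after.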

The Big Takeaway

This paper proves that you can take a smart, general-purpose AI and turn it into a medical specialist by feeding it real-world hospital notes.

  • The Win: The AI learned to think and write like a doctor, improving its ability to diagnose and make decisions.
  • The Warning: If you aren't careful, the AI might start "cheating" by skipping its reasoning steps or repeating the same wrong answer.

In a nutshell: The researchers built a "medical school" for an AI. The AI graduated with honors in clinical reasoning, but it needs to be reminded to show its homework and not get too repetitive. This is a huge step toward having AI that can actually help doctors in the real world, rather than just reciting medical textbooks.
