Regression vs. Medical LLMs: A Comprehensive Study for CVD and Mortality Risk Prediction

This study evaluates traditional regression models against medical large language models (MedLLMs) for predicting cardiovascular disease and mortality risk on the LURIC dataset. It finds that while optimized MedLLMs and boosting techniques achieve competitive AUROCs of up to 85%, MedLLMs require calibration adjustments to correct systematic over-prediction.

KOM SANDE, S. D., Skorski, M., Theobald, M., Schneider, J., März, W.

Published 2026-03-11

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a doctor trying to predict which of your heart patients might face a serious health crisis in the next year. Traditionally, you've relied on a set of mathematical recipes (called regression models) that look at a patient's numbers—like cholesterol levels, blood pressure, and age—to calculate a risk score. These recipes have been the gold standard for decades, but they are rigid; they follow strict rules and can't really "think" about the bigger picture.

Now, a new contender has entered the arena: Medical Large Language Models (MedLLMs). Think of these not as calculators, but as super-smart medical interns who have read every medical textbook, journal, and case study in existence. They don't just crunch numbers; they understand the story behind the numbers.

This paper is a massive "cook-off" to see who wins: the old-school mathematical recipes or the new-school AI interns.

The Setup: The "LURIC" Kitchen

The researchers used a giant, well-stocked pantry of data called the LURIC study. It contains detailed records of over 3,300 heart patients from Germany.

  • The Ingredients: Instead of using expensive, hard-to-get doctor's notes (like a long essay written by a physician), they used routine blood test results and basic health stats. Think of these as the standard ingredients you find in any kitchen: cholesterol, kidney function, age, smoking history, etc.
  • The Goal: Predict if a patient will pass away within one year (a very serious prediction, but crucial for knowing who needs extra care).

The Contestants

1. The Old Guard (Regression Models)
These are the veteran chefs: classical regression models alongside gradient-boosting algorithms like CatBoost and XGBoost.

  • How they work: They are like a master chef who has memorized thousands of recipes. If "Cholesterol is high + Age is over 60," they instantly know the risk is high. They are fast, reliable, and very good at handling structured numbers.
  • The Result: They performed excellently, reaching a discrimination score (AUROC, the area under the ROC curve) of about 85%. Roughly speaking, that means the model ranks a randomly chosen high-risk patient above a randomly chosen low-risk one about 85% of the time. This was the benchmark everyone else had to beat.
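To make the veterans' workflow concrete, here is a minimal sketch of training a boosted-tree model on tabular data and scoring it with AUROC. It uses scikit-learn's GradientBoostingClassifier and synthetic data as stand-ins; the paper itself used CatBoost/XGBoost on real LURIC records, so everything below is illustrative.

```python
# Sketch: how a boosted-tree model earns its AUROC on tabular patient data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "patients": 10 numeric features standing in for labs and vitals,
# with a rare positive class (~10%), like a serious 1-year outcome.
X, y = make_classification(n_samples=3300, n_features=10,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]  # predicted probability of the outcome

# AUROC: 0.50 is coin-flipping, 1.00 is perfect ranking of risky patients.
print(f"AUROC: {roc_auc_score(y_te, risk):.2f}")
```

The key point is that boosting works directly on the structured numbers, with no text or prompting involved.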

2. The New Kids (Medical LLMs)
These are the AI interns. The researchers tested two ways to use them:

  • The "Prompting" Method (Zero/Few-Shot): Imagine handing the AI intern a list of patient numbers and saying, "Here is a patient's data. Based on your training, what is their risk?"
    • The Twist: The researchers tried giving the AI a few examples first (like showing it three other patients and their outcomes) before asking about the new one. This is called "few-shot prompting."
    • The Result: The big AI models (like the 70-billion-parameter ones) did surprisingly well, reaching 82%. They were almost as good as the veteran chefs, but they needed a little help (the examples) to get there.
  • The "Finetuning" Method: Imagine taking a smart intern and giving them a crash course specifically on this hospital's data.
    • The Result: Smaller AI models (8-billion parameters) that were "finetuned" actually beat the big models and even surpassed some commercial giants like ChatGPT. They reached 82-85%, matching the veteran chefs.
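A rough sketch of what "few-shot prompting" on raw numbers looks like in practice: a handful of labeled example patients are written into the prompt before the new case. The feature names, wording, and risk labels below are illustrative assumptions, not the paper's actual prompt template.

```python
# Sketch: turning tabular lab values into a few-shot prompt for a MedLLM.
def patient_to_text(p):
    """Render one patient's structured values as a short text line."""
    return (f"Age: {p['age']}, LDL cholesterol: {p['ldl']} mg/dL, "
            f"eGFR: {p['egfr']} mL/min, smoker: {'yes' if p['smoker'] else 'no'}")

def build_few_shot_prompt(examples, new_patient):
    lines = ["You are a cardiologist. Estimate 1-year risk (low/high)."]
    for p, outcome in examples:  # the labeled examples = the "few shots"
        lines.append(f"Patient: {patient_to_text(p)}\nRisk: {outcome}")
    lines.append(f"Patient: {patient_to_text(new_patient)}\nRisk:")
    return "\n\n".join(lines)

examples = [({'age': 71, 'ldl': 180, 'egfr': 45, 'smoker': True}, "high"),
            ({'age': 52, 'ldl': 110, 'egfr': 90, 'smoker': False}, "low")]
new_patient = {'age': 66, 'ldl': 150, 'egfr': 60, 'smoker': True}

print(build_few_shot_prompt(examples, new_patient))
```

Finetuning, by contrast, bakes this kind of hospital-specific mapping into the model's weights instead of into the prompt, which is why a small finetuned model can beat a large prompted one.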

The Big Surprises

1. The "Small but Mighty" Discovery
Usually, in the world of AI, bigger is better. A 70-billion-parameter model is a giant brain; an 8-billion one is a smaller brain. But this study found that if you train the small brain specifically on heart disease data, it can outperform the giant, untrained brain. It's like a specialized mechanic who knows your specific car model better than a generalist who knows everything about cars but nothing about yours.

2. The "Over-Confident" Problem (Calibration)
While the AI interns were good at ranking patients (knowing who is riskier than whom), they were bad at guessing the exact percentage.

  • The Metaphor: Imagine the AI says, "There is a 90% chance this patient will face a crisis," when the real risk is closer to 60%. The patients it flags really are the riskier ones, but the percentages it attaches are systematically inflated (over-prediction), with occasional misses in the other direction.
  • The Fix: The researchers applied a "calibration filter" (called Platt scaling). Think of this as a translator that takes the AI's raw guess and adjusts it to match reality. This simple step fixed the AI's confidence issues, making its predictions much more trustworthy for doctors.
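A toy sketch of the Platt-scaling idea: fit a simple logistic regression that maps the model's raw risk scores onto the outcomes that actually happened. The data below is simulated so that the "model" systematically over-predicts risk; none of the numbers come from the paper.

```python
# Sketch of Platt scaling: learn a mapping from raw scores to honest probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
logit = rng.normal(0, 2, 5000)                 # latent "true" patient risk
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # outcomes that actually occurred
over = 1 / (1 + np.exp(-(logit + 1.5)))        # raw scores: risk pushed too high

# The "calibration filter": a one-feature logistic regression on the raw scores.
platt = LogisticRegression().fit(over.reshape(-1, 1), y)
fixed = platt.predict_proba(over.reshape(-1, 1))[:, 1]

print(f"actual event rate: {y.mean():.2f}")
print(f"mean raw risk:     {over.mean():.2f}")   # too high: over-prediction
print(f"mean calibrated:   {fixed.mean():.2f}")  # pulled back toward reality
```

Because the mapping is monotone, the ranking of patients (and hence the AUROC) is untouched; only the probability scale is corrected.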

3. No Need for the "Essay"
Previous studies tried to feed these AI models long, messy doctor's notes. This study proved you don't need the essay. You can just feed the AI the raw numbers (the blood test results), and it can still do a fantastic job. This is huge because it means this technology can be used in almost any hospital, even those that don't have perfect digital records.

The Verdict

The paper concludes that Medical AI is ready for the big leagues.

  • The Old Guard (Regression) is still the reliable workhorse, especially when you need speed and simplicity.
  • The New Guard (MedLLMs) is a powerful new tool. With the right "prompt" (instructions) or a little bit of specific training, they can match or even beat the best traditional methods.

The Takeaway:
We don't have to choose between the old math and the new AI. The future is likely a hybrid kitchen: using the speed of traditional math for quick checks, and the deep understanding of AI for complex cases, all while using the simple, everyday blood test data that hospitals already collect. It's a win for patients, because better prediction means better, earlier care.
