
Evaluating Large Language Models for Translating Multimodal Phenotype Documentations into Executable EHR Phenotyping Algorithms

This study evaluates frontier large language models for translating multimodal clinical phenotype documentation into executable EHR algorithms, finding that while they effectively interpret structured text, their performance significantly degrades with diagram-only inputs, ultimately identifying documentation quality rather than model capability as the primary bottleneck.

Original authors: Yan, C., Xin, Y., Su, W.-C., Gangireddy, S., Durbhakula, S., Bruehl, S. P., Dickson, A. L., Li, L., Feng, Q., Malin, B. A., Derr, T., Wei, W.-Q.

Published 2026-05-22




Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef trying to recreate a famous dish, but you don't have the recipe. Instead, you have a messy stack of notes, some scribbled on napkins, some drawn as cartoons, and some written in a confusing mix of languages. Your goal is to turn these messy notes into a precise, step-by-step instruction manual that a robot kitchen can follow to cook the dish perfectly.

This paper is about testing two super-smart AI chefs (called Large Language Models, or LLMs) to see if they can do this job for medical research.

The Problem: The "Lost in Translation" Recipe

In medical research, scientists define specific groups of patients (like "people with Type 2 Diabetes") using complex rules. These rules are usually written in human-readable documents that look like a mix of stories, flowcharts, and tables.

To use these rules in a hospital's computer system, a human expert has to manually translate them into a database query language (SQL). This is like translating a poem into computer code. It takes a long time, it is tedious, and two different experts working from the same document might end up with slightly different results. The researchers wanted to see if AI could do this translation automatically.
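To make the translation task concrete, here is a minimal sketch of what a hand translation might look like. Everything in it is an assumption for illustration: the toy rule, the table names (`diagnoses`, `labs`), the column names, and the sample patients are hypothetical and are not taken from the paper or from PheKB.

```python
import sqlite3

# Hypothetical toy rule (NOT from the paper): a patient counts as a case if
# they have a Type 2 diabetes diagnosis code (ICD-10 E11.*) AND at least one
# HbA1c lab result of 6.5 or higher.
PHENOTYPE_SQL = """
SELECT DISTINCT d.patient_id
FROM diagnoses AS d
JOIN labs AS l ON l.patient_id = d.patient_id
WHERE d.icd10 LIKE 'E11%'
  AND l.test = 'HbA1c'
  AND l.value >= 6.5
ORDER BY d.patient_id
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up patients."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE diagnoses (patient_id INTEGER, icd10 TEXT);
        CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL);
        INSERT INTO diagnoses VALUES (1, 'E11.9'), (2, 'E11.65'), (3, 'I10');
        INSERT INTO labs VALUES
            (1, 'HbA1c', 7.2), (2, 'HbA1c', 5.9), (3, 'HbA1c', 8.0);
        """
    )
    return conn


def find_cases(conn):
    """Run the phenotype query and return the matching patient IDs."""
    return [row[0] for row in conn.execute(PHENOTYPE_SQL)]


print(find_cases(build_demo_db()))  # prints [1]
```

Only patient 1 has both the diagnosis code and a qualifying lab value. Real phenotype definitions stack many such rules, with code lists, time windows, and exclusions, which is why hand translation is slow and why two experts can diverge.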

The Experiment: Testing the AI Chefs

The researchers picked two of the smartest AI models available (OpenAI's GPT o3 and Anthropic's Claude Opus 4.1) and gave them five different "recipes" (medical definitions for conditions like kidney injury, heart attacks, and diabetes) from a public library called PheKB.

They tested the AI in three different ways, like giving the chef different types of instructions:

The Full Package: The AI got the whole document (text, charts, and diagrams).
Just the Story: The AI got only the written text and tables, but no pictures.
Just the Pictures: The AI got only the diagrams and flowcharts, with no words.
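The three setups above can be sketched as a small data structure. This is only an illustration of the experimental design as described here; the class name `PhenotypeDoc`, the function `build_conditions`, and the file name are hypothetical, and the paper's actual tooling may look nothing like this.

```python
from dataclasses import dataclass, field


@dataclass
class PhenotypeDoc:
    """A phenotype document split into its written and visual parts."""
    name: str
    text: str                                     # narrative text and tables
    diagrams: list = field(default_factory=list)  # e.g. flowchart image files


def build_conditions(doc):
    """Return the three input variants handed to a model."""
    return {
        "full": {"text": doc.text, "diagrams": list(doc.diagrams)},
        "text_only": {"text": doc.text, "diagrams": []},
        "diagram_only": {"text": "", "diagrams": list(doc.diagrams)},
    }


# Hypothetical example document
doc = PhenotypeDoc(name="demo", text="rule text", diagrams=["flow.png"])
for label, payload in build_conditions(doc).items():
    print(label, payload)
```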

The Results: What Worked and What Didn't

1. The "Picture-Only" Trap
When the AI tried to read only the diagrams (the flowcharts), its performance dropped sharply. It was like asking a chef to cook a complex meal just by looking at a drawing of a pot and a fork, with no text explaining the ingredients or heat levels. The AI missed crucial details, got the timing wrong, and produced instructions that wouldn't work.

2. The "Story" is King
When the AI got the written text (even without the pictures), it did a very good job. It turned out that the written words contained almost all the information needed. The AI could understand the logic and write the computer code accurately.

3. The AI is a Great Draftsman, Not a Final Editor
Both AI models were surprisingly good at understanding the big picture and the logic of the rules. However, they made specific types of mistakes:

Missing Ingredients: They sometimes forgot to include specific medical codes (like a specific type of medication).
Wrong Numbers: They might get a threshold wrong (e.g., saying "blood pressure over 140" when the rule was "over 150").
Making Things Up: Sometimes, the AI invented rules or conditions that weren't in the original document at all (a "hallucination").
Confusing the Format: When looking at diagrams, they often couldn't figure out how to turn a visual arrow into a logical "if-then" computer command.
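Several of these mistakes are the kind a simple automated check can flag before a human reviewer steps in. The sketch below is not the paper's evaluation method; it is a hypothetical helper (`review_draft`) that assumes the codes and thresholds have already been pulled out of both the source document and the AI's draft.

```python
def review_draft(expected_codes, draft_codes, expected_thresholds, draft_thresholds):
    """Compare an AI draft against the source rules and list the differences."""
    issues = []
    # Missing ingredients: codes in the document but absent from the draft
    for code in sorted(set(expected_codes) - set(draft_codes)):
        issues.append(f"missing code: {code}")
    # Making things up: codes in the draft that the document never mentions
    for code in sorted(set(draft_codes) - set(expected_codes)):
        issues.append(f"unsupported code: {code}")
    # Wrong numbers: thresholds that are absent or differ from the document
    for name, expected in sorted(expected_thresholds.items()):
        if name not in draft_thresholds:
            issues.append(f"missing threshold: {name}")
        elif draft_thresholds[name] != expected:
            issues.append(
                f"wrong threshold: {name} is {draft_thresholds[name]}, expected {expected}"
            )
    # Making things up: thresholds the document never defines
    for name in sorted(set(draft_thresholds) - set(expected_thresholds)):
        issues.append(f"unsupported threshold: {name}")
    return issues


# Made-up values that echo the blood pressure example above
for issue in review_draft(
    expected_codes={"E11.9", "E11.65"},
    draft_codes={"E11.9", "E10.9"},
    expected_thresholds={"systolic_bp": 150},
    draft_thresholds={"systolic_bp": 140},
):
    print(issue)
```

A check like this only catches differences in codes and numbers. It cannot tell whether the logic connecting them is right, so it supports a human reviewer rather than replacing one.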

The Big Takeaway

The paper concludes that these AI models are not ready to replace human experts yet. They cannot just look at a messy document and spit out a perfect, ready-to-use computer program.

However, they are excellent first-draft generators. If you give them clear, structured text, they can write a very good starting point for the code. Because they can make subtle but dangerous mistakes (like getting a number wrong or missing a rule), a human expert must always check their work.

The Final Lesson:
The biggest problem isn't that the AI isn't smart enough; it's that the medical documents aren't written in a way that is easy for computers to read. If doctors and researchers standardized their notes to be clearer and more structured (like writing a recipe in a standard format rather than scribbling on a napkin), the AI would become much more useful. Until then, the AI is a helpful assistant, but the human expert must remain the boss.
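As a rough picture of what "a recipe in a standard format" could mean, here is a minimal sketch of a phenotype rule written as structured data instead of free text. The field names and the toy rule are assumptions made up for this illustration; the paper does not prescribe this format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabCriterion:
    test: str
    operator: str  # one of ">=", ">", "<=", "<"
    value: float


@dataclass(frozen=True)
class PhenotypeSpec:
    name: str
    icd10_prefixes: tuple  # diagnosis code families, e.g. ("E11",)
    lab: LabCriterion


def describe(spec):
    """Render the structured rule as one unambiguous sentence."""
    codes = " or ".join(f"{prefix}*" for prefix in spec.icd10_prefixes)
    return (
        f"{spec.name}: diagnosis code {codes} AND "
        f"{spec.lab.test} {spec.lab.operator} {spec.lab.value}"
    )


# Hypothetical toy rule, not taken from PheKB
toy = PhenotypeSpec("Toy T2D", ("E11",), LabCriterion("HbA1c", ">=", 6.5))
print(describe(toy))  # prints Toy T2D: diagnosis code E11* AND HbA1c >= 6.5
```

When every code list and threshold lives in a named field, there is nothing for a reader, human or AI, to guess at or misread from a drawing.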
