MIMIC-IV-Phenotype-Atlas (MIPA): A Publicly Available Dataset for EHR Phenotyping

The paper introduces MIMIC-IV-Phenotype-Atlas (MIPA), which it describes as the first publicly available benchmark dataset of expert-annotated discharge summaries covering 16 phenotypes. The dataset enables standardized evaluation of phenotyping methods, and in the authors' benchmark a large language model (GPT-4o) outperformed rule-based and machine learning approaches on 13 of the 16 phenotypes.

Original authors: Yamga, E., Goudrar, R., Despres, P.

Published 2026-04-24

CC BY 4.0

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a massive mystery. You have a library containing millions of patient medical records (Electronic Health Records, or EHRs). These records are a goldmine of information, but they are messy. Some are neat lists of codes (like "Diabetes: Yes"), while others are long, rambling stories written by doctors in their own words (discharge summaries).

Your goal is to find specific groups of patients—say, all the people who have "Heart Failure" or "Depression"—to study them. This process of finding and grouping patients is called Phenotyping.

For a long time, researchers trying to solve this mystery had a major problem: They were all playing different games.

One researcher might use a simple rule: "If the code for Heart Failure appears, count them." Another might use a complex computer program to read the doctor's stories. But because they were using different sets of patient records and different definitions of what counts as "Heart Failure," they couldn't fairly compare who was the best detective. It was like trying to compare a sprinter running on a track to a swimmer in a pool; you couldn't tell who was actually faster.

Enter MIPA: The Great Equalizer

This paper introduces MIPA (MIMIC-IV Phenotype Atlas). Think of MIPA as the Olympic Stadium built specifically for medical data detectives.

Here is how they built it:

The Raw Material: They started with MIMIC-IV, a huge public database of hospital records from a real hospital in Boston.
The Human Judges: Instead of letting computers guess, two human reviewers (a doctor and a medical student) independently read 1,456 patient stories and asked of each one: "Does this patient have Depression? Does this patient have Alcohol Abuse?"
The Consensus: If the two judges agreed, great. If they disagreed on one or two conditions, they sat down and talked it out until they reached a "gold standard" answer; stories with three or more disagreements were dropped. This left a set of 1,388 carefully labeled patient stories.
The Toolkit: They didn't just give you the answers; they built a machine that turns the messy hospital data into a clean, organized format that any computer program can use.

Now, anyone in the world can download MIPA and test their own "detective skills" (algorithms) on the exact same 1,388 cases. This allows for a fair, head-to-head race to see who is the best at finding patients.

The Race: Who Won?

To show off how useful MIPA is, the authors ran a race between four different types of "detectives" to see who could find the 16 different conditions (like Diabetes, Dementia, or Heart Failure) best:

The Rule-Follower (ICD Codes): This detective only looks at the official codes.
- Analogy: Like a librarian who only finds books if the barcode matches exactly.
- Result: Good for simple things, but misses the nuance. If the doctor's note clearly describes heart failure but nobody entered the matching code, this detective misses it.
The Keyword Hunter (TF-IDF): This detective scans for specific words like "diabetes" or "insulin."
- Analogy: Like a search engine on a website.
- Result: Works well if the word is there, but gets confused if the doctor uses a fancy synonym or describes the condition without using the exact keyword.
The Pattern Learner (Machine Learning): This detective is a computer trained on thousands of examples to spot patterns in numbers and codes.
- Analogy: A student who memorized a textbook but struggles with real-world stories.
- Result: It did okay, but it wasn't the champion. It struggled when the data was messy or incomplete.
The Super-Reader (Large Language Models / AI): This is the newest kind of detective, here GPT-4o.
- Analogy: A brilliant detective who can read the doctor's messy story, understand the context, and connect the dots. For example: "The note mentions leg swelling, a diuretic, and a low ejection fraction on the heart ultrasound, which points to heart failure," even if the words "heart failure" never appear.
- Result: The AI won. It was the best at 13 out of the 16 conditions. It especially shone when the clues were hidden in the long, complex stories rather than in the neat lists of codes.

Why Does This Matter?

Before MIPA, researchers were shouting into the void, claiming their methods were great but having no way to prove it against others.

MIPA is the referee. It provides:

A Fair Field: Everyone uses the same 1,388 cases.
A Clear Scoreboard: We can now see exactly which method works best for which disease.
A Path Forward: We learned that while simple rules work for some things, the future of finding patients lies in AI that can understand human language and context.

In short, this paper didn't just build a dataset; it built the standardized playground where the future of medical AI can finally compete, learn, and improve to help doctors find the right patients faster and more accurately.

1. Problem Statement

Electronic Health Records (EHRs) are a critical resource for clinical research, but their secondary use requires transforming raw data into research-grade cohorts through phenotyping (identifying patients with specific conditions). While methods have evolved from rule-based systems to Machine Learning (ML) and Large Language Models (LLMs), the field suffers from a lack of standardized, open-access benchmark datasets.

Current Limitations: Most existing algorithms are evaluated on institution-specific data with heterogeneous definitions, preventing fair head-to-head comparisons and reproducibility.
Data Gaps: Existing public datasets either lack structured EHR data (e.g., n2c2 NLP tasks) or lack unstructured clinical notes (e.g., standard MIMIC-IV usage), making them unsuitable for comprehensive phenotyping tasks that require both modalities.

2. Methodology

The authors developed the MIMIC-IV Phenotype Atlas (MIPA), a dataset derived from the MIMIC-IV v2.2 database, designed specifically for benchmarking EHR phenotyping.

A. Dataset Construction & Annotation

Source: MIMIC-IV database (Beth Israel Deaconess Medical Center, 2008–2019), containing 431,231 admissions.
Phenotypes: 16 distinct phenotypes were selected, covering varying prevalence, complexity, and temporality (e.g., Depression, Type I/II Diabetes, Heart Failure, DVT/PE, Metastatic Cancer).
Annotation Process:
- Candidates: 1,456 discharge summaries were initially identified based on ICD codes.
- Annotators: Two independent reviewers (an internal medicine physician and a medical student) labeled each summary for the presence/absence of all 16 phenotypes.
- Consensus: Disagreements were resolved via open deliberation. Documents with $\ge$ 3 disagreements were rejected; those with 1–2 disagreements underwent consensus review.
- Final Corpus: 1,388 expert-annotated discharge summaries, each carrying "gold" labels for all 16 phenotypes.
Quality Control: Inter-annotator agreement was high (Mean Document-level Kappa = 0.805; Mean Label-level Kappa = 0.771). 91% of initial disagreements were resolved via consensus.
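
The triage rule and the agreement statistic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `triage` and `cohen_kappa` and the toy labels are hypothetical, and kappa is computed with the standard two-rater formula (observed agreement minus chance agreement, divided by one minus chance agreement). How the paper aggregates kappa at the document and label level is not specified in this summary.

```python
from collections import Counter

def triage(labels_a, labels_b):
    """Route one document by how many of its phenotype labels differ.

    Thresholds follow the text: 0 disagreements -> accept,
    1-2 -> consensus review, 3 or more -> reject.
    """
    disagreements = sum(a != b for a, b in zip(labels_a, labels_b))
    if disagreements == 0:
        return "accept"
    if disagreements <= 2:
        return "consensus_review"
    return "reject"

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal rate per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

# Toy example (hypothetical labels for four phenotypes on one document).
physician = [1, 0, 1, 0]
student = [0, 0, 1, 1]
print(triage(physician, student))
print(cohen_kappa(physician, student))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which matters here because several phenotypes are rare and raw agreement on "absent" is easy to achieve.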

B. Processing Pipeline & Feature Engineering

To support supervised learning, the authors built a pipeline to transform raw MIMIC-IV data into structured feature matrices:

Hybrid Labeling Strategy:
- Training Set: "Silver" labels generated via weak supervision (rule-based ICD code matching) from the broader unannotated MIMIC-IV database.
- Validation/Test Sets: Exclusively the 1,388 "gold" labeled summaries (split 40% validation, 60% testing).
Feature Extraction:
- Structured Data: ICD codes (mapped to ICD-10), medications (normalized to generic names), labs, and chart events.
- Unstructured Data: Discharge summaries and radiology reports processed via MedCAT to extract UMLS Concept Unique Identifiers (CUIs).
- Dimensionality Reduction: Features with frequency below the 25th percentile (or 10th for CUIs) were removed to reduce noise.
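
The frequency-based pruning step can be sketched as follows. This is an assumption-laden illustration rather than the pipeline's actual code: it takes "frequency" to mean the number of admissions in which a feature appears, uses a nearest-rank percentile, and the names `percentile_cutoff` and `prune_rare_features` are hypothetical.

```python
import math
from collections import Counter

def percentile_cutoff(values, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def prune_rare_features(records, pct):
    """Drop features whose document frequency is below the pct-th percentile.

    records: list of sets, one set of feature names per admission.
    Returns the pruned records and the set of features that were kept.
    """
    doc_freq = Counter(f for rec in records for f in rec)
    cutoff = percentile_cutoff(list(doc_freq.values()), pct)
    kept = {f for f, n in doc_freq.items() if n >= cutoff}
    return [rec & kept for rec in records], kept

# Toy data: feature "f1" appears in 1 admission, "f2" in 2, ..., "f8" in 8.
records = [{f"f{i}" for i in range(1, 9) if i > j} for j in range(8)]
pruned, kept = prune_rare_features(records, pct=25)
print(sorted(kept))
```

Under these assumptions the same function would be called with `pct=25` for structured features and `pct=10` for CUIs, matching the two thresholds mentioned in the text.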

C. Benchmarking Framework

The authors evaluated four distinct phenotyping approaches on the MIPA test set:

ICD-based Heuristics: Rule-based classifiers using ICD code counts (thresholds $\ge$ 1, $\ge$ 2, $\ge$ 3).
Keyword-driven TF-IDF: Term Frequency-Inverse Document Frequency classifiers using curated keyword lists.
Supervised Machine Learning: Logistic Regression, Naive Bayes, Random Forest, and Gradient Boosting trained on silver-labeled data.
Large Language Models (LLMs): GPT-4o using Chain-of-Thought prompting to classify raw discharge summaries.
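
To make the ICD-based heuristic concrete, here is a minimal sketch of a count-threshold rule. The function name `icd_rule`, the toy code list, and the use of the ICD-10 prefix I50 (heart failure) as the phenotype definition are illustrative assumptions; the paper's actual code sets, and whether counts are taken per admission or per patient, are not given in this summary.

```python
def icd_rule(patient_codes, phenotype_prefixes, min_count=1):
    """Return True if at least `min_count` codes match the phenotype.

    patient_codes: list of ICD-10 code strings (repeats are counted).
    phenotype_prefixes: tuple of code prefixes defining the phenotype.
    """
    hits = sum(code.startswith(phenotype_prefixes) for code in patient_codes)
    return hits >= min_count

HEART_FAILURE = ("I50",)  # illustrative prefix only, not the paper's definition
codes = ["I50.9", "E11.9", "I50.22"]  # hypothetical coded record

# The three thresholds evaluated in the benchmark: >= 1, >= 2, >= 3.
for threshold in (1, 2, 3):
    print(threshold, icd_rule(codes, HEART_FAILURE, threshold))
```

The example record has two matching codes, so it passes at thresholds 1 and 2 but fails at 3. A stricter threshold trades recall for precision, since true cases with few recorded codes are dropped.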

3. Key Results

Dataset Characteristics

Size: 1,388 discharge summaries.
Prevalence: Highly variable. Hypertension (67.7%) and Depression (48.9%) were most common; DVT/PE complications (2.8%) and C. difficile complications (3.0%) were rare.
Data Richness: Median token count of 2,164 per summary; includes diverse sections (History of Present Illness, Past Medical History, etc.) and structured EHR features.

Benchmarking Performance (F1 Scores)

LLM Superiority: GPT-4o achieved the highest F1 score in 13 out of 16 phenotypes (Mean F1 = 0.85).
- Best Margins: LLMs excelled in conditions requiring contextual interpretation of narrative text (e.g., DVT/PE history, Type I Diabetes, Systemic Lupus), outperforming the next best method by 0.05 to 0.52 F1 points.
Supervised ML: Showed moderate improvements over baselines but generally underperformed compared to LLMs (Mean F1 $\approx$ 0.44). Performance varied by model type (e.g., Gradient Boosting for Lupus, Random Forest for Diabetes).
Baseline Methods:
- ICD Rules: Performed well for well-coded conditions (e.g., Heart Failure, Metastatic Cancer) but degraded significantly with stricter thresholds.
- TF-IDF: Effective for conditions with explicit keywords (e.g., Depression, Hypertension) but unreliable for complex or rare conditions.
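
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, computed per phenotype against the gold labels. The sketch below uses toy labels and hypothetical names (`f1_score`, `mean_f1`); it also assumes that the reported mean F1 is an unweighted average across phenotypes, which this summary does not confirm.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for label 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_f1(gold, pred):
    """Unweighted mean of per-phenotype F1 (assumed aggregation).

    gold, pred: dicts mapping phenotype name -> list of 0/1 labels.
    """
    scores = {name: f1_score(gold[name], pred[name]) for name in gold}
    return sum(scores.values()) / len(scores), scores

# Toy gold and predicted labels for two phenotypes (hypothetical).
gold = {"heart_failure": [1, 1, 0, 0], "depression": [1, 1, 0, 0]}
pred = {"heart_failure": [1, 1, 0, 0], "depression": [1, 0, 1, 0]}
average, per_phenotype = mean_f1(gold, pred)
print(per_phenotype, average)
```

F1 ignores true negatives, which makes it more informative than accuracy for the rare phenotypes in this dataset, where a classifier that always answers "absent" would still be right most of the time.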

4. Key Contributions

First Open Benchmark: MIPA is the first publicly available, expert-annotated dataset specifically designed for EHR phenotyping, bridging the gap between structured EHR data and unstructured clinical notes.
High-Quality Gold Standard: Provides a rigorous, consensus-based annotation of 16 diverse phenotypes with high inter-rater reliability, serving as a "ground truth" for the community.
Reproducible Pipeline: Offers a complete, open-source processing pipeline (GitHub) that transforms raw MIMIC-IV data into phenotype-specific feature matrices, enabling fair comparison of future methods.
Empirical Evidence for LLMs: Shows that an LLM (GPT-4o) outperformed traditional rule-based and supervised ML methods on 13 of 16 phenotypes, particularly where evidence is embedded in nuanced clinical narratives rather than structured codes.

5. Significance and Limitations

Significance:

MIPA addresses a critical bottleneck in computational phenotyping by enabling standardized, reproducible evaluation.
It shifts the paradigm from institution-specific validation to community-wide benchmarking.
The results highlight the specific utility of LLMs in extracting semantic meaning from clinical text, suggesting a future where hybrid models (combining structured data and LLMs) may become the standard.

Limitations:

Single Institution: Data is from one academic center (Beth Israel Deaconess), potentially limiting generalizability to other healthcare systems.
Annotation Scope: Relies on discharge summaries only; longitudinal data from other note types (e.g., progress notes) was not included.
Adjudication: While rigorous, the consensus process lacked a third-party adjudicator, and detailed decision logs were not preserved for every disagreement.
Prevalence Bias: The dataset prevalence does not perfectly mirror real-world epidemiology due to the sampling strategy based on ICD codes.

Conclusion:
MIPA provides a durable reference resource that combines expert curation with a reproducible technical pipeline. By facilitating fair comparisons between ICD heuristics, traditional ML, and LLMs, it aims to accelerate the development of robust, automated phenotyping systems for clinical research.