EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records

Imagine a hospital as a massive, ancient library. This library doesn't just hold books; it holds the entire life story of every patient who has ever walked through its doors. It contains details about their heartbeats, the medicines they took, the surgeries they had, and how much their care cost. This digital library is called an Electronic Health Record (EHR).

The problem? The library is so huge and organized in such a complex way that only a few librarians (the IT experts and doctors with special training) know how to find specific information. If a regular nurse or a billing manager wants to ask, "How many patients with high blood pressure were treated last month?" they can't just ask the library. They have to fill out a complex form or wait for a tech expert to write a specific code (called SQL) to get the answer. It's like trying to get a book from a library by speaking in a secret, robotic language that no one else understands.

Enter EHRSQL: The "Google" for Hospital Data.

This paper introduces a new benchmark dataset called EHRSQL. Think of it as the training ground and final exam for a translator that turns normal human questions into that secret robotic code, so that anyone can ask the hospital database a question in plain English.

Here is how they built it and why it's special, explained through a few simple metaphors:

1. The "Real People" Poll (Not Just Robots)

Most previous attempts to teach computers this skill were like training a chef by only showing them a menu written by a robot. They used pre-made templates.

What they did: The researchers went to a real hospital and asked 222 actual staff members (doctors, nurses, insurance reviewers) to write down the questions they actually wanted to ask the database.
The Result: Instead of robotic questions like "Select patient ID," they got real questions like, "Show me the top 5 drugs prescribed to patients diagnosed with hypotension in the last two months." This makes the system speak the language of real hospital workers.

2. The "Time Travel" Challenge

In a hospital, time is everything. A doctor might ask, "What was the patient's heart rate yesterday?" or "How many days since the last surgery?"

The Challenge: Computers are notoriously bad at understanding "yesterday" or "last month" because those words change meaning every day.
The Solution: The researchers built a special "Time Machine" into the dataset. They taught the system to understand different ways humans talk about time (absolute dates, relative time like "last week," and mixed time). They even shifted the dates in the database into a simulated future range (2100–2105) so that words like "today" and "yesterday" can be resolved against a fixed "current time" instead of changing with real-world dates.

3. The "Honesty" Test (Knowing What It Doesn't Know)

This is the most crucial part. Imagine a student taking a test. If they don't know the answer, a bad student might guess and get it wrong. A trustworthy student raises their hand and says, "I don't know."

The Problem: In healthcare, guessing is dangerous. If a system hallucinates (makes up) an answer about a patient's allergy, it could be fatal.
The Solution: The researchers included "unanswerable" questions in their dataset. These are questions that sound normal but that the database simply cannot answer, such as "What is the best way to treat a headache?" or "When is the patient's next appointment?" (the database only knows the past, not the future).
The Goal: They trained the AI to recognize these questions and refuse to answer them, rather than making up a fake SQL query. This is called "Trustworthy Semantic Parsing."

4. The "Nested" Puzzle

The hospital database is like a set of Russian nesting dolls. To find one piece of information, you often have to open three or four different layers of tables (Admissions -> ICU Stays -> Vital Signs).

The Innovation: Instead of just teaching the AI to jump between tables randomly, they taught it to build "nested" queries. It's like giving the AI a map that says, "First, find the patient. Then, find their hospital stay. Then, find the specific day. Then, look at the heart rate." This is much more efficient and accurate for huge databases.

Why Does This Matter?

Currently, hospitals are sitting on a goldmine of data, but most of it is locked behind a wall of complex code. EHRSQL is the key to unlocking that door.

For Doctors: They can instantly see trends, like "Are our new heart medications working better than the old ones?"
For Administrators: They can quickly calculate costs or insurance claims without waiting for IT.
For Safety: By teaching the AI to say "I don't know" when it's unsure, they prevent dangerous medical errors caused by computer hallucinations.

In short, EHRSQL is a bridge. It connects the messy, complex reality of hospital data with the simple, natural way humans ask questions, all while ensuring the computer is smart enough to know when not to answer. It's a giant leap toward making healthcare data work for everyone, not just the tech experts.

Here is a detailed technical summary of the paper "EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records."

1. Problem Statement

Electronic Health Records (EHRs) are vast relational databases containing comprehensive patient medical histories. While they hold immense value for clinical decision-making, accessing this data is currently a bottleneck. Most hospital staff (physicians, nurses, administrators) rely on pre-defined rule-based systems or require specialized training to write SQL queries to retrieve specific information.

Existing Text-to-SQL datasets for healthcare (e.g., MIMICSQL, emrKBQA) suffer from significant limitations:

Synthetic Data: Questions are often automatically generated using templates, failing to reflect the diverse, complex, and natural language needs of real hospital staff.
Simplicity: They often restrict queries to simple operations on a few tables, ignoring the complex, multi-step reasoning required in real clinical scenarios.
Lack of Robustness: They assume all input questions are answerable, ignoring the critical need for systems to recognize and refuse unanswerable questions (e.g., those requiring external knowledge or outside the database schema) to ensure patient safety.
Time Sensitivity: They inadequately handle the complex time expressions (absolute, relative, mixed) crucial in healthcare contexts.

2. Methodology

The authors constructed EHRSQL, a large-scale, practical Text-to-SQL dataset linked to two open-source EHR databases: MIMIC-III and eICU. The construction process involved four main stages:

A. Data Collection (The Poll)

Participants: 222 hospital staff members (physicians, nurses, insurance reviewers, etc.) from Konyang University Hospital.
Process: A survey was conducted asking staff what structured information they frequently seek.
Output: 1,742 utterances were collected. These were filtered to remove ambiguous or external-knowledge-dependent queries, resulting in a set of valid questions and a set of "unanswerable" questions (those that cannot be answered by the database schema).

B. Question and SQL Generation

Template Creation: Valid utterances were converted into 230 question templates (174 answerable, 56 unanswerable). These templates cover single patients, groups of patients, and general statistics.
Time Template System: To address the time-sensitive nature of healthcare, the authors developed a systematic time filter system with three types:
1. Global: Constrains the full time range (e.g., "since 2020").
2. Within: Indicates time between events (e.g., "within 2 months of diagnosis").
3. Exact: Points to specific times (e.g., "at 12:00").
  Each filter includes expression types (absolute, relative, mixed), units (day, month, hospital visit), and intervals.
SQL Annotation: Four graduate students manually labeled SQL queries for the templates over five months.
- Nested Queries: Unlike standard datasets that favor JOIN operations, annotators were instructed to use nested queries (subqueries) to reflect the hierarchical structure of EHR schemas and improve execution efficiency on massive datasets (e.g., MIMIC-III has 330M rows in chartevents).
- Schema Agnosticism: The question authors did not know the database schema, ensuring the dataset reflects real-world usage.
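The template and nested-query ideas above can be sketched in a few lines. This is a minimal illustration, not the authors' annotation pipeline: the toy tables, column names, the question template, and the `TIME_FILTERS` mapping are all assumptions made for the example, and SQLite stands in for the real database engine. It fills one question template with a "global" absolute time filter and pairs it with a nested SQL query that walks Admissions -> ICU Stays -> Vital Signs, then checks that a JOIN-based form returns the same answer.

```python
import sqlite3

# Toy schema for illustration only; these are NOT the real MIMIC-III tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admissions (hadm_id INTEGER PRIMARY KEY, subject_id INTEGER);
CREATE TABLE icustays (stay_id INTEGER PRIMARY KEY, hadm_id INTEGER);
CREATE TABLE vitals (stay_id INTEGER, label TEXT, value REAL, charttime TEXT);
INSERT INTO admissions VALUES (100, 1), (101, 2);
INSERT INTO icustays VALUES (10, 100), (11, 101);
INSERT INTO vitals VALUES
    (10, 'heart rate', 88.0, '2105-03-01 08:00:00'),
    (10, 'heart rate', 92.0, '2105-03-02 08:00:00'),
    (10, 'heart rate', 70.0, '2104-12-30 08:00:00'),
    (11, 'heart rate', 120.0, '2105-03-01 09:00:00');
""")

# Hypothetical question template with a patient slot and a time-filter slot.
QUESTION_TEMPLATE = "What was the maximum heart rate of patient {patient_id} {time_text}?"

# Nested form: each subquery narrows the search one level down the hierarchy.
NESTED_SQL_TEMPLATE = """
SELECT MAX(value) FROM vitals
WHERE label = 'heart rate'
  AND stay_id IN (
      SELECT stay_id FROM icustays
      WHERE hadm_id IN (
          SELECT hadm_id FROM admissions WHERE subject_id = ?
      )
  )
  AND {time_sql}
"""

# Equivalent JOIN form, shown only for comparison.
JOIN_SQL = """
SELECT MAX(v.value) FROM vitals AS v
JOIN icustays AS i ON v.stay_id = i.stay_id
JOIN admissions AS a ON i.hadm_id = a.hadm_id
WHERE v.label = 'heart rate'
  AND a.subject_id = ?
  AND strftime('%Y', v.charttime) >= '2105'
"""

# One "global" filter with an absolute expression; the SQL fragment is an assumption.
TIME_FILTERS = {
    "since 2105": "strftime('%Y', charttime) >= '2105'",
}

def instantiate(patient_id, time_text):
    """Fill the question template and the nested SQL template with one time filter."""
    question = QUESTION_TEMPLATE.format(patient_id=patient_id, time_text=time_text)
    sql = NESTED_SQL_TEMPLATE.format(time_sql=TIME_FILTERS[time_text])
    return question, sql

question, sql = instantiate(1, "since 2105")
max_hr = conn.execute(sql, (1,)).fetchone()[0]
print(question)
print(max_hr)
```

On a toy table both forms behave the same; the efficiency argument for nesting only becomes relevant at the scale of tables such as chartevents.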

C. Data Augmentation and Pre-processing

Paraphrasing: To add linguistic diversity, templates were paraphrased using human annotators and ML models (T5, back-translation), followed by quality filtering (RoBERTa, GPT-Neo).
Database Modification:
- Cost Table: A new cost table was added to both databases to support insurance and administrative queries.
- Time Shifting: Patient admission times were shifted to a simulated future range (2100–2105) to allow for realistic relative time expressions (e.g., "yesterday," "last month") relative to a fixed "current time."
- De-identification: Patient-specific values were shuffled across records to prevent re-identification while maintaining semantic structure.
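To make the time-shifting step concrete, here is a small sketch of how a fixed "current time" lets relative expressions resolve to stable windows. The `CURRENT_TIME` value, the 93-year offset, the helper names `shift_years` and `resolve_relative`, and the three supported expressions are assumptions chosen for illustration; they are not taken from the paper's code.

```python
from datetime import datetime, timedelta

# Assumed fixed "current time" at the end of the shifted range (illustrative value).
CURRENT_TIME = datetime(2105, 12, 31, 23, 59)

def shift_years(timestamp, offset_years):
    """Shift a timestamp by whole years, keeping month, day, and time of day."""
    try:
        return timestamp.replace(year=timestamp.year + offset_years)
    except ValueError:
        # Feb 29 shifted into a non-leap year falls back to Feb 28.
        return timestamp.replace(year=timestamp.year + offset_years, day=28)

def resolve_relative(expression, now=CURRENT_TIME):
    """Map a relative time expression to a half-open [start, end) window."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if expression == "today":
        return midnight, midnight + timedelta(days=1)
    if expression == "yesterday":
        return midnight - timedelta(days=1), midnight
    if expression == "last month":
        this_month_start = midnight.replace(day=1)
        last_month_start = (this_month_start - timedelta(days=1)).replace(day=1)
        return last_month_start, this_month_start
    raise ValueError(f"unsupported time expression: {expression!r}")

# A hypothetical original record, shifted into the simulated future.
shifted = shift_years(datetime(2012, 12, 30, 14, 5), 93)
start, end = resolve_relative("yesterday")
print(shifted, start <= shifted < end)
```

Because the window is computed from `CURRENT_TIME` rather than the wall clock, the same question maps to the same rows no matter when the benchmark is run.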

D. Benchmark Task: Trustworthy Semantic Parsing

The paper defines a new task whose evaluation goes beyond simple execution accuracy. The model must:

Distinguish Answerability: Identify if a question is answerable or unanswerable (Out-of-Domain detection).
Generate SQL: Produce correct SQL for answerable questions.
Refusal Mechanism: If the model's uncertainty (measured by decoding entropy) exceeds a threshold, it must abstain rather than return a query.
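The refusal rule can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes the decoder exposes a probability distribution for each generated token, and it aggregates per-token entropies with a maximum, which is one plausible choice among several. The function names and the example threshold are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sequence_uncertainty(step_distributions):
    """Aggregate per-step entropies; taking the maximum is an assumption of this sketch."""
    return max(token_entropy(probs) for probs in step_distributions)

def answer_or_refuse(sql, step_distributions, threshold):
    """Return the generated SQL, or None to signal that the model abstains."""
    if sequence_uncertainty(step_distributions) > threshold:
        return None
    return sql

# Toy distributions over a four-token vocabulary.
confident_steps = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain_steps = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]

kept = answer_or_refuse("SELECT COUNT(*) FROM admissions", confident_steps, threshold=0.5)
refused = answer_or_refuse("SELECT COUNT(*) FROM admissions", uncertain_steps, threshold=0.5)
print(kept, refused)
```

A single highly uncertain decoding step is enough to trigger a refusal under the max aggregation, which errs on the side of caution.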

3. Key Contributions

EHRSQL Dataset: The first large-scale Text-to-SQL dataset for EHRs derived from real human polls rather than automated generation. It contains ~24,400 question-SQL pairs over databases that average 13.5 tables each (versus 5 tables for MIMICSQL).
Trustworthy QA Framework: Introduces the concept of "Trustworthy Semantic Parsing," where models must explicitly handle unanswerable questions to prevent hallucination in clinical settings.
Complex Time Handling: A systematic approach to generating and evaluating time-sensitive queries, covering 93.2% of queries involving time columns.
Nested Query Focus: A shift from standard JOIN-heavy queries to nested subqueries, better suited for the hierarchical and massive scale of real EHR databases.
Dual-Database Support: The first dataset to label SQL queries for both MIMIC-III and eICU, enabling cross-database generalization research.

4. Results and Findings

The authors evaluated baseline models (T5-base and T5-base with schema serialization) on the EHRSQL test set.

Performance on Unanswerable Questions:
- Models struggled to distinguish answerable from unanswerable questions without a confidence threshold.
- Using a percentile-based entropy threshold (rejecting queries with high uncertainty) significantly improved the F1 score for answerability ($F1_{ans}$) to ~94% on the validation set, while maintaining high execution accuracy.
Execution Accuracy:
- Standard models achieved moderate execution accuracy (~76-77% on answerable questions).
- Schema Serialization: Adding schema information to the input did not significantly improve performance, suggesting that pre-trained language models may not effectively leverage schema details in single-database settings for this specific task.
Zero-Shot Transfer:
- When applying a state-of-the-art model trained on the Spider dataset (GAP) to EHRSQL, performance dropped drastically (from 16.4% on MIMICSQL to 4.7% on EHRSQL).
- This highlights the domain gap: EHRSQL contains complex time operators and nested structures that parsers trained on Spider do not handle well, underscoring the need for domain-specific benchmarks.
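The percentile-based threshold mentioned in the results can be illustrated with a short sketch. Everything here is an assumption made for the example: the threshold is taken as a nearest-rank percentile of entropies from answerable validation questions, the answerability score is computed as the F1 of "answered" decisions against "answerable" labels, and the entropy values are made up. The paper's exact calibration and metric definitions may differ.

```python
import math

def percentile_threshold(entropies, percentile):
    """Nearest-rank percentile of a list of entropy values."""
    ordered = sorted(entropies)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def answerability_f1(records, threshold):
    """F1 of 'model answers' versus 'question is answerable' (assumed definition)."""
    answered = [is_answerable for entropy, is_answerable in records if entropy <= threshold]
    true_positives = sum(answered)
    total_answerable = sum(is_answerable for _, is_answerable in records)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(answered)
    recall = true_positives / total_answerable
    return 2 * precision * recall / (precision + recall)

# Made-up validation records: (decoding entropy, whether the question is answerable).
validation = [
    (0.10, True), (0.20, True), (0.30, True), (0.35, False),
    (0.40, True), (0.90, False), (1.20, False), (1.50, True),
]

answerable_entropies = [entropy for entropy, ok in validation if ok]
threshold = percentile_threshold(answerable_entropies, 80)
score = answerability_f1(validation, threshold)
print(threshold, round(score, 3))
```

Raising the percentile answers more questions at the cost of letting more unanswerable ones through, so the value is a tunable trade-off rather than a fixed constant.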

5. Significance

Bridging Research and Practice: EHRSQL moves Text-to-SQL research from academic toy problems to realistic healthcare scenarios, addressing the "last mile" problem of deploying AI in hospitals.
Safety First: By incorporating unanswerable questions and a refusal mechanism, the dataset promotes the development of AI systems that prioritize safety and reliability over blindly generating SQL, which is critical for clinical decision support.
Future Directions: The dataset serves as a foundation for interactive QA, multimodal EHR analysis, and the development of uncertainty-aware semantic parsing models.

In conclusion, EHRSQL represents a significant step forward in making AI accessible to non-technical hospital staff while ensuring the systems are robust enough to handle the complexities and safety requirements of real-world medical data.