An AI Implementation Science Study to Improve Trustworthy Data in a Large Healthcare System

This study presents an AI implementation science case study at Shriners Children's that modernizes its research data infrastructure to the OMOP CDM standard, introduces a Python-based tool that extends data quality assessment with Trustworthy AI principles, and evaluates hybrid implementation strategies on clinical applications such as Craniofacial Microsomia to accelerate trustworthy AI adoption in healthcare.

Benoit L. Marteau, Andrew Hornback, Shaun Q. Tan, Christian Lowson, Jason Woloff, May D. Wang

Published 2026-03-06

Imagine a massive, bustling hospital network called Shriners Children's. It's like a giant library with 22 different branches, each holding millions of patient stories (medical records). For years, these stories were written in different languages, on different types of paper, and stored in different filing systems. Some were in old ledgers (ICD-9), some in newer digital files (ICD-10), and some were just scribbled notes.

The doctors and researchers wanted to use Artificial Intelligence (AI) to read these stories, find patterns, and help kids get better. But there was a problem: You can't teach a robot to read if the books are written in gibberish or are missing pages.

This paper is the story of how the team at Shriners cleaned up their library, built a better filing system, and tested if their new AI tools actually work in the real world.

Here is the breakdown of their journey, using simple analogies:

1. The Problem: A Messy Library

The researchers realized that before they could build a fancy AI robot, they needed to fix the foundation.

  • The Old System: Their data warehouse was like a library built in 2015 using an old map. The books were there, but the cataloging system was outdated.
  • The Goal: They wanted to move everything to a modern, universal standard called OMOP CDM. Think of this as translating every book in the library into a single, perfect language that every computer in the world understands.
  • The Hurdle: The tools they usually use to check the library's quality (called the "Data Quality Dashboard") were built with programming languages (R and Java) that didn't play nice with their new, secure cloud computer system (Microsoft Fabric). It was like trying to use a gas-powered car engine in an electric car.

2. The Solution: Building a New Tool

Instead of waiting for the old tools to be fixed, the team built their own.

  • The Translator: They rewrote the quality-checking tool in Python (a popular coding language), making it compatible with their new cloud system.
  • The "Trustworthy" Check: They didn't just check if the books were there; they checked if the stories made sense. They used a framework called METRIC, which is like a checklist for a detective:
    • Measurement: Did the doctor write this down correctly, or was it a typo?
    • Timeliness: Is this information up-to-date, or is it from 10 years ago?
    • Representativeness: Does this data cover all types of patients, or just a specific group?
    • Informativeness: Is there enough detail, or are there huge gaps?
    • Consistency: Do the numbers match up across different hospital branches?
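To make the detective's checklist concrete, here is a minimal Python sketch of what checks along a few of these dimensions could look like. The record fields, function names, and cutoff date are all illustrative inventions, not the paper's actual tool; each check simply returns the fraction of data that passes.

```python
from datetime import date

# Toy patient records, standing in for rows in a research data warehouse.
records = [
    {"site": "A", "code": "73300", "date": date(2023, 5, 1), "value": 12.5},
    {"site": "A", "code": "73300", "date": date(2012, 1, 9), "value": None},
    {"site": "B", "code": "73300", "date": date(2024, 2, 3), "value": 11.8},
]

def informativeness(rows):
    """Share of rows with a non-missing value (gaps lower the score)."""
    return sum(r["value"] is not None for r in rows) / len(rows)

def timeliness(rows, cutoff=date(2020, 1, 1)):
    """Share of rows recorded after a freshness cutoff."""
    return sum(r["date"] >= cutoff for r in rows) / len(rows)

def representativeness(rows, expected_sites={"A", "B", "C"}):
    """Share of expected hospital sites actually present in the data."""
    return len({r["site"] for r in rows} & expected_sites) / len(expected_sites)

print(round(informativeness(records), 2))     # 2 of 3 rows have a value
print(round(timeliness(records), 2))          # 2 of 3 rows are recent
print(round(representativeness(records), 2))  # 2 of 3 expected sites appear
```

In a real pipeline these scores would be computed per table and per site, then rolled up into the kind of dashboard the team rebuilt in Python.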

3. The Experiment: Two Ways to Test

The team tested their new system in two ways:

A. The "Systematic" Approach (The Big Sweep)
They looked at the entire library (all 22 hospitals).

  • Result: After modernizing the system, the quality score went up by about 8%. It was like organizing the library so you could find books faster.
  • The Catch: They found that data was missing in specific patterns. For example, one hospital branch might have great records for surgeries but terrible records for mental health notes. This told them that the AI needs to be careful about where it gets its data.
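Finding those missing-data patterns boils down to scoring completeness per hospital branch and per record type, then flagging the weak spots. The sketch below uses invented numbers and a made-up threshold; the point is the pattern-spotting logic, not the values.

```python
# Toy completeness scores (share of expected records present)
# per site and data domain. Numbers are invented for illustration.
completeness = {
    ("Site 1", "surgery"):       0.95,
    ("Site 1", "mental_health"): 0.40,
    ("Site 2", "surgery"):       0.90,
    ("Site 2", "mental_health"): 0.88,
}

def weak_spots(scores, threshold=0.6):
    """Flag site/domain pairs whose completeness falls below a threshold."""
    return [key for key, score in scores.items() if score < threshold]

print(weak_spots(completeness))  # [('Site 1', 'mental_health')]
```

A flag like this is exactly the warning the AI needs: great surgical records at one branch say nothing about its mental health notes.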

B. The "Case Study" Approach (The Deep Dive)
They picked one specific condition: Craniofacial Microsomia (CFM). This is a complex condition affecting a child's face and jaw, requiring many different doctors (surgeons, psychologists, etc.).

  • The Goal: Could the AI predict if a child with this condition might develop mental health issues based on their surgery history?
  • The Twist: They tried two ways to feed data to the AI:
    1. Raw Data: Feeding the AI the original, messy codes (like "Code A" from 2010 and "Code B" from 2020).
    2. Harmonized Data: Translating everything into the new, clean language first.
  • The Surprise: The AI performed almost exactly the same with both methods!
    • Analogy: Imagine trying to solve a puzzle. You can use the original, jagged puzzle pieces, or you can smooth them out first. The team found that smoothing them out (harmonizing) didn't make the puzzle easier to solve, but it did make it much easier to share the puzzle with other people.
    • Lesson: Standardizing data doesn't hurt the AI; it just makes the data safer and easier to share.
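Harmonization itself is essentially a lookup: codes from different vocabularies that mean the same thing get mapped to one shared concept ID. In OMOP this is driven by the OHDSI vocabulary tables; the tiny map below is a hand-rolled stand-in with a made-up concept ID, purely to show the idea.

```python
# Hypothetical code-to-concept map. Real OMOP mappings come from the
# OHDSI vocabulary tables; this concept ID is invented for illustration.
CONCEPT_MAP = {
    ("ICD9", "754.0"):  4082561,  # older vocabulary
    ("ICD10", "Q67.4"): 4082561,  # newer vocabulary, same condition
}

def harmonize(source_vocab, source_code):
    """Translate a source code into a shared standard concept ID."""
    return CONCEPT_MAP.get((source_vocab, source_code))

# Two records written in different 'languages' land on one concept:
print(harmonize("ICD9", "754.0") == harmonize("ICD10", "Q67.4"))  # True
```

Whether the model sees `754.0` or `Q67.4` or the shared concept, the underlying signal is the same, which is why performance barely changed; what harmonization buys is that other institutions can read the data too.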

4. The Big Takeaway

The paper concludes that building AI for healthcare isn't just about writing smart algorithms. It's about plumbing.

  • The "Know-Do" Gap: We know how to build AI, but we struggle to put it to work in hospitals because the data is messy.
  • The Hybrid Approach: You need a mix of Systematic (fixing the whole library) and Case-Specific (solving one specific puzzle) strategies.
  • The Future: They are now working on FHIR, which is like a universal "USB port" for medical data. It allows different apps and systems to plug in and talk to each other instantly, rather than just looking at static dashboards.
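The "USB port" idea is easiest to see in a payload. FHIR fixes the names and shape of each resource, so any compliant system parses it the same way. The field names below follow the FHIR Patient resource; the values are invented.

```python
import json

# A minimal FHIR Patient resource. Field names follow the FHIR
# specification; the values are made up for illustration.
patient = {
    "resourceType": "Patient",
    "id": "example-123",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "2015-06-01",
}

# Because the schema is standardized, sender and receiver need no
# custom translation layer -- serialize, send, parse, done.
payload = json.dumps(patient)
parsed = json.loads(payload)
print(parsed["resourceType"], parsed["name"][0]["family"])  # Patient Doe
```

That is the step beyond a static dashboard: instead of a human reading quality reports, systems exchange live, structured records directly.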

In a Nutshell

This study is a blueprint for how to clean up a giant, messy medical data warehouse so that AI can actually be trusted to help doctors. They proved that while cleaning the data is hard work, it's the only way to ensure that when an AI makes a suggestion, it's based on truth, not confusion. They built a new tool to check the data, found that standardizing data helps collaboration without hurting performance, and showed that the future of medical AI depends on making data "interoperable" (able to talk to each other) rather than just accurate.