Can Machine Learning Algorithms use Contextual Factors to Detect Unwarranted Clinical Variation from Electronic Health Record Encounter Data during the Treatment of Children Diagnosed with Acute Viral Pharyngitis

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are the manager of a large chain of coffee shops. You have a strict rule: "Never serve decaf to customers who ordered espresso." It's a simple rule, right? But when you look at your sales data, you notice something strange. Some baristas are serving decaf to espresso lovers way more often than others.

Why is this happening? Is it because the customers are demanding it? Is it because the baristas are tired? Or is it just that some shops have different cultures?

This is exactly the problem doctors face with Acute Viral Pharyngitis (a sore throat caused by a virus). The medical rule is clear: Do not give antibiotics for a viral sore throat. Antibiotics kill bacteria, not viruses, so they don't help and can even cause harm. Yet, doctors still prescribe them too often. This is called Unwarranted Clinical Variation (UCV)—doing things differently when you shouldn't.

The authors of this paper asked: "Can we use a smart computer program (Machine Learning) to spot these 'bad coffee orders' automatically, just by looking at the digital records?"

Here is the story of how they did it, explained simply:

1. The Detective Work: Finding the "Bad Orders"

The researchers looked at electronic health records (EHR) from children who visited clinics with sore throats. They wanted to find the visits where a doctor gave antibiotics when they shouldn't have.

The Challenge: Usually, to know if a doctor made a mistake, you need a human expert to read the doctor's notes and say, "Yes, that was wrong." This is slow, expensive, and boring.
The Hack: They tried two types of "labels" for their computer:
- Gold Standard: Humans read the notes and manually marked the mistakes. (Accurate but slow).
- Weak Labels: The computer just looked at the raw data (e.g., "Did they prescribe antibiotics?") without a human reading the notes first. (Fast and easy).
The Surprise: The computer learned almost as well from the "Weak Labels" as it did from the "Gold Standard." It's like teaching a student to spot bad coffee just by looking at the receipt, rather than having them taste every cup.

2. The Smart Computer: The "Super-Scanner"

They didn't use just one type of computer brain. They tried three different "detectives" (Machine Learning algorithms):

Random Forest: Like a committee of 100 experts voting on whether a prescription was wrong.
CatBoost: A super-fast calculator that is great at handling messy data.
EBM (Explainable Boosting Machine): A detective that not only finds the mistake but explains why it thinks it's a mistake. This is crucial because doctors need to trust the computer.

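The Result: All three were good at spotting the errors, with a median score (AUC) of about 0.9. AUC is not the share of questions answered correctly. It is the chance that the computer ranks a randomly chosen "bad order" above a randomly chosen good one, where 0.5 is a coin flip and 1.0 is perfect.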

3. The "Why": What Actually Caused the Mistakes?

Once the computer found the mistakes, the researchers asked: "What factors made a doctor more likely to break the rules?"

They didn't look at the patient's symptoms (because the rule is the same for everyone). Instead, they looked at Contextual Factors—the environment around the doctor.

Think of it like this: If a barista keeps making bad coffee, is it because they are a bad person, or because the shop is chaotic?

The computer found the top 5 "Contextual Clues":

How Busy the Doctor Is (Case Volume): Surprisingly, doctors who saw fewer patients were less likely to prescribe unnecessary antibiotics, while doctors who saw large numbers of patients were more likely to prescribe them. One possible explanation is that a busy doctor "plays it safe" to avoid missing a diagnosis, but the data show the pattern, not the reason.
How Busy the Clinic Is: Similar to the doctor, busy clinics saw more "bad orders."
The Doctor's Degree: Nurse Practitioners (NPs) tended to prescribe unnecessary antibiotics less often than Medical Doctors (MDs), although this difference was not statistically significant.
Experience Level: Newer providers (fewer than 10 years of experience) followed the rules better than late-career providers. The study does not establish why; reliance on long-standing habits or "gut feeling" is one possibility.
The Type of Visit: Whether it was a quick check-up or a longer visit mattered.

4. The "Secret Sauce": The UCVA Ontology

To make sure their computer could talk to other computers in other hospitals, they used a special dictionary called the UCVA Ontology.

Analogy: Imagine every hospital speaks a different dialect. One says "High Volume," another says "Busy." The Ontology is like a universal translator that says, "Okay, 'High Volume' and 'Busy' both mean the same thing." This allows different hospitals to compare their "bad coffee" rates fairly.

5. Why This Matters

Speed: We don't need to hire armies of humans to read charts. The computer can scan millions of records in seconds.
Trust: Because they used "Explainable" models, the computer can say, "I flagged this because the doctor is very experienced and the clinic is very busy," rather than just giving a black-box answer.
Scalability: This method can be used in any hospital without needing to send all their private data to a central server. The model learns locally.

The Bottom Line

The researchers showed, on a small dataset from one health system, that Machine Learning can act as an efficient quality control inspector for medical care. By looking at the "context" (how busy the doctor is, their experience, the clinic type), the computer can spot when doctors are breaking the rules of antibiotic stewardship.

It turns out, the "bad coffee" isn't usually because the barista is bad; it's often because the shop is too chaotic, or the barista is too experienced and relies on old habits. Now, we have a tool to gently nudge them back to the right recipe.

1. Problem Statement

Unwarranted Clinical Variation (UCV) refers to patient care that deviates from evidence-based guidelines without justification from patient clinical characteristics, needs, or preferences. In the context of pediatric acute viral pharyngitis, UCV manifests as the inappropriate prescription of antibiotics, which contradicts Infectious Disease Society of America (IDSA) guidelines.

The study addresses four specific challenges in detecting UCV:

Complexity: Clinical care involves a multiplicity of decisions, making it difficult to judge appropriateness using traditional statistical methods that rely on relative comparisons (e.g., small-area analysis).
Data Limitations: Existing methods often require centralized data aggregation and cannot detect absolute variation (deviation from a standard) effectively.
Interpretability: Many high-performing machine learning (ML) models (e.g., deep learning, ensembles) are "black boxes," lacking the explainability required for clinical adoption.
Labeling Cost: Creating "gold-standard" labels for UCV via manual chart review is resource-intensive.

2. Methodology

Data Source and Study Population

Source: Retrospective data from the BIG-ARC clinical data warehouse (CDW) of UTHealth (academic health system), standardized via the PCORnet Common Data Model.
Timeframe: January 1, 2021 – December 30, 2024.
Population: Children aged 3–19 years with an ICD-10 code J02.8 (acute pharyngitis due to other specified organism).
Exclusion Criteria:
- Positive Group A Streptococcus test (rapid or culture).
- Pre-existing conditions requiring antibiotics (e.g., otitis media).
- Lack of clinical notes.
Final Dataset: 132 encounters. Of the candidate encounters, 112 received no antibiotic treatment and 38 did; chart review confirmed 20 of the 38 treatment cases as unwarranted (gold standard), and 112 + 20 matches the reported total of 132.

Feature Engineering & Ontology

The study utilized Local Context Factors (LCFs) derived from EHR data, mapped to the UCVA Ontology (Unwarranted Clinical Variation and Attribution) to ensure standardization and interoperability.

Site-level: Case volumes (J02.8 and total diagnoses), categorized by terciles (Low/Medium/High).
Provider-level: Credentials (MD, NP, PA), specialty, years of experience, case volumes, sex, and physician indicator.
Patient-level: Socioeconomic status via the Area Deprivation Index (ADI) (National percentile and State decile), race, and sex.
Feature Sets: Nine distinct sets were created, ranging from data-driven features to domain-knowledge-based LCFs.
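
The tercile categorization of case volumes described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the sample counts, and the choice of `statistics.quantiles` (with its default interpolation method) are assumptions, since the summary does not say how the cut points were computed.

```python
from statistics import quantiles


def tercile_labels(volumes):
    """Label each case volume Low/Medium/High using tercile cut points."""
    # quantiles(..., n=3) returns the two cut points that split the data into thirds.
    low_cut, high_cut = quantiles(volumes, n=3)
    labels = []
    for v in volumes:
        if v <= low_cut:
            labels.append("Low")
        elif v <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels


# Hypothetical per-site J02.8 case counts (illustrative only, not study data).
site_volumes = [3, 5, 8, 12, 20, 45, 60, 90, 150]
site_terciles = tercile_labels(site_volumes)
print(list(zip(site_volumes, site_terciles)))
```

Binning a skewed count into three ordered categories keeps the feature robust to a few very large sites, at the cost of discarding differences within each bin.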

Modeling Pipeline

Algorithm Selection: Four models were trained on the ALLC feature set: Logistic Regression (LR), Random Forest (RF), Explainable Boosting Machine (EBM), and CatBoost.
- Evaluation: Nested cross-validation (10-fold outer, 10-fold inner for tuning).
- Metric: Area Under the Curve (AUC).
Feature Set Evaluation: CatBoost was trained on six different feature sets to determine the optimal combination of LCFs.
Weak vs. Gold-Standard Labels: The study compared models trained on Gold-Standard labels (manual chart review) versus Weak labels (inferred directly from EHR treatment records without manual review).
Interpretability:
- EBM: Used for global feature importance and partial dependence plots.
- CatBoost: Used SHAP (SHapley Additive exPlanations) values for feature importance.
Statistical Validation: Multi-level clustering was assessed using mixed-effects logistic regression (melogit) to calculate Intraclass Correlation Coefficients (ICC).
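
The nested cross-validation scheme with AUC as the metric can be sketched in plain Python. This is an illustration under stated assumptions rather than the study's pipeline: a tiny k-nearest-neighbour scorer stands in for the actual models (LR, RF, EBM, CatBoost), the data are synthetic, the folds are not stratified, and 5 outer by 3 inner folds are used instead of the 10 by 10 reported above to keep the toy example small. All function names are hypothetical.

```python
import random
from statistics import median


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a fold contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_score(X, y, fit_idx, x_new, k):
    """Stand-in model: share of positives among the k nearest training points."""
    nearest = sorted(fit_idx, key=lambda i: abs(X[i] - x_new))[:k]
    return sum(y[i] for i in nearest) / len(nearest)


def k_folds(indices, k):
    """Split indices into k roughly equal, non-overlapping folds."""
    return [indices[i::k] for i in range(k)]


def mean_auc_over_folds(X, y, indices, n_folds, k):
    """Inner loop: average validation AUC for one hyperparameter value."""
    aucs = []
    for val_idx in k_folds(indices, n_folds):
        held_out = set(val_idx)
        fit_idx = [i for i in indices if i not in held_out]
        scores = [knn_score(X, y, fit_idx, X[i], k) for i in val_idx]
        fold_auc = auc([y[i] for i in val_idx], scores)
        if fold_auc is not None:
            aucs.append(fold_auc)
    return sum(aucs) / len(aucs) if aucs else 0.0


def nested_cv(X, y, outer_folds=5, inner_folds=3, grid=(1, 3, 5, 9), seed=0):
    """Outer loop estimates performance; inner loop tunes k on training data only."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    outer_aucs = []
    for test_idx in k_folds(indices, outer_folds):
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        best_k = max(
            grid, key=lambda k: mean_auc_over_folds(X, y, train_idx, inner_folds, k)
        )
        scores = [knn_score(X, y, train_idx, X[i], best_k) for i in test_idx]
        fold_auc = auc([y[i] for i in test_idx], scores)
        if fold_auc is not None:
            outer_aucs.append(fold_auc)
    return outer_aucs


# Synthetic data: one numeric feature, with positives more likely at high values.
rng = random.Random(42)
X = [rng.uniform(0, 100) for _ in range(60)]
y = [1 if rng.random() < x / 100 else 0 for x in X]

outer_aucs = nested_cv(X, y)
print("median outer-fold AUC:", round(median(outer_aucs), 2))
```

The point of the nesting is that the hyperparameter is chosen without ever seeing the outer test fold, so the outer AUCs are not inflated by tuning. Reporting the median and spread of the outer-fold AUCs mirrors the median and IQR figures quoted in the results.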

3. Key Results

Model Performance

Algorithm Comparison: Ensemble models significantly outperformed Logistic Regression.
- Random Forest: Median AUC = 0.90 (IQR 0.14).
- EBM: Median AUC = 0.89 (IQR 0.13).
- CatBoost: Median AUC = 0.89 (IQR 0.15).
- Logistic Regression: Mean AUC = 0.85.
Feature Set Impact: The LCF_DS feature set (domain-knowledge-based local context factors) yielded the best performance (Median AUC 0.91). Adding patient-level features did not significantly improve performance, suggesting UCV is driven primarily by provider/site context.
Final Model: A CatBoost model trained on LCF_DS achieved an AUC of 0.89 on a held-out test set (26 samples), with a Precision of 0.85, Recall of 0.79, and F1-score of 0.81.
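
As a quick consistency check on the held-out metrics, the F1-score can be recomputed from the reported precision and recall. The sketch below assumes F1 is the plain harmonic mean of those two rounded values; the function name is illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision, reported_recall = 0.85, 0.79
f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # 0.819
```

The recomputed value of about 0.82 is close to the reported 0.81. The small gap could come from rounding of the inputs or from a different averaging scheme (for example, averaging per class); the summary does not say which.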

Weak Labels vs. Gold Standard

Models trained on weak labels (inferred from EHR) performed comparably to those trained on gold-standard (chart-reviewed) labels.
- LCF Set: Gold Standard AUC = 0.92 vs. Weak AUC = 0.84 (No statistically significant difference).
- ALLC Set: Gold Standard AUC = 0.86 vs. Weak AUC = 0.85.
Implication: Manual chart review may not be strictly necessary for training UCV detection models in similar use cases, reducing resource costs.
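
One way to picture the weak-labeling idea is sketched below. Because the cohort already excludes strep-positive cases and conditions that require antibiotics, the sketch assumes a weak label is simply "an antibiotic was prescribed at this encounter". The record fields, the toy encounters, and the agreement measure are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Encounter:
    encounter_id: str
    antibiotic_prescribed: bool                # from structured prescribing data
    chart_review_unwarranted: Optional[bool]   # gold label; None if not reviewed


def weak_label(enc):
    """Weak label: treat any antibiotic prescription in the cohort as unwarranted."""
    return int(enc.antibiotic_prescribed)


def label_agreement(encounters):
    """Share of chart-reviewed encounters where weak and gold labels match."""
    reviewed = [e for e in encounters if e.chart_review_unwarranted is not None]
    matches = sum(weak_label(e) == int(e.chart_review_unwarranted) for e in reviewed)
    return matches / len(reviewed)


# Toy encounters (illustrative only, not study data).
toy = [
    Encounter("e1", False, False),
    Encounter("e2", True, True),
    Encounter("e3", True, False),   # prescription judged justified on chart review
    Encounter("e4", False, False),
]
print(label_agreement(toy))  # 0.75
```

Encounters like `e3` are where weak labels are noisy: the structured record shows a prescription, but the notes contain a justification that only a reviewer would catch.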

Feature Importance & Drivers of UCV

The most influential predictors for unwarranted antibiotic prescribing were:

Case Volumes: Both provider-level and site-level case volumes were the top predictors.
- Counter-intuitive finding: Lower provider case volumes were associated with a reduced likelihood of inappropriate treatment. Conversely, higher volumes correlated with higher rates of unwarranted prescribing.
Provider Credentials: Nurse Practitioners (NPs) tended to prescribe inappropriately less often than MDs, but the difference was not statistically significant in pairwise tests.
Experience: Providers with <10 years of experience were less likely to prescribe inappropriately than late-career providers.
Socioeconomic Status: Patients from high-needs ADI areas were less likely to receive inappropriate treatment compared to low-needs (high SES) areas.

Clustering Analysis

Initial mixed-effects models showed high ICCs, but after adjusting for covariates (LCFs), ICCs dropped to near zero (approximately $10^{-33}$).
This suggests that, in this dataset, the identified contextual factors (volume, credentials, experience) account for the variation between sites and providers, leaving essentially no residual clustering.
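
For readers unfamiliar with the ICC in a logistic setting, the sketch below uses the common latent-variable definition for a two-level random-intercept logistic model, $\mathrm{ICC} = \sigma_u^2 / (\sigma_u^2 + \pi^2/3)$, where $\sigma_u^2$ is the random-intercept variance and $\pi^2/3$ is the variance of the standard logistic distribution. Whether the study used exactly this definition is an assumption; the summary does not state the formula, and the variance values below are illustrative.

```python
import math


def latent_icc(random_intercept_variance):
    """ICC on the latent scale for a two-level random-intercept logistic model."""
    residual_variance = math.pi ** 2 / 3  # variance of the standard logistic distribution
    return random_intercept_variance / (random_intercept_variance + residual_variance)


# Illustrative variances, not estimates from the study.
print(round(latent_icc(1.0), 3))   # 0.233
print(latent_icc(0.0))             # 0.0
```

Under this definition an ICC near zero means the estimated between-cluster variance is itself near zero once the covariates are in the model.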

4. Key Contributions

Absolute UCV Detection: Demonstrated the feasibility of using ML to detect absolute deviation from guidelines (binary classification of appropriate vs. inappropriate) rather than just relative variation between providers.
Explainable AI (XAI) in UCV: Successfully applied Explainable Boosting Machines (EBM) and SHAP values to make ensemble model predictions interpretable for clinicians, identifying specific drivers of variation.
Weak Labeling Strategy: Validated that EHR-inferred weak labels can serve as a scalable alternative to expensive manual chart review for training UCV models.
Ontology Standardization: Utilized the UCVA Ontology to map EHR features, facilitating potential cross-institutional comparisons.
Novel Insights: Identified that lower provider volume and less experience correlate with better adherence to guidelines in this specific context, challenging assumptions that higher volume/experience always equals better adherence.

5. Significance and Limitations

Significance:

Scalability: Offers a scalable, ML-based alternative to traditional statistical methods for identifying UCV without requiring centralized data analysis.
Clinical Feedback: Provides a mechanism to generate actionable feedback for providers based on specific contextual factors (e.g., "High-volume providers in your network show higher deviation rates").
Cost Reduction: The weak-label finding suggests significant cost savings in model development for quality improvement initiatives.

Limitations:

Sample Size: The dataset was small (132 encounters), limiting the use of deep learning and external validation.
Generalizability: UCV is highly context-dependent; a model trained on one academic health system may not generalize to others without retraining on local data.
Retrospective Bias: Missing documentation (e.g., phone calls) or unrecorded patient preferences could lead to misclassification of "warranted" treatments as "unwarranted."
Single Use Case: The study focused strictly on acute viral pharyngitis; applicability to more complex, multi-step clinical pathways remains to be seen.

Conclusion

The study suggests that classical machine learning algorithms, particularly ensemble methods like CatBoost and EBM, can detect unwarranted clinical variation using routine EHR data and contextual factors. By leveraging explainable models and weak labeling, healthcare systems can potentially identify and mitigate unnecessary antibiotic prescribing in pediatric populations with less manual effort than chart review requires, though the small single-site sample means the findings need external validation.