Precision risk assessment for pediatric hospitalization using address-level data in Cincinnati, Ohio

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to find the "sick spots" in a city to help children stay healthy. Usually, doctors and city planners look at neighborhoods like big, blurry blobs on a map. They might say, "This whole neighborhood has a lot of sick kids." But that's like trying to find a specific leak in a house by just looking at the whole roof—you know there's a problem, but you don't know exactly where to put the bucket.

This paper is about building a super-precise, address-by-address "health radar" for Cincinnati, Ohio. Instead of looking at neighborhoods, the researchers zoomed in all the way to individual house addresses to figure out which specific homes are most likely to send a child to the hospital.

Here is how they did it, broken down into simple parts:

1. The Ingredients: Mixing Data Like a Smoothie

The researchers took three different types of information and blended them together:

The Medical Record: They looked at 6 years of hospital visits for kids in Cincinnati.
The House Report Card: They grabbed data on every single house (like when it was built, how much it's worth, and if it has code violations like mold or broken stairs).
The Neighborhood Vibe: They added data about the street, like how much crime happens nearby and what the neighborhood looks like (poverty levels, education, etc.).

Think of this like making a smoothie. You take the "fruit" (hospital data), the "vegetables" (house conditions), and the "milk" (neighborhood stats) and blend them into one powerful drink that tells a story about risk.

2. The Engine: A Smart Computer Brain

They used a special type of computer brain called a Machine Learning Model (specifically, a "Generalized Random Forest").

The Analogy: Imagine you have a giant forest of decision trees. Each tree asks a question like, "Does this house have a mold violation?" or "Is there a lot of violent crime on this block?"
The computer asks thousands of these questions for every single address in the city. It then combines all the answers to give every house a "Risk Score."
A high score means, "Hey, this specific address is a hotspot for sick kids." A low score means, "This place is pretty safe."

3. The "Birth Adjustment": Fixing the Math

There was a tricky problem. Some houses are huge apartment buildings with 50 families, while others are small cottages with just one family. If the apartment building has 5 kids in the hospital, is that worse than the cottage having 5 kids?

The Cottage: 5 kids out of 1 family = Catastrophic.
The Apartment: 5 kids out of 50 families = Not that bad.

To fix this, the researchers created a "Birth-Adjusted" score. They subtracted the number of babies born at that address from the number of hospital visits. This levels the playing field, so the computer isn't just counting total hospital visits, but weighing them against how many babies were born at that address, which stands in for how many kids actually live there.

4. What Did They Find?

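For the six years it studied, the model told high-risk and low-risk addresses apart very well, scoring 0.98 to 0.99 out of 1 on a standard accuracy measure called ROC-AUC. It was less accurate at predicting the following year's hospitalizations (0.75), but it still beat simply counting past hospital visits (0.68).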

The Top Culprits: The biggest red flags for a high-risk address were housing code violations (like peeling paint or pests), violent crime nearby, and the value of the property (once the number of kids at an address was accounted for, lower value often meant higher risk).
The "Avondale" Example: They showed a map of a neighborhood called Avondale. Instead of coloring the whole neighborhood red, their model lit up specific red dots on specific streets and even specific buildings, showing exactly where the risk was highest.

5. Why Does This Matter? (The "So What?")

This isn't just a math game; it's a tool for saving lives and money.

For Doctors: Instead of guessing which families need help, a doctor could see a child's address and say, "Oh, this house has a high risk score because of mold and crime. Let's connect this family with a housing lawyer or a social worker immediately."
For City Planners: Instead of sending inspectors to random houses, the city can send them to the exact addresses the model flagged. It's like using a metal detector to find buried treasure instead of digging holes all over the beach.
Privacy: Because the score is attached to the address and not a specific child's name, it protects patient privacy while still helping the community.

The Catch (Limitations)

The researchers were honest about the flaws:

The "Complaint" Bias: Housing violations are often reported by neighbors. In poor neighborhoods or neighborhoods where most residents belong to minority groups, violations might be under-reported because people are afraid to call, or over-reported because of bias.
The "Moving" Problem: The model uses birth records to guess how many kids live somewhere. But kids move! A family might have a baby at one address and move to another a year later.
Fairness: They found the model was slightly less accurate for neighborhoods with fewer white residents. This is a warning sign that the data itself might be biased, and they need to be careful not to accidentally ignore the communities that need help the most.

The Bottom Line

This paper is like upgrading from a blurry, wide-angle lens to a high-definition microscope. It shows us that to fix child health problems, we can't just look at neighborhoods; we have to look at the specific houses, the specific streets, and the specific conditions that make a child sick. It's a roadmap for hitting the problem right where it hurts, rather than guessing.

1. Problem Statement

Persistent health disparities in pediatric populations result in disproportionately high hospitalization rates for children in low-income and minoritized communities. Traditional population health approaches rely on area-level data (e.g., ZIP codes, census tracts), which suffer from several limitations:

Ecological Fallacy: They assume homogeneity among individuals within a geographic unit, masking specific risks at the household level.
Spatial Granularity: Boundaries are often ill-defined or shift over time, and data updates are infrequent.
Intervention Scalability: Deploying interventions at the neighborhood scale is often resource-intensive and lacks precision.
Data Sparsity: While address-level data offers higher precision, hospitalization events are rare at the individual address level, creating sparse datasets that are difficult to model using traditional statistical methods.

The study aims to bridge this gap by developing a machine learning framework that integrates residence- and neighborhood-level socio-environmental data with population-wide healthcare data to generate address-level risk scores for pediatric hospitalizations.

2. Methodology

Study Setting and Data Sources

Location: Cincinnati, Ohio (Hamilton County), covering approximately 65,000 children.
Population: 77,077 residential addresses linked to 10,085 pediatric hospitalizations at Cincinnati Children's Hospital Medical Center (CCHMC) between July 1, 2016, and June 30, 2022.
Data Integration:
- Healthcare Data: Electronic Health Records (EHR) from CCHMC.
- Residence-Level Data (11 features): Parcel data from the Hamilton County Auditor and Cincinnati Department of Buildings & Inspections. Includes housing code violations, property type, year built, market value, and crime incidents (violent and non-violent) within a 200m radius.
- Neighborhood-Level Data (19 features): U.S. Census American Community Survey (2019 5-year estimates) and Eviction Lab data, linked via 2010 census tracts. Includes poverty rates, education levels, housing density, and eviction filings.
Address Matching: Utilized the addr package (NLP-based) to clean and match EHR addresses to parcel identifiers, achieving an 81.5% match rate.
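
The linkage described above amounts to two key lookups. An admission is tied to a parcel through its matched address, and the parcel is tied to neighborhood data through its census tract. The sketch below illustrates that idea in plain Python. It is not the study's pipeline (which used the addr package), and every field name (`tract`, `violations`, `year_built`, `poverty_rate`) and value is hypothetical.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
parcels = {
    "P1": {"tract": "T10", "violations": 3, "year_built": 1925},
    "P2": {"tract": "T10", "violations": 0, "year_built": 1988},
    "P3": {"tract": "T20", "violations": 1, "year_built": 1954},
}
tracts = {
    "T10": {"poverty_rate": 0.31},
    "T20": {"poverty_rate": 0.12},
}
# Each admission carries the parcel its address was matched to,
# or None when no parcel match was found.
admissions = ["P1", "P1", "P3", None, "P1"]


def build_address_table(parcels, tracts, admissions):
    """Return one row per parcel with residence, tract, and outcome fields."""
    counts = Counter(p for p in admissions if p is not None)
    table = {}
    for parcel_id, residence in parcels.items():
        row = dict(residence)                            # residence-level features
        row.update(tracts.get(residence["tract"], {}))   # neighborhood-level features
        row["hospitalizations"] = counts.get(parcel_id, 0)
        table[parcel_id] = row
    return table


def match_rate(admissions):
    """Share of admissions whose address matched a parcel."""
    return sum(p is not None for p in admissions) / len(admissions)


table = build_address_table(parcels, tracts, admissions)
print(table["P1"])
print(f"match rate: {match_rate(admissions):.1%}")
```

Every parcel gets a row even when it has no admissions, which is why most addresses end up with an outcome of zero and the resulting table is sparse.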

Model Development

Algorithms: Generalized Random Forest (GRF) models were employed to handle non-linear relationships, complex interactions, and missing data without imputation.
Outcomes Modeled:
1. Hospitalization Risk Model: Predicts the total number of hospitalizations per address.
2. Birth-Adjusted Hospitalization Risk Model: Normalizes risk by subtracting the number of births at an address from the number of hospitalizations. This accounts for the density of children residing at a location (e.g., distinguishing a single-family home with 5 admissions from a large apartment complex with 5 admissions).
Feature Engineering: 30 input features were harmonized. For addresses matching multiple parcels, median or mean values were calculated.
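
As a worked illustration of the birth adjustment summarized above, the sketch below computes hospitalizations minus births for two hypothetical addresses with the same admission count. It also shows a median being taken for an address that matches several parcels. The numbers and names are invented for illustration; this reflects only the outcome definition as described here, not the authors' code.

```python
from statistics import median


def birth_adjusted(hospitalizations, births):
    """Outcome as summarized above: hospitalizations minus births."""
    return hospitalizations - births


# Hypothetical addresses with the same number of admissions.
single_family = {"hospitalizations": 5, "births": 1}
apartment_complex = {"hospitalizations": 5, "births": 40}

print(birth_adjusted(**single_family))      # 4
print(birth_adjusted(**apartment_complex))  # -35

# An address matching several parcels: summarize a feature with the median.
market_values = [82_000, 95_000, 310_000]   # hypothetical parcel values
print(median(market_values))                # 95000
```

The same five admissions yield a high adjusted value for the single-family home and a strongly negative one for the large complex, which is the distinction the adjustment is meant to capture.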

Evaluation and Fairness

Performance Metrics: ROC-AUC, Precision-Recall AUC (PR-AUC), Sensitivity, Specificity, PPV, and NPV at varying diagnostic thresholds.
Fairness Assessment: Evaluated across four subgroups based on census block-level racial composition (proportion of White residents). Metrics included Equalized Odds (max difference in sensitivity/specificity) and Equal Opportunity (max difference in sensitivity).
Temporal Validation: The hospitalization risk model was validated against data from July 2022–June 2023 (one year post-training).
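
The threshold metrics and fairness gaps listed above can be sketched from a confusion matrix per subgroup. The example below is a minimal illustration with invented labels for two hypothetical subgroups. Taking equalized odds as the larger of the sensitivity gap and the specificity gap is an assumption based on the one-line description here, not a confirmed detail of the study.

```python
def confusion_rates(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV for binary labels (1 = high risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def fairness_gaps(groups):
    """groups maps a subgroup name to (y_true, y_pred)."""
    rates = [confusion_rates(y_true, y_pred) for y_true, y_pred in groups.values()]
    sens = [r["sensitivity"] for r in rates]
    spec = [r["specificity"] for r in rates]
    sens_gap = max(sens) - min(sens)
    spec_gap = max(spec) - min(spec)
    return {
        "equal_opportunity": sens_gap,               # max difference in sensitivity
        "equalized_odds": max(sens_gap, spec_gap),   # assumed: worst of the two gaps
    }


# Hypothetical labels for two subgroups (not study data).
groups = {
    "group_a": ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1]),
    "group_b": ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0]),
}
gaps = fairness_gaps(groups)
print(gaps)
```

In this toy data the model catches 3 of 4 true high-risk addresses in one subgroup but only 2 of 4 in the other, so the equal opportunity gap is 0.25.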

3. Key Contributions

Address-Level Precision: The study moves beyond census tract-level analysis to the parcel/address level, enabling "precision population health" where interventions can be targeted to specific properties rather than broad neighborhoods.
Birth-Adjusted Risk Scoring: Introduces a novel method to normalize hospitalization counts by birth records, isolating the risk of the environment from the density of the population.
Privacy-Preserving Granularity: The resulting ARCH (Address-level Risk for Child Hospitalization) scores allow for the sharing of granular health insights with policymakers and community partners without linking back to individual Protected Health Information (PHI).
Multiscale Data Integration: Demonstrates a scalable framework for integrating disparate data sources (EHR, municipal code enforcement, police crime data, census data) to model complex socio-environmental determinants of health.

4. Key Results

Model Performance

Hospitalization Risk Model:
- ROC-AUC: 0.99 (top 2.4% of addresses) and 0.98 (top 7.4%).
- PR-AUC: 0.65–0.72.
- Top Features: Housing code violations, violent crime, market total value, year built, and fraction of houses built before 1970.
Birth-Adjusted Model:
- ROC-AUC: 0.93 (top 2.4%) and 0.92 (top 3.5%).
- Agreement: Moderate agreement ($\kappa = 0.43$) with the raw hospitalization model, indicating that birth adjustment substantially alters which addresses are classified as high risk.
Temporal Robustness: The model maintained fair performance in temporal validation (ROC-AUC 0.75 for the following year), outperforming models based solely on historical hospitalization counts (ROC-AUC 0.68).
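
To make the agreement statistic above concrete, the sketch below computes Cohen's kappa, which compares the observed agreement between two classifications with the agreement expected by chance. The two label lists are hypothetical high-risk flags from a raw and a birth-adjusted model. They are not study data and do not reproduce the reported value.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Fraction of items on which the two classifications agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if labels were assigned independently
    # with each classification's own label frequencies.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical high-risk flags (1 = flagged) for ten addresses.
raw_flags = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
adjusted_flags = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

kappa = cohen_kappa(raw_flags, adjusted_flags)
print(f"kappa = {kappa:.2f}")
```

Here the two lists agree on 7 of 10 addresses, yet kappa is low because most of that agreement comes from the many unflagged addresses, which would match by chance anyway.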

Fairness Analysis

Disparities: The birth-adjusted model showed the largest disparity in sensitivity (0.14 difference) between the quartile with the highest proportion of White residents and the lowest.
Drivers of Bias: Differences in property type distribution (e.g., single-family homes vs. multi-unit apartments) across racial demographics and potential biases in complaint-driven housing code violation data contributed to these disparities.

Interpretability

Decision trees and partial dependence plots revealed that housing code violations and 40+ unit apartment buildings were primary split points for risk stratification.
Market total value showed a positive relationship with raw hospitalization risk but a negative relationship with birth-adjusted risk, suggesting that high-value properties may have lower risk per child when population density is accounted for.

5. Significance and Future Directions

Clinical and Policy Impact: ARCH scores can drive targeted interventions, such as:
- Referrals to medical-legal partnerships for housing issues.
- Prioritizing housing inspections for code enforcement.
- Resource allocation for community safety and tenant rights advocacy.
Scalability: The framework is adaptable to other municipalities with open data infrastructure (e.g., parcel data, crime logs).
Limitations:
- Generalizability: Currently limited to Cincinnati; requires data harmonization for broader application.
- Data Bias: Reliance on complaint-driven housing data, and on vital records (births) as a proxy for residency, may introduce bias, particularly for foster care populations or highly mobile families.
- Fairness: The observed sensitivity disparities highlight the need for debiasing techniques and post-calibration adjustments before deployment.

Conclusion: This study successfully demonstrates that integrating high-resolution address-level data with machine learning can identify pediatric health risks with unprecedented spatial precision. It offers a pathway to move from reactive, area-based public health strategies to proactive, precision-targeted interventions that address the root socio-environmental causes of pediatric hospitalization.