Imagine a massive hospital as a giant library. Inside this library, there are different teams: the Doctors (who know about patient health), the Fundraisers (who need to know about donors), and the Operations Team (who manage wait times).
The Problem: The "Glass Wall"
In the past, if the Fundraisers wanted to know something about patients to help their campaigns, they had to ask the Doctors for the raw files. But these files are like glass boxes containing everyone's private secrets (names, addresses, medical conditions). Laws like HIPAA say, "You cannot open these boxes and hand them over."
So, the teams hit a wall. They can't share data, and they can't make good decisions.
The Current "Fix": The Summary Report
To solve this, the hospital started creating Summary Reports (Aggregated Metrics). Instead of giving the Fundraisers a list of 10,000 patients, the Doctors give them a single number: "The average wait time in the ER is 20 minutes."
This is great! It's safe, right? Not always.
Imagine a summary report that says: "There is exactly one 85-year-old woman with a rare disease living in Zip Code 90210." Even though it's a "summary," it actually reveals exactly who that person is. This is called a Privacy Leak.
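To make the leak concrete, here is a tiny toy sketch (synthetic data, not from the paper): we count a small made-up patient table by (age, gender, zip code) and flag any group of size 1, because publishing the count for a lone match pinpoints a specific person.

```python
# Toy illustration: why a "summary" can still leak identity.
from collections import Counter

# Hypothetical synthetic records: (age, gender, zip_code)
patients = [
    (34, "F", "10001"),
    (34, "F", "10001"),
    (52, "M", "90210"),
    (52, "M", "90210"),
    (85, "F", "90210"),  # the only 85-year-old woman in 90210
]

group_sizes = Counter(patients)
risky_groups = [group for group, size in group_sizes.items() if size == 1]

for group in risky_groups:
    print(f"Group {group} has exactly 1 member -- its 'summary' re-identifies that person.")
```

Every group of two or more hides in the crowd; the group of one does not.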
The Solution: The "AI Safety Inspector"
This paper proposes a new, smart tool that acts as a Virtual Safety Inspector for these summary reports. Before a report is ever published, the AI checks the "recipe" (the SQL query) that produced it to see whether it is dangerous.
Here is how the AI works, using a simple analogy:
1. The Grammar Teacher (SQL Parser)
First, the AI reads the recipe like a grammar teacher reading a sentence. It breaks the sentence down into parts (Subject, Verb, Object) to understand the structure. It asks: "Are you trying to group people by their Zip Code? Are you counting them by Gender?"
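As a rough sketch of that "grammar reading" step (the paper presumably uses a real SQL parser; this regex stand-in only illustrates the idea), we can pull out the columns a query groups people by:

```python
# Simplified stand-in for a SQL parser: extract the GROUP BY columns.
import re

def extract_group_by_columns(sql: str) -> list[str]:
    """Return the columns a query groups people by (crude regex version)."""
    match = re.search(r"GROUP\s+BY\s+(.+?)(?:\s+HAVING|\s+ORDER|;|$)",
                      sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    return [col.strip() for col in match.group(1).split(",")]

query = "SELECT zip_code, gender, COUNT(*) FROM patients GROUP BY zip_code, gender;"
print(extract_group_by_columns(query))  # ['zip_code', 'gender']
```

Once the grouping columns are isolated, the later steps can ask whether that combination is dangerous.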
2. The Mind Reader (CodeBERT)
Sometimes, two recipes look different but mean the same dangerous thing.
- Recipe A: "Group by Zip Code."
- Recipe B: "Group by City + Street Name."
A simple rule-finder might miss Recipe B. But our AI has a Mind Reader (called CodeBERT). It understands the intent behind the code. It knows that "City + Street" is just as dangerous as "Zip Code" because both narrow down the group too much. It turns the recipe into a "fingerprint" to see if the idea is risky.
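Here is a deliberately crude stand-in for that fingerprinting idea: a bag-of-tokens vector compared with cosine similarity. The real CodeBERT model learns far richer semantic embeddings; this toy only illustrates "similar intent, similar fingerprint" — two grouping recipes end up closer to each other than to an unrelated averaging recipe.

```python
# Toy stand-in for CodeBERT: a bag-of-tokens "fingerprint" per query,
# compared with cosine similarity. Purely illustrative.
import math
import re
from collections import Counter

def fingerprint(sql: str) -> Counter:
    return Counter(re.findall(r"\w+", sql.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

recipe_a = "SELECT COUNT(*) FROM patients GROUP BY zip_code"
recipe_b = "SELECT COUNT(*) FROM patients GROUP BY city, street_name"
recipe_c = "SELECT AVG(wait_minutes) FROM er_visits"

# The two grouping recipes look more alike than the unrelated one.
print(cosine(fingerprint(recipe_a), fingerprint(recipe_b)) >
      cosine(fingerprint(recipe_a), fingerprint(recipe_c)))  # True
```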
3. The Checklist (Syntactic Features)
The AI also pulls out a physical checklist. It counts:
- How many tables are being mixed?
- Are there "sensitive" ingredients (like Birth Date or Medical Codes) in the mix?
- Is the group size too small?
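The checklist above can be sketched as a small feature extractor. The feature names and the sensitive-column list here are illustrative assumptions, not the paper's actual ones:

```python
# Hypothetical syntactic-feature checklist (names are illustrative).
import re

SENSITIVE_COLUMNS = {"birth_date", "diagnosis_code", "zip_code", "gender"}

def syntactic_features(sql: str) -> dict:
    tokens = set(re.findall(r"\w+", sql.lower()))
    joins = len(re.findall(r"\bjoin\b", sql, re.IGNORECASE))
    return {
        "num_tables_joined": joins + 1,               # how many tables are mixed
        "num_sensitive_columns": len(tokens & SENSITIVE_COLUMNS),
        "has_small_group_guard": "having" in tokens,  # e.g. HAVING COUNT(*) >= 5
    }

query = ("SELECT gender, diagnosis_code, COUNT(*) FROM patients "
         "JOIN visits ON patients.id = visits.patient_id "
         "GROUP BY gender, diagnosis_code")
print(syntactic_features(query))
```

This query mixes two tables, touches two sensitive columns, and has no minimum-group-size filter — exactly the kind of evidence the Judge weighs next.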
4. The Judge (XGBoost Classifier)
The AI combines the "Mind Reader's" understanding and the "Checklist" results and hands them to a Judge (an XGBoost model).
- The Judge looks at all the evidence and gives the recipe a Risk Score from 0 to 1.
- Score 0.2? Safe. "Go ahead, publish the report."
- Score 0.9? Dangerous! "Stop! This recipe will reveal private secrets."
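A trained XGBoost model would produce the actual risk score from the combined evidence; in this sketch, hand-picked weights (purely illustrative) stand in for the learned model so you can see the score-then-threshold flow:

```python
# Sketch of the "Judge": hand-picked weights stand in for a trained
# XGBoost model. The 0.5 threshold is also an illustrative assumption.
def risk_score(features: dict) -> float:
    score = 0.0
    score += 0.3 * min(features["num_sensitive_columns"], 3) / 3
    score += 0.3 * (features["num_group_by_columns"] >= 3)
    score += 0.4 * (not features["has_small_group_guard"])
    return round(score, 2)

def verdict(score: float, threshold: float = 0.5) -> str:
    return "BLOCK" if score >= threshold else "PUBLISH"

features = {"num_sensitive_columns": 2, "num_group_by_columns": 2,
            "has_small_group_guard": False}
score = risk_score(features)
print(score, verdict(score))  # 0.6 BLOCK
```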
5. The Translator (Explanation Engine)
If the Judge blocks a recipe, the system doesn't just say "No." It acts like a Translator and explains why in plain English:
"We blocked this because you are grouping by 'Gender' and 'Diagnosis Code.' This creates a group so small that someone could guess exactly who that patient is."
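Explanations like that can come from simple templates keyed to the top risk signals. This is a hypothetical sketch; the paper's actual explanation engine and wording will differ:

```python
# Hypothetical template-based explainer: maps risk signals to plain English.
def explain(features: dict) -> str:
    reasons = []
    if features.get("sensitive_group_by"):
        cols = " and ".join(f"'{c}'" for c in features["sensitive_group_by"])
        reasons.append(f"you are grouping by {cols}, which can create a group "
                       f"so small that someone could guess exactly who is in it")
    if not features.get("has_small_group_guard", False):
        reasons.append("the query has no minimum-group-size filter "
                       "(e.g. HAVING COUNT(*) >= 5)")
    return "We blocked this because " + "; and ".join(reasons) + "."

message = explain({"sensitive_group_by": ["Gender", "Diagnosis Code"],
                   "has_small_group_guard": False})
print(message)
```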
Why This Matters
- Old Way: You wait until someone publishes a bad report, then you panic and try to fix it. (Like waiting for a cake to burn before checking the oven).
- New Way: The AI checks the recipe before you even turn on the oven. It catches the mistake before anything can burn.
The Result
Now, the Fundraisers can get the insights they need (like "How many donors are in the cardiology department?") without ever seeing a single patient's name or medical record. The hospital stays safe, follows the law, and everyone can work together without fear.
In short: This paper builds a smart, automated guard that checks every data summary to make sure it doesn't accidentally spill anyone's secrets, allowing hospitals to share knowledge safely.