Imagine a massive hospital as a giant library. Inside this library, there are different teams: the Doctors (who know about patient health), the Fundraisers (who need to know about donors), and the Operations Team (who manage wait times).
The Problem: The "Glass Wall"
In the past, if the Fundraisers wanted to know something about patients to help their campaigns, they had to ask the Doctors for the raw files. But these files are like glass boxes containing everyone's private secrets (names, addresses, medical conditions). Laws like HIPAA say, "You cannot open these boxes and hand them over."
So, the teams hit a wall. They can't share data, and they can't make good decisions.
The Current "Fix": The Summary Report
To solve this, the hospital started creating Summary Reports (Aggregated Metrics). Instead of giving the Fundraisers a list of 10,000 patients, the Doctors give them a single number: "The average wait time in the ER is 20 minutes."
This is great! It's safe, right? Not always.
Imagine a summary report that says: "There is exactly one 85-year-old woman with a rare disease living in Zip Code 90210." Even though it's a "summary," it actually reveals exactly who that person is. This is called a Privacy Leak.
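To make the leak concrete, here is a tiny toy sketch (synthetic data, not from the paper): we count a small made-up patient table by (age, gender, zip code) and flag any group of size 1, because publishing the count for a lone match pinpoints a specific person.

```python
# Toy illustration: why a "summary" can still leak identity.
from collections import Counter

# Hypothetical synthetic records: (age, gender, zip_code)
patients = [
    (34, "F", "10001"),
    (34, "F", "10001"),
    (52, "M", "90210"),
    (52, "M", "90210"),
    (85, "F", "90210"),  # the only 85-year-old woman in 90210
]

group_sizes = Counter(patients)
risky_groups = [group for group, size in group_sizes.items() if size == 1]

for group in risky_groups:
    print(f"Group {group} has exactly 1 member -- its 'summary' re-identifies that person.")
```

Every group of two or more hides in the crowd; the group of one does not.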
The Solution: The "AI Safety Inspector"
This paper proposes a new, smart tool that acts as a Virtual Safety Inspector for these summary reports. Before a report is ever published, the AI checks the "recipe" (the SQL query) that produced it to see whether it is dangerous.
Here is how the AI works, using a simple analogy:
1. The Grammar Teacher (SQL Parser)
First, the AI reads the recipe like a grammar teacher reading a sentence. It breaks the sentence down into parts (Subject, Verb, Object) to understand the structure. It asks: "Are you trying to group people by their Zip Code? Are you counting them by Gender?"
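As a rough sketch of that "grammar reading" step (the paper presumably uses a real SQL parser; this regex stand-in only illustrates the idea), we can pull out the columns a query groups people by:

```python
# Simplified stand-in for a SQL parser: extract the GROUP BY columns.
import re

def extract_group_by_columns(sql: str) -> list[str]:
    """Return the columns a query groups people by (crude regex version)."""
    match = re.search(r"GROUP\s+BY\s+(.+?)(?:\s+HAVING|\s+ORDER|;|$)",
                      sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    return [col.strip() for col in match.group(1).split(",")]

query = "SELECT zip_code, gender, COUNT(*) FROM patients GROUP BY zip_code, gender;"
print(extract_group_by_columns(query))  # ['zip_code', 'gender']
```

Once the grouping columns are isolated, the later steps can ask whether that combination is dangerous.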
2. The Mind Reader (CodeBERT)
Sometimes, two recipes look different but mean the same dangerous thing.
- Recipe A: "Group by Zip Code."
- Recipe B: "Group by City + Street Name."
A simple rule-finder might miss Recipe B. But our AI has a Mind Reader (called CodeBERT). It understands the intent behind the code. It knows that "City + Street" is just as dangerous as "Zip Code" because both narrow down the group too much. It turns the recipe into a "fingerprint" to see if the idea is risky.
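Here is a deliberately crude stand-in for that fingerprinting idea: a bag-of-tokens vector compared with cosine similarity. The real CodeBERT model learns far richer semantic embeddings; this toy only illustrates "similar intent, similar fingerprint" — two grouping recipes end up closer to each other than to an unrelated averaging recipe.

```python
# Toy stand-in for CodeBERT: a bag-of-tokens "fingerprint" per query,
# compared with cosine similarity. Purely illustrative.
import math
import re
from collections import Counter

def fingerprint(sql: str) -> Counter:
    return Counter(re.findall(r"\w+", sql.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

recipe_a = "SELECT COUNT(*) FROM patients GROUP BY zip_code"
recipe_b = "SELECT COUNT(*) FROM patients GROUP BY city, street_name"
recipe_c = "SELECT AVG(wait_minutes) FROM er_visits"

# The two grouping recipes look more alike than the unrelated one.
print(cosine(fingerprint(recipe_a), fingerprint(recipe_b)) >
      cosine(fingerprint(recipe_a), fingerprint(recipe_c)))  # True
```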
3. The Checklist (Syntactic Features)
The AI also pulls out a physical checklist. It counts:
- How many tables are being mixed?
- Are there "sensitive" ingredients (like Birth Date or Medical Codes) in the mix?
- Is the group size too small?
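The checklist above can be sketched as a small feature extractor. The feature names and the sensitive-column list here are illustrative assumptions, not the paper's actual ones:

```python
# Hypothetical syntactic-feature checklist (names are illustrative).
import re

SENSITIVE_COLUMNS = {"birth_date", "diagnosis_code", "zip_code", "gender"}

def syntactic_features(sql: str) -> dict:
    tokens = set(re.findall(r"\w+", sql.lower()))
    joins = len(re.findall(r"\bjoin\b", sql, re.IGNORECASE))
    return {
        "num_tables_joined": joins + 1,               # how many tables are mixed
        "num_sensitive_columns": len(tokens & SENSITIVE_COLUMNS),
        "has_small_group_guard": "having" in tokens,  # e.g. HAVING COUNT(*) >= 5
    }

query = ("SELECT gender, diagnosis_code, COUNT(*) FROM patients "
         "JOIN visits ON patients.id = visits.patient_id "
         "GROUP BY gender, diagnosis_code")
print(syntactic_features(query))
```

This query mixes two tables, touches two sensitive columns, and has no minimum-group-size filter — exactly the kind of evidence the Judge weighs next.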
4. The Judge (XGBoost Classifier)
The AI combines the "Mind Reader's" understanding and the "Checklist" results and hands them to a Judge (an XGBoost model).
- The Judge looks at all the evidence and gives the recipe a Risk Score from 0 to 1.
- Score 0.2? Safe. "Go ahead, publish the report."
- Score 0.9? Dangerous! "Stop! This recipe will reveal private secrets."
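A trained XGBoost model would produce the actual risk score from the combined evidence; in this sketch, hand-picked weights (purely illustrative) stand in for the learned model so you can see the score-then-threshold flow:

```python
# Sketch of the "Judge": hand-picked weights stand in for a trained
# XGBoost model. The 0.5 threshold is also an illustrative assumption.
def risk_score(features: dict) -> float:
    score = 0.0
    score += 0.3 * min(features["num_sensitive_columns"], 3) / 3
    score += 0.3 * (features["num_group_by_columns"] >= 3)
    score += 0.4 * (not features["has_small_group_guard"])
    return round(score, 2)

def verdict(score: float, threshold: float = 0.5) -> str:
    return "BLOCK" if score >= threshold else "PUBLISH"

features = {"num_sensitive_columns": 2, "num_group_by_columns": 2,
            "has_small_group_guard": False}
score = risk_score(features)
print(score, verdict(score))  # 0.6 BLOCK
```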
5. The Translator (Explanation Engine)
If the Judge blocks a recipe, the system doesn't just say "No." It acts like a Translator and explains why in plain English:
"We blocked this because you are grouping by 'Gender' and 'Diagnosis Code.' This creates a group so small that someone could guess exactly who that patient is."
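Explanations like that can come from simple templates keyed to the top risk signals. This is a hypothetical sketch; the paper's actual explanation engine and wording will differ:

```python
# Hypothetical template-based explainer: maps risk signals to plain English.
def explain(features: dict) -> str:
    reasons = []
    if features.get("sensitive_group_by"):
        cols = " and ".join(f"'{c}'" for c in features["sensitive_group_by"])
        reasons.append(f"you are grouping by {cols}, which can create a group "
                       f"so small that someone could guess exactly who is in it")
    if not features.get("has_small_group_guard", False):
        reasons.append("the query has no minimum-group-size filter "
                       "(e.g. HAVING COUNT(*) >= 5)")
    return "We blocked this because " + "; and ".join(reasons) + "."

message = explain({"sensitive_group_by": ["Gender", "Diagnosis Code"],
                   "has_small_group_guard": False})
print(message)
```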
Why This Matters
- Old Way: You wait until someone publishes a bad report, then you panic and try to fix it. (Like waiting for a cake to burn before checking the oven).
- New Way: The AI checks the recipe before you even turn on the oven. It catches the mistake before anything can burn.
The Result
Now, the Fundraisers can get the insights they need (like "How many donors are in the cardiology department?") without ever seeing a single patient's name or medical record. The hospital stays safe, follows the law, and everyone can work together without fear.
In short: This paper builds a smart, automated guard that checks every data summary to make sure it doesn't accidentally spill anyone's secrets, allowing hospitals to share knowledge safely.