Corpus for Benchmarking Clinical Speech De-identification

To address the scarcity of publicly available resources for clinical speech de-identification, this paper introduces the SREDH-AICup corpus, a time-aligned dataset comprising 20 hours of English and Mandarin audio annotated with 7,830 sensitive health information (SHI) entities across 38 categories, supporting research on automated privacy protection.

Dai, H.-J., Fang, L.-C., Mir, T. H., Chen, C.-T., Feng, H.-H., Lai, J.-R., Hsu, H.-C., Nandy, P., Panchal, O., Liao, W.-H., Tien, Y.-Z., Chen, P.-Z., Lin, Y.-R., Jonnagaddala, J.

Published 2026-04-03

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a giant, noisy library filled with millions of recorded doctor's visits. These recordings are goldmines for researchers trying to teach computers how to understand human speech. But there's a huge problem: these recordings are full of private secrets—patients' names, addresses, social security numbers, and specific dates of illness. If you just hand these recordings to a computer, it might accidentally learn those secrets and leak them later.

To fix this, we need a "privacy filter" that can listen to the audio, find the secrets in real-time, and silence them before anyone else hears them. But to build a good filter, you need a training manual. This paper is about creating that manual.
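To make the "privacy filter" idea concrete, here is a minimal sketch of the muting step: given time-aligned secret spans, silence those samples in an audio signal. The sample rate, the span format, and the function name are illustrative assumptions, not anything specified by the paper.

```python
# Sketch of the muting step of a "privacy filter": zero out the audio
# samples that fall inside each annotated secret span.
# SAMPLE_RATE and the (start_sec, end_sec) span format are assumptions.

SAMPLE_RATE = 16_000  # samples per second (assumed)

def mute_spans(samples, spans, sample_rate=SAMPLE_RATE):
    """Return a copy of `samples` with each (start_sec, end_sec) span silenced."""
    out = list(samples)
    for start_sec, end_sec in spans:
        lo = max(0, int(start_sec * sample_rate))
        hi = min(len(out), int(end_sec * sample_rate))
        for i in range(lo, hi):
            out[i] = 0  # replace the secret's samples with silence
    return out

# Tiny demo: one second of fake audio, mute the span from 0.25s to 0.50s
audio = [1] * SAMPLE_RATE
clean = mute_spans(audio, [(0.25, 0.50)])
```

Real systems would operate on waveform arrays (e.g. NumPy buffers) and might insert a beep instead of silence, but the core operation is this sample-index arithmetic.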

Here is the story of how the authors built the SREDH-AICup SHI speech corpus, explained simply:

1. The Missing Piece of the Puzzle

For years, researchers had two types of tools, but neither was perfect:

  • Text Books: They had thousands of written medical notes where sensitive words were already crossed out and replaced with fake names (like "Patient A"). But these were just text, not audio.
  • Audio Books: They had thousands of hours of people speaking, but these were mostly general conversations (like people ordering coffee or reading news) or medical talks that didn't have the "secret-crossing-out" labels attached to the specific moments in the audio.

The Gap: They needed a dataset that was both audio and had a precise map showing exactly when a secret word was spoken. It's like having a movie where the subtitles don't just say "Secret Name," but tell you exactly which second the actor said it, so you can mute just that split-second.
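A time-aligned annotation like the one described above can be pictured as a small record tying a category and text span to start and end times. The field names and the example category below are a hypothetical schema for illustration, not the corpus's actual file format.

```python
from dataclasses import dataclass

# Hypothetical schema for one time-aligned SHI annotation.
# Field names and the category string are illustrative assumptions.

@dataclass
class SHIAnnotation:
    category: str    # e.g. "PATIENT_NAME", one of the 38 SHI types
    text: str        # the sensitive token(s) as spoken
    start_sec: float # when the word begins in the recording
    end_sec: float   # when it ends

ann = SHIAnnotation("PATIENT_NAME", "John Smith",
                    start_sec=12.340, end_sec=13.125)
duration = ann.end_sec - ann.start_sec  # how long to mute
```

The key point is that each record carries its own timestamps, so a filter can mute exactly that slice of audio and nothing else.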

2. Building the "Privacy Gym"

The authors decided to build a "gym" where computers could practice their privacy skills. They combined three different ingredients to make a special recipe:

  • Ingredient A (The Script): They took existing written medical notes (from a dataset called OpenDeID) where the secrets were already marked. They hired actors to read these scripts aloud naturally.
  • Ingredient B (The Real Talk): They used recordings from a psychiatric dataset (DAMT) where doctors and patients were already talking.
  • Ingredient C (The Drama): They grabbed clips from Taiwanese medical TV dramas. Why? Because real-life medical dramas often have doctors discussing patients, and they wanted to add some Mandarin Chinese to the mix to make it bilingual.

3. The "Millisecond" Hunt

Once they had the audio, they needed to label it. This wasn't a quick job.

  • The Team: A group of trained annotators listened to the audio.
  • The Task: They had to find 38 different types of secrets (like "Patient Name," "Hospital," "Phone Number," "Date of Birth").
  • The Precision: They didn't just say, "The name was spoken." They had to mark the exact start and end time of the word down to the millisecond.
  • The Analogy: Imagine a game of "Whac-A-Mole," but instead of hitting moles, they are hitting specific words in a conversation. If the mole (the secret word) pops up for 0.5 seconds, they have to hit it exactly within that 0.5 seconds. If they are off by even a tiny bit, the computer won't learn correctly.

They practiced this "hunting" game over and over (12 rounds!) until the team agreed on the timing 90% of the time. This ensured the map they created was incredibly accurate.
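One plausible way to score timing agreement between two annotators is to count a pair of spans as a match when their temporal overlap is large relative to their union (intersection-over-union). The metric and the 0.9 threshold below are assumptions for illustration; the paper's exact agreement measure may differ.

```python
# Assumed agreement metric: two spans "agree" when their temporal IoU
# (intersection-over-union) clears a threshold.

def iou(a, b):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def agreement(spans_a, spans_b, threshold=0.9):
    """Fraction of annotator A's spans that annotator B matched closely."""
    matched = sum(
        any(iou(a, b) >= threshold for b in spans_b) for a in spans_a
    )
    return matched / len(spans_a) if spans_a else 1.0

ann_a = [(1.00, 1.50), (4.20, 4.80), (9.00, 9.40)]
ann_b = [(1.02, 1.50), (4.20, 4.79), (12.0, 12.5)]
score = agreement(ann_a, ann_b)  # two of A's three spans are matched
```

Repeating rounds of annotation until a score like this stabilizes is the "practice until 90% agreement" loop described above.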

4. The Result: A 20-Hour Training Camp

The final product is a dataset of 20 hours of audio.

  • The Breakdown: It's split into three piles: one for learning (training), one for checking progress (validation), and one for the final exam (testing).
  • The Content: It contains about 7,830 "secrets" hidden inside the speech.
  • The Languages: It's mostly English (about 19 hours), with a small but valuable slice of Mandarin Chinese (about 1 hour).
  • The Reality Check: The data looks like real life. Some secrets (like "Date") appear constantly, while others (like "Passport Number") are rare. This "long tail" distribution is actually good because it mimics the messy reality of real hospitals, where some info is everywhere and some is rare.
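The "long tail" shape is easy to see by tallying category labels. The labels and counts below are invented for illustration (they are not the corpus's real frequencies); the point is the skew: a few head categories dominate while many appear only a handful of times.

```python
from collections import Counter

# Invented labels illustrating a long-tailed category distribution
# (counts are NOT real corpus statistics).
labels = (["DATE"] * 120 + ["PATIENT_NAME"] * 45 + ["HOSPITAL"] * 30
          + ["PHONE"] * 6 + ["PASSPORT_NUMBER"] * 1)

counts = Counter(labels)
ranked = counts.most_common()  # head categories first, rare ones last
```

A model trained on such data sees "DATE" constantly but must still learn to catch the one-off "PASSPORT_NUMBER", which is exactly the challenge real hospital audio poses.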

5. Why This Matters

Think of this dataset as a flight simulator for privacy.
Before this, if you wanted to build a system that protects patient privacy in real-time (like a doctor speaking into a microphone while a computer transcribes it), you were flying blind: there was no simulator to test against.

Now, researchers can plug their new "privacy filters" into this dataset. They can see:

  • Does the computer catch the name?
  • Does it catch the phone number?
  • Does it do it fast enough to stop the secret from leaking?
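The first two questions can be scored with span-level precision and recall: compare predicted secret spans against the gold annotations, counting a prediction correct when it covers enough of a gold span. The overlap rule and threshold below are an assumed metric, not the paper's official evaluation protocol.

```python
# Assumed evaluation: a predicted span "hits" a gold span when it covers
# at least half of the gold span's duration.

def overlaps(pred, gold, min_overlap=0.5):
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    return inter >= min_overlap * (gold[1] - gold[0])

def precision_recall(predicted, gold, min_overlap=0.5):
    tp = sum(any(overlaps(p, g, min_overlap) for g in gold) for p in predicted)
    hit = sum(any(overlaps(p, g, min_overlap) for p in predicted) for g in gold)
    precision = tp / len(predicted) if predicted else 1.0
    recall = hit / len(gold) if gold else 1.0
    return precision, recall

gold = [(1.0, 1.5), (4.0, 4.6)]   # annotated secret spans (seconds)
pred = [(1.0, 1.4), (7.0, 7.2)]   # what a hypothetical filter flagged
p, r = precision_recall(pred, gold)
```

The third question (speed) is measured separately, as latency between when the secret is spoken and when the filter reacts.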

The Bottom Line

This paper is a gift to the scientific community. It provides the first high-quality, time-aligned "training ground" for teaching computers how to listen to medical conversations and instantly scrub out private information. It paves the way for a future where we can use AI to help doctors without ever worrying that the AI will accidentally reveal a patient's identity.

In short: They built a realistic, bilingual, high-precision "privacy obstacle course" so that future AI systems can learn to protect our most sensitive health secrets.
