Imagine you are wearing a pair of high-tech hearing aids. These devices are amazing at making the world louder, but they have a funny quirk: your own voice sounds weirdly loud and booming, like you're shouting inside a metal bucket. This is because your own voice reaches the device partly through your skull and tissues, and from just centimetres away, so it sounds very different from speech arriving through the air from across the room.
To fix this, hearing aid manufacturers want a "magic switch" that knows the difference between you talking and someone else talking. If it's you, the device turns down the volume for comfort. If it's someone else, it keeps the volume up so you can hear them.
The problem? Most current "magic switches" need two or more microphones (like a stereo system) or extra sensors to work. This makes the hearing aid bulky, expensive, and hard to fit in small ears.
This paper presents a clever new solution: a "One-Microphone Magic Switch" that learns by playing video games.
Here is how they did it, broken down into simple concepts:
1. The Problem: The "Real World" is Too Hard to Measure
To teach a computer to tell the difference between your voice and a stranger's, you usually need to record thousands of people talking in thousands of different rooms, with thousands of different head shapes. It's like trying to take a photo of every possible angle of a mountain in every weather condition. It's too expensive, too slow, and physically impossible to do perfectly.
2. The Solution: The "Virtual Reality" Training Camp
Instead of recording real people, the researchers built a virtual reality simulator for sound.
- The Analogy: Imagine you are training a dog to catch a ball. Instead of throwing a real ball in a real park (which is unpredictable), you first train it in a video game where the physics are perfect. Once the dog gets really good at the game, you take it to the real park.
- The Method: They created a computer model of a human head (starting as a simple ball, then a detailed head, then a head with a torso). They simulated sound waves hitting this virtual head from every angle.
- Own Voice: They simulated sound coming from a "mouth" vibrating on the head itself.
- Other Voice: They simulated sound coming from a "speaker" floating in the air.
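To make the "mouth on the head" versus "speaker in the air" idea concrete, here is a toy Python sketch. It is not the paper's acoustic solver; it only shows why distance alone already changes what a single microphone hears: a point source in free air arrives with amplitude roughly proportional to 1/r and a delay of r/c.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def point_source(r_m):
    """Toy free-field model: relative amplitude ~ 1/r, delay = r / c (in ms)."""
    return 1.0 / r_m, 1000.0 * r_m / C

# "Own voice": the mouth is only ~10 cm from the hearing-aid microphone.
amp_own, delay_own = point_source(0.10)
# "Other voice": an external talker ~1.5 m away.
amp_ext, delay_ext = point_source(1.50)

# The nearby source is far louder and arrives far sooner -- one of the
# physical cues the simulated training data bakes into the classifier.
```

The real simulation adds the head and torso on top of this, but even the bare distance term gives the two source types very different signatures.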
3. The "Video Game" Levels (Progressive Learning)
The AI didn't just learn from one simple model. The researchers used a hierarchical training strategy, like leveling up in a video game:
- Level 1 (The Simple Ball): They started with a basic, rigid sphere. This taught the AI the basic rules of how sound bounces off a round object.
- Level 2 (The Human Head): They upgraded to a detailed 3D model of a human head. Now the AI learned how ears and nose shape the sound.
- Level 3 (Head & Torso): Finally, they added the shoulders and chest. This is the "hard mode" that mimics real life, where your shoulders block and reflect sound in complex ways.
By starting simple and getting harder, the AI learned the "physics of sound" without needing a million real-world recordings.
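This "leveling up" is a form of curriculum learning. The sketch below is a heavily simplified stand-in, using made-up toy data and a tiny logistic-regression "brain" instead of the paper's acoustic simulator and Transformer, just to show the mechanics: the same weights are reused as the training data gets harder.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_stage(stage, n=200):
    """Toy stand-in for the acoustic simulation. Label 1 = own voice.
    Higher stages add richer own-voice cues across more feature dims."""
    labels = rng.integers(0, 2, n)
    feats = rng.normal(0.0, 1.0, (n, 8))
    feats[:, 0] += 2.0 * labels                     # crude own-voice offset
    feats[:, :stage + 1] += 0.3 * labels[:, None]   # extra cues per stage
    return feats, labels

def train_logistic(X, y, w=None, lr=0.1, epochs=200):
    """Minimal logistic-regression trainer (stand-in for the Transformer)."""
    if w is None:
        w = np.zeros(X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append bias column
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

# Curriculum: sphere -> head -> head + torso, carrying the weights forward.
w = None
for stage in (0, 1, 2):
    X, y = simulate_stage(stage)
    w = train_logistic(X, y, w)

# Evaluate on fresh samples from the hardest stage.
X_test, y_test = simulate_stage(2, n=500)
Xb = np.hstack([X_test, np.ones((len(X_test), 1))])
acc = np.mean((Xb @ w > 0) == y_test)
```

The key design choice mirrored here is that each level starts from the previous level's weights, so the "easy physics" learned on the sphere is refined rather than relearned.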
4. The "Magic Brain" (The AI Classifier)
They used a type of AI called a Transformer (the same family of technology behind modern chatbots). This AI looked at the sound waves and asked: "Does this sound pattern look like it came from inside the head (me) or from outside the head (someone else)?"
Because they trained it on their "Virtual Reality" data, the AI learned to spot the subtle "fingerprint" of your own voice, which sounds different because it travels through your bones and tissues, not just through the air.
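One concrete piece of that "fingerprint" is spectral: body-conducted sound tends to carry extra low-frequency energy. The toy sketch below is our illustration, not the paper's actual features; it fakes an "own voice" and an "external voice" signal and shows that even a simple low/high band energy ratio separates them.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 16000                               # sample rate in Hz
t = np.arange(fs) / fs                   # one second of time stamps

def band_ratio(x, fs, split_hz=1000):
    """Energy below split_hz divided by energy above it."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return spec[freqs < split_hz].sum() / spec[freqs >= split_hz].sum()

# "Own voice": strong low-frequency component (body conduction).
own = np.sin(2 * np.pi * 200 * t) + 0.2 * rng.normal(size=fs)
# "External voice": flatter, noisier spectrum arriving through the air.
ext = 0.3 * np.sin(2 * np.pi * 200 * t) + 1.0 * rng.normal(size=fs)

r_own, r_ext = band_ratio(own, fs), band_ratio(ext, fs)
# A single threshold on this ratio already acts as a crude "magic switch";
# the Transformer learns many such cues jointly from the simulated data.
```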
5. The Real-World Test
After training in the "video game," they tested the AI on real recordings from actual hearing aid prototypes.
- The Result: Even though the AI had never "heard" a real human voice before, it got 80% accuracy on real-world data.
- The Secret Sauce: To bridge the gap between the "video game" and reality, they used a tiny trick called feature compensation. Think of it like putting on a pair of glasses that corrects the color distortion between the virtual world and the real world. This allowed the AI to work without needing to be retrained on real data.
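A minimal sketch of the "glasses" idea, assuming a simple per-feature mean-and-variance compensation (the paper's exact compensation scheme may differ): shift and scale the real-world features so their statistics match the simulated ones the AI was trained on.

```python
import numpy as np

rng = np.random.default_rng(2)

# Features the classifier saw in training (simulation)...
sim = rng.normal(0.0, 1.0, (1000, 8))
# ...and shifted, rescaled features from real recordings (the "gap").
real = rng.normal(0.5, 1.7, (1000, 8))

def compensate(feats, ref_mean, ref_std):
    """Shift and scale feats so each feature matches the reference stats."""
    return (feats - feats.mean(0)) / feats.std(0) * ref_std + ref_mean

# "Put the glasses on": real-world inputs now look statistically like
# the simulated training data, so no retraining is needed.
aligned = compensate(real, sim.mean(0), sim.std(0))
```

The appeal of this kind of fix is that it touches only the input features, leaving the trained classifier untouched.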
Why This Matters
- Cheaper & Smaller: You only need one microphone. This means hearing aids can be smaller, cheaper, and easier to wear.
- Better Comfort: The device can instantly know when you are talking and lower the volume for you, so you don't feel like you're shouting, while still hearing others clearly.
- Scalable: Because they used simulations, they can easily test thousands of different head shapes and sizes without needing a single human volunteer.
In a nutshell: The researchers taught a computer to recognize your voice by letting it "dream" about sound waves in a virtual world, starting with simple shapes and graduating to complex human anatomy. This allowed them to build a smart, single-microphone hearing aid that knows the difference between "you" and "them."