VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

Imagine you have a friend you've known for 15 years. If you asked a computer to recognize their voice today, it would be easy. But if you asked that same computer to recognize them using a recording from 15 years ago, it might get confused. Why? Because people's voices change as they age, just like their faces do. Voices can get deeper, raspier, or softer.

Most computer systems today are trained on "snapshots": they hear a voice once and assume it never changes. This paper introduces a solution to that problem: a massive new library of voices called VoxKnesset.

Here is the breakdown of what the researchers did, using some everyday analogies.

1. The Problem: The "Time Travel" Gap

Think of current voice AI (like Siri or Alexa) as a photographer who only takes one photo of you and then tries to recognize you in a year. If you've gained weight, grown a beard, or just aged, the photo doesn't match the reality.

For years, scientists didn't have enough data to teach computers how voices change over time. They had:

High-quality photos: But only one per person (like a yearbook).
Long-term videos: But they were blurry, small, or didn't have accurate names/ages attached.

They needed a "Time-Lapse Video" of human voices that was also accurate and huge.

2. The Solution: The "Parliamentary Time Machine"

The researchers found the perfect place to get this data: The Israeli Knesset (Parliament).

Why there?

The "Stage": Members of Parliament (MPs) speak in the same room, with similar microphones, for decades. It's a controlled environment.
The "Cast": There are hundreds of MPs. Some serve for 15+ years.
The "Script": The government keeps perfect, verified records of who spoke, when, and exactly what they said. No guessing ages from blurry photos; the records are official.

They built VoxKnesset, a dataset containing about 2,300 hours of Hebrew speech from 393 different people, recorded over 16 years (2009–2025). It's like having a time-lapse camera on 393 people at once: individual speakers are followed for up to 15 years, and together they cover ages from the 30s to the 80s.

3. The Experiment: Testing the "Aging" AI

The team took this new library and tested the world's best voice AI models to see how they handled aging. They asked three big questions:

A. Can the AI tell how old someone is?

The Result: Yes, but with a catch.
The Analogy: Imagine guessing someone's age from a single photo. You might be right. But guessing how much they aged between two photos taken 10 years apart is a different, harder task.
The Finding: The AI was good at guessing a person's age from a single snapshot (cross-sectional). But when asked to track change over time, it failed: its estimates of how much someone had aged stayed at about 1–2 years no matter how much time had actually passed. It could tell that "Person A is old" but not that "Person A got older."

B. Does aging break voice security?

The Result: Yes, significantly.
The Analogy: Imagine your voice is a key to your house. If you lose weight or get a cold, the key might still fit. But if you age 15 years, the key might not fit at all.
The Finding: The researchers tested "Speaker Verification" (using voice as a password). Over a 15-year gap, the error rate more than doubled: for the best model it rose from about 2% to about 4.6%. This is a serious problem for banks or security systems that rely on voice ID.

C. Can we fix it?

The Result: Yes, if we train the AI differently.
The Analogy: Instead of teaching the AI to recognize a "static photo," we taught it to recognize a "movie."
The Finding: When they trained a model specifically on pairs of recordings from the same person (e.g., "This is the same guy, 5 years apart"), the AI learned to track the aging process. It could finally see the "temporal signal," that is, the subtle changes that happen as a person grows older.

4. Why This Matters

This isn't just about Hebrew speakers or politicians. It's about the future of how we interact with machines.

Security: If your voice is your password, we need to know how to update that password as you age, so you don't get locked out of your bank account in 10 years.
Health: Doctors might use voice analysis to detect diseases like Parkinson's or Alzheimer's. To do that, they need to know what "normal aging" sounds like so they don't mistake a natural change for a disease.
Fairness: Many voice AI systems are trained mostly on younger voices. This dataset helps make AI work better for older adults, who are often left out of the tech world.

The Bottom Line

The authors released this massive dataset (VoxKnesset) to the public. They are essentially handing the world a "Time-Lapse Video" of human voices to help engineers build smarter, more human-like AI that understands that we all change with time.

Here is a detailed technical summary of the paper "VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling."

1. Problem Statement

Speech processing systems face a critical vulnerability: vocal aging. The human voice undergoes continuous physiological changes due to the aging of vocal folds and the vocal tract, altering acoustic and prosodic patterns over time. This "slow, inevitable drift" degrades the reliability of speaker verification systems and challenges models designed for automatic age estimation.

Despite the recognized importance of this issue, existing datasets suffer from three limitations:

Lack of Longitudinal Depth: Most high-quality datasets (e.g., TIMIT) are cross-sectional, capturing speakers only once.
Scale vs. Resolution: Large-scale "in-the-wild" datasets (e.g., VoxCeleb2) lack verified ground-truth demographic labels or sufficient temporal resolution.
Label Noise: Recent longitudinal efforts (e.g., VoxAging) rely on estimated labels (e.g., facial recognition) rather than verified metadata, introducing noise that confounds benchmarking.

No existing resource simultaneously offers dense, repeated recordings of the same speakers over many years, large scale, and verified ground-truth demographic labels. The gap is especially pronounced for Hebrew, a morphologically rich but under-resourced language.

2. Methodology: The VoxKnesset Dataset

The authors introduce VoxKnesset, an open-access dataset derived from 16 years (2009–2025) of Israeli Knesset (parliamentary) plenary recordings.

Data Curation Pipeline

Source: Official audiovisual recordings and timestamped protocols from ~1,550 parliamentary sessions (totaling ~8,825 hours).
Alignment & Processing:
- Audio extracted at 16 kHz mono.
- Timestamps corrected for drift/jumps.
- Forced Alignment: Performed using Whisper and a Hebrew-adapted variant (Stable-Whisper) to align audio with word-level transcripts.
- Quality Control: Segments filtered by alignment confidence scores and minimum duration (30s).
Speaker Attribution: Leveraged the Knesset Corpus to match transcripts with verified demographic metadata (birth year, gender, country of origin, religion) from official parliamentary records.
Final Subset:
- 2,307 hours of speech.
- 393 unique speakers (Members of Knesset).
- Longitudinal Span: Up to 15 years between recordings for the same individual (median span: 3.4 years).
- Demographics: Verified labels for age, gender, religion, and birthplace.
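
The quality-control step in the pipeline above can be pictured as a simple filter over aligned segments. The sketch below is illustrative only and is not the authors' code: the `Segment` fields, the `min_confidence` value of 0.8, and the sample data are assumptions. Only the 30-second minimum duration comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned speech segment (hypothetical schema, for illustration)."""
    speaker_id: str
    start_s: float      # segment start, in seconds
    end_s: float        # segment end, in seconds
    confidence: float   # alignment confidence in [0, 1] (assumed scale)

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def filter_segments(segments, min_duration_s=30.0, min_confidence=0.8):
    """Keep segments that are long enough and confidently aligned.

    min_duration_s mirrors the 30 s minimum mentioned in the text;
    min_confidence is an assumed placeholder threshold.
    """
    return [
        seg for seg in segments
        if seg.duration_s >= min_duration_s and seg.confidence >= min_confidence
    ]


# Made-up segments, not real VoxKnesset entries.
segments = [
    Segment("mk_001", 0.0, 45.0, 0.93),    # kept
    Segment("mk_001", 50.0, 62.0, 0.97),   # dropped: only 12 s long
    Segment("mk_002", 10.0, 55.0, 0.41),   # dropped: low alignment confidence
]
kept = filter_segments(segments)
```

In practice the confidence threshold would be tuned against a hand-checked sample, since a threshold that is too strict discards usable speech and one that is too loose lets misaligned transcripts through.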

3. Key Contributions

Dataset Release: The first large-scale, longitudinal Hebrew speech dataset with verified demographic labels and high-quality aligned transcripts.
Longitudinal Benchmarking: A comprehensive evaluation of modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) under aging conditions, quantifying performance degradation as the time gap between enrollment and testing increases.
Cross-Dataset Analysis: Evaluation of age prediction capabilities across four corpora (TIMIT, HPP-Voice, AgeVoxCeleb, VoxKnesset), demonstrating the transferability of age signals across languages and domains.
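
One common way to run a cross-dataset analysis like the one listed above is leave-one-dataset-out evaluation: train on every corpus except one, then test on the held-out corpus. The sketch below shows the loop together with the MAE and $R^2$ metrics. It rests on assumptions: the corpus names and numbers are toy values, each recording is reduced to a single scalar feature, and a least-squares line stands in for whatever regressor the authors actually fit on high-dimensional embeddings.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the labels (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def leave_one_dataset_out(corpora):
    """Train on all corpora except one, then test on the held-out one."""
    results = {}
    for held_out in corpora:
        train_x, train_y = [], []
        for name, (xs, ys) in corpora.items():
            if name != held_out:
                train_x += xs
                train_y += ys
        a, b = fit_line(train_x, train_y)
        test_x, test_y = corpora[held_out]
        preds = [a * x + b for x in test_x]
        results[held_out] = {"mae": mae(test_y, preds), "r2": r2(test_y, preds)}
    return results


# Toy data (assumed): one scalar feature per recording and the speaker's age.
corpora = {
    "corpus_a": ([0.30, 0.50, 0.70], [30.0, 50.0, 70.0]),
    "corpus_b": ([0.40, 0.60, 0.80], [40.0, 60.0, 80.0]),
    "corpus_c": ([0.35, 0.55, 0.65], [37.0, 53.0, 66.0]),
}
results = leave_one_dataset_out(corpora)
```

A domain gap can then be read as the difference between the $R^2$ a model reaches within a corpus and the $R^2$ it reaches when that corpus is held out; that reading of the gap is an assumption here, not a definition taken from the paper.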

4. Experimental Results

A. Demographic Signal Validation

Using pretrained WavLM-Large embeddings, the authors verified that the dataset captures meaningful demographic variation:

Gender: Near-perfect prediction (99.9% accuracy).
Religion & Birthplace: High accuracy (94.8% and 84.9% respectively), with specific success in identifying Arabic-accented and Russian-accented speakers, suggesting the embeddings capture L1 transfer features.

B. Age Prediction (Cross-Sectional vs. Longitudinal)

Within-Dataset: WavLM-Large achieved the best performance (MAE: 4.6–7.3 years across datasets). VoxKnesset yielded an MAE of 6.3 years, comparable to English datasets despite language differences.
Cross-Corpus Transfer (leave-one-dataset-out, LODO): VoxKnesset proved to be the most transferable target, showing the smallest domain gap ($\Delta R^2 = 0.09$) when models were trained on the other datasets and tested on it.
The Aging Signal Paradox:
- Cross-Sectional Models: When trained on absolute age and applied longitudinally, they failed to capture within-speaker aging. Predicted age differences plateaued at 1–2 years regardless of the actual time elapsed.
- Longitudinal Models: Models trained specifically on paired embeddings from the same speaker over time (predicting elapsed time) successfully recovered a meaningful temporal signal. Wav2Vec2-XLSR-1B and WavLM-Large showed monotonic scaling with true elapsed time, whereas ECAPA-TDNN remained flat.
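
The longitudinal setup described above depends on building within-speaker pairs labeled with elapsed time. The following is a minimal sketch of that pairing step, not the authors' implementation: the `(speaker_id, year, embedding)` tuple layout, the `min_gap_years` cutoff, and the two-dimensional toy embeddings are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_longitudinal_pairs(recordings, min_gap_years=1.0):
    """Pair recordings of the same speaker and label each pair
    with the time elapsed between them.

    recordings: iterable of (speaker_id, year, embedding) tuples.
    Returns a list of (earlier_embedding, later_embedding, gap_years).
    """
    by_speaker = defaultdict(list)
    for speaker_id, year, embedding in recordings:
        by_speaker[speaker_id].append((year, embedding))

    pairs = []
    for items in by_speaker.values():
        items.sort(key=lambda item: item[0])  # chronological order
        for (year_a, emb_a), (year_b, emb_b) in combinations(items, 2):
            gap = year_b - year_a
            if gap >= min_gap_years:  # assumed cutoff to skip near-duplicates
                pairs.append((emb_a, emb_b, gap))
    return pairs


# Toy recordings (assumed values, not taken from the dataset).
recordings = [
    ("mk_001", 2010.0, [0.10, 0.20]),
    ("mk_001", 2015.0, [0.15, 0.22]),
    ("mk_001", 2024.0, [0.30, 0.25]),
    ("mk_002", 2012.0, [0.50, 0.40]),
    ("mk_002", 2012.5, [0.51, 0.41]),  # 0.5-year gap, below the cutoff
]
pairs = build_longitudinal_pairs(recordings)
gaps = sorted(gap for _, _, gap in pairs)
```

A regressor trained on such pairs predicts `gap_years` from the two embeddings, for example from their difference. Because both recordings come from the same speaker, the target isolates change over time from differences between speakers, which is the signal the cross-sectional models missed.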

C. Speaker Verification Degradation

The study quantified the practical impact of aging on speaker verification (SV):

Performance Drop: Even for the strongest models, the Equal Error Rate (EER) more than doubles over a 15-year gap.
- Example: For the best model, EER rose from 2.15% (short gap) to 4.58% (15-year gap).
Implication: Current SV systems are highly sensitive to the "enrollment-test" time gap, necessitating aging-aware re-enrollment strategies.
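
To make the metric concrete, the sketch below computes an approximate EER from verification scores and reports it separately per enrollment-test gap bucket. It is an illustration under assumptions: the scores are invented, the bucket names are hypothetical, and the threshold sweep is a simple approximation rather than the interpolated EER that evaluation toolkits typically report.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. Returns the
    mean of FAR and FRR at the threshold where the two are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = float("inf"), None
    for threshold in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer


# Invented similarity scores, grouped by enrollment-test time gap.
trials_by_gap = {
    "short_gap": {
        "genuine": [0.90, 0.80, 0.85, 0.70],
        "impostor": [0.10, 0.20, 0.30, 0.75],
    },
    "long_gap": {
        "genuine": [0.90, 0.72, 0.50, 0.40],   # same speaker, years apart
        "impostor": [0.10, 0.20, 0.60, 0.75],
    },
}
eer_by_gap = {
    name: equal_error_rate(t["genuine"], t["impostor"])
    for name, t in trials_by_gap.items()
}
```

In this toy data the genuine scores in the long-gap bucket sit lower, so they overlap more with impostor scores and the EER rises. That is the same mechanism behind the degradation reported above, although the toy numbers bear no relation to the paper's results.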

5. Significance and Future Impact

Hebrew NLP: VoxKnesset significantly advances Hebrew speech processing, providing a rare resource for a morphologically rich language that has been underserved compared to English.
Biometric Security: The dataset provides empirical evidence that standard speaker verification systems degrade significantly over time, highlighting the need for aging-robust biometric systems and adaptive re-enrollment protocols.
Scientific Utility: By combining verified ground truth with a relatively consistent recording environment, VoxKnesset helps researchers disentangle biological aging from channel drift (recording quality changes), a challenge that has plagued previous web-sourced longitudinal studies.
Open Science: The authors publicly release the dataset and the full processing pipeline to support the development of aging-aware speech technologies.

Conclusion

VoxKnesset fills a critical gap in speech research by offering a large-scale, verified, longitudinal dataset. The experiments demonstrate that while general-purpose embeddings encode age, they often fail to model within-speaker aging without specific longitudinal training. Furthermore, the experiments show that vocal aging is a major source of error in speaker verification, more than doubling error rates over a 15-year gap, which underscores the need for systems designed to handle temporal drift.