Imagine you have a very smart, super-advanced robot detective. Its job is to listen to people talking and figure out two things:
- Is this person's brain working normally, or are they showing early signs of memory loss (Cognitive Impairment)?
- Is this person feeling depressed?
This paper is like a report card for that robot detective. The researchers built the detective using a modern "self-supervised" learning system (called Wav2Vec 2.0), which is like teaching the robot by letting it listen to thousands of hours of random speech without a teacher telling it what to do.
Here is the breakdown of what they found, using some everyday analogies:
1. The Robot is Great at the Main Job, But Has "Blind Spots"
The robot detective is incredibly good at spotting memory loss. In fact, it was much better than the old-school methods (which were like using a magnifying glass to look for clues). The new robot could correctly identify memory issues about 80% of the time.
However, the robot isn't fair to everyone. It has "blind spots" based on who is talking:
- The Gender Gap: The robot is much better at analyzing men's voices than women's voices.
  - Analogy: Imagine a security guard who is excellent at spotting a thief in a red hat but keeps missing the thief in a blue hat. The robot often thought healthy women were sick (false alarms) and missed some women who were actually sick. It was as if the robot's "ears" were tuned to frequencies that men's voices hit squarely, while women's voices often fell outside the range.
- The Age Gap: The robot is better at analyzing older adults (65+) than younger adults (under 65).
  - Analogy: Think of the robot as a historian who has read a million books about the 1950s but has never read a book about the 2020s. When an older person speaks, the robot recognizes the patterns easily. When a younger person speaks, the robot gets confused because the "acoustic patterns" of memory loss look different in younger people, and the robot hasn't learned that language well.
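The kind of subgroup audit behind these "blind spot" findings can be sketched in a few lines of plain Python. The group labels, predictions, and numbers below are illustrative toy data, not the paper's figures; the point is simply that accuracy and false-alarm rate are computed separately per group instead of once overall:

```python
from collections import defaultdict

def subgroup_report(records):
    """Compute accuracy and false-alarm rate separately for each group.

    Each record is (group, true_label, predicted_label), where label 1
    means "impaired" and 0 means "healthy".
    """
    stats = defaultdict(lambda: {"correct": 0, "total": 0,
                                 "false_alarms": 0, "healthy": 0})
    for group, truth, pred in records:
        s = stats[group]
        s["total"] += 1
        s["correct"] += (truth == pred)
        if truth == 0:                        # person is actually healthy
            s["healthy"] += 1
            s["false_alarms"] += (pred == 1)  # but was flagged as impaired
    report = {}
    for group, s in stats.items():
        report[group] = {
            "accuracy": s["correct"] / s["total"],
            "false_alarm_rate": (s["false_alarms"] / s["healthy"]
                                 if s["healthy"] else 0.0),
        }
    return report

# Toy data: the model does well on men but over-flags healthy women.
records = [
    ("male", 1, 1), ("male", 0, 0), ("male", 1, 1), ("male", 0, 0),
    ("female", 1, 0), ("female", 0, 1), ("female", 0, 1), ("female", 1, 1),
]
print(subgroup_report(records))
```

On this toy data the model scores perfectly for men but only 25% for women, with every healthy woman raising a false alarm — exactly the pattern a single overall accuracy number would hide.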
2. The "Depression" Problem
When the researchers asked the robot to detect depression in people who already had memory issues, the robot struggled.
- Analogy: It's like asking a chef who is a master at baking cakes to suddenly make a perfect soufflé. The tools are similar (the kitchen), but the techniques are totally different. The robot got the "cake" (memory loss) right, but it kept burning the "soufflé" (depression). The signals for depression in speech are much subtler and harder to catch.
3. Why Did This Happen? (The Training Data)
The researchers found that the robot learned these biases because of what it was trained on.
- Analogy: Imagine you are teaching a child to recognize animals. If you only show them pictures of dogs from one specific breed (say, Golden Retrievers), the child will think all dogs look like Golden Retrievers. If they see a Chihuahua, they won't recognize it.
- The robot was trained on a massive dataset of speech (mostly from English speakers), but that dataset likely had more men and older people than women and younger people. The robot learned the "average" voice, which happened to sound more like the majority group. When a minority group (like younger women) spoke, the robot didn't have a good reference point, so it made mistakes.
4. The "Cross-Over" Test
The researchers tried to see if the robot could use its memory-loss skills to detect depression, and vice versa.
- Result: It failed completely.
- Analogy: It's like trying to use a map of the ocean to navigate a mountain trail. Even though both involve "travel," the terrain is so different that the map is useless. The sounds of depression and the sounds of memory loss are distinct; you can't just swap the tools.
The Big Takeaway
This paper is a wake-up call for the medical world.
- The Good News: We have powerful AI tools that can help detect Alzheimer's and memory loss early, which is a huge step forward.
- The Bad News: If we just plug these tools into hospitals without checking, they might work great for older men but fail for younger women or people with depression. This could lead to unfair healthcare, where some people get diagnosed and others don't, simply because of their gender or age.
The Conclusion: Before we let these AI doctors take over, we need to make sure they are "fair." We need to train them on more diverse groups of people so they don't have blind spots. We can't just look at the overall accuracy score; we have to check if the robot is treating everyone equally.
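Why the overall accuracy score isn't enough can be shown with a tiny made-up example (the group sizes and scores here are invented for illustration, not taken from the paper): when one group dominates the test set, the model can fail a smaller group completely and still look good on average.

```python
# Illustrative only: a model can look accurate overall while
# failing one group entirely.

def accuracy(pairs):
    """Fraction of (truth, prediction) pairs that match."""
    return sum(t == p for t, p in pairs) / len(pairs)

# 90 older men the model handles well, 10 younger women it gets wrong.
older_men = [(1, 1)] * 45 + [(0, 0)] * 45       # all correct
younger_women = [(1, 0)] * 5 + [(0, 1)] * 5     # all wrong

print(f"overall: {accuracy(older_men + younger_women):.2f}")    # 0.90
print(f"older men: {accuracy(older_men):.2f}")                  # 1.00
print(f"younger women: {accuracy(younger_women):.2f}")          # 0.00
```

A 90% headline score here conceals a group for whom the tool never works, which is the paper's core warning about deploying these models unaudited.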