MOS-Bias: From Hidden Gender Bias to Gender-Aware Speech Quality Assessment

This paper reveals a systematic gender bias in speech quality assessment where male listeners consistently rate audio higher than female listeners, particularly for low-quality speech, and proposes a gender-aware model that learns distinct scoring patterns to improve prediction accuracy and equity.

Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Erica Cooper, Ryandhimas E. Zezario, Hsin-Min Wang, Hung-yi Lee, Yu Tsao

Published Thu, 12 Ma

Imagine you are a chef who just cooked a new dish. To see if it's good, you ask a group of people to taste it and give it a score from 1 to 5. You take all their scores, add them up, and divide by the number of people to get the Average Score. This average is supposed to be the "truth" about how good the food is.
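In speech research, that "Average Score" is called the Mean Opinion Score (MOS). A minimal sketch of how it is computed (the ratings below are invented for illustration):

```python
# Mean Opinion Score (MOS): the plain average of listener ratings
# on a 1-to-5 scale. The ratings below are made up for illustration.
def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings into a single MOS."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)

ratings = [4, 5, 3, 4, 4]          # five listeners rate one voice sample
mos = mean_opinion_score(ratings)  # (4 + 5 + 3 + 4 + 4) / 5
print(mos)                         # 4.0
```

This single number is what gets treated as the "truth" about the sample, which is exactly where the problem the paper describes begins.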

This paper is about a hidden problem with that "Average Score" when it comes to judging computer-generated voices (like Siri or Alexa). The researchers discovered that the "Average" isn't actually neutral—it secretly leans toward how men hear things, often ignoring how women hear things.

Here is the story of their discovery, explained simply:

1. The Hidden Bias: The "Louder" Voice

The researchers looked at thousands of voice samples and the scores given by male and female listeners. They found a surprising pattern:

  • Men consistently gave higher scores than women.
  • If a voice sounded "okay" but not great, men might say, "It's a 3.5!" while women would say, "It's a 2.5."
  • The Analogy: Imagine two groups of people grading a movie. The "Men's Club" is very generous with their stars, while the "Women's Club" is more critical. If you mix their scores together, the final grade looks like a "B," but it actually hides the fact that the women thought it was a "C."

The Twist: This gap wasn't the same for every voice.

  • Bad Voices: When the voice sounded terrible (robotic, glitchy), the gap was huge. Men were much more forgiving than women.
  • Good Voices: When the voice sounded perfect, both men and women agreed, and the gap disappeared.
  • Why it matters: You can't just "fix" this by adding a simple math correction (like subtracting 0.5 from men's scores) because the gap changes depending on how bad the voice is. It's a moving target.
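A toy illustration of why one fixed correction can't work (all numbers invented): if the male-female gap shrinks as quality improves, a single constant offset under-corrects bad samples and over-corrects good ones.

```python
# Toy illustration (invented numbers): the male-female rating gap shrinks
# as audio quality improves, so no single offset can fix every sample.
samples = {
    # quality level: (mean male score, mean female score)
    "bad":   (3.0, 2.0),   # gap = 1.0
    "okay":  (3.5, 3.0),   # gap = 0.5
    "great": (4.8, 4.8),   # gap = 0.0
}

fixed_offset = 0.5  # one constant correction subtracted from male scores
for name, (male, female) in samples.items():
    corrected = male - fixed_offset
    residual = corrected - female   # leftover bias after "correcting"
    print(f"{name:5s} residual: {residual:+.1f}")
# residuals: +0.5, +0.0, -0.5 — one offset cannot zero out all three.
```

Because the residual bias depends on quality, the fix has to be learned per sample, which motivates the model described next.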

2. The Robot's Mistake: Learning the Wrong Standard

Next, the researchers trained a computer (an AI) to predict these scores automatically. They fed the AI the "Average Scores" (the mixed bag of men's and women's ratings).

  • The Result: The AI learned to be a "Male Listener." Even though the AI didn't know the gender of the people who rated the voices, it started predicting scores that matched the men's opinions much better than the women's.
  • The Metaphor: Imagine a student studying for a test using the teacher's answer key — but that key was written by a committee in which the men's opinions carried more weight. The student becomes an expert at guessing what the men think but never learns what the women think. In the same way, the AI became biased without ever knowing it.
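This kind of hidden lean can be measured by scoring a predictor's error separately against male-only and female-only mean ratings, not just against the mixed average. A toy sketch (all numbers invented; this is not the paper's data):

```python
# Toy check (invented numbers): compare a predictor's error against
# male-only and female-only mean ratings for the same utterances.
def mae(preds, targets):
    """Mean absolute error between predictions and target ratings."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Hypothetical per-sample mean ratings for three utterances.
male_mos    = [3.5, 3.0, 4.8]
female_mos  = [2.5, 2.4, 4.8]
predictions = [3.4, 2.9, 4.7]   # what a mean-trained model might output

print("MAE vs male listeners:  ", round(mae(predictions, male_mos), 2))
print("MAE vs female listeners:", round(mae(predictions, female_mos), 2))
# A much lower male-side MAE is the signature of the inherited bias.
```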

3. The Solution: The "Split-Brain" AI

To fix this, the researchers built a new kind of AI, which they call MOS-Bias.

  • How it works: Instead of asking the AI to give one single score, they gave it a "Split-Brain" architecture.
    • Brain A tries to predict the "Average" score.
    • Brain B is told to pretend it is a specific group (like a man) and predict what that group would score.
    • Brain C is told to pretend it is the other group (like a woman) and predict their score.
  • The Magic: They didn't tell the AI "This is a man" or "This is a woman" explicitly. Instead, they gave it two secret codes (0 and 1) and let the AI figure out on its own that "Code 0" acts like women and "Code 1" acts like men.
  • The Outcome: The AI got smarter. It learned that men and women hear things differently. Because it understood these two different perspectives, it actually got better at predicting the average score too!
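The "Split-Brain" idea above can be sketched as a shared encoder feeding multiple output heads, where the group head is conditioned on a learned code (0 or 1) instead of an explicit gender label. This is a simplified NumPy illustration under my own assumptions, not the authors' implementation — every layer size, weight, and name here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HID_DIM, CODE_DIM = 8, 16, 4

# Shared encoder weights (a stand-in for a real speech encoder).
W_enc = rng.normal(size=(FEAT_DIM, HID_DIM))
# Learned embeddings for the two anonymous listener codes (0 and 1).
# During training, the model itself decides what each code comes to mean.
code_emb = rng.normal(size=(2, CODE_DIM))
# Output heads: one predicts the overall average ("Brain A"); the other,
# conditioned on a code, predicts a group score ("Brain B" / "Brain C").
w_mean  = rng.normal(size=HID_DIM)
w_group = rng.normal(size=HID_DIM + CODE_DIM)

def forward(audio_feats, code):
    """Return (predicted mean MOS, predicted group MOS for `code`)."""
    h = np.tanh(audio_feats @ W_enc)           # shared representation
    mean_pred = h @ w_mean                     # average-score head
    g = np.concatenate([h, code_emb[code]])    # condition on the code
    group_pred = g @ w_group                   # group-score head
    return mean_pred, group_pred

x = rng.normal(size=FEAT_DIM)                  # fake audio features
mean_hat, g0_hat = forward(x, code=0)
_, g1_hat = forward(x, code=1)
print(mean_hat, g0_hat, g1_hat)                # three distinct predictions
```

In training, the mean head would be fit to the overall MOS and the group head to each group's own mean ratings; sharing the encoder is what lets the group-level signal improve the average prediction as well.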

Why Should You Care?

This paper is a wake-up call for the tech world.

  1. Fairness: If we only use "Average" scores to judge voice technology, we might be building products that sound great to men but terrible to women, and we won't even know it.
  2. Better Tech: By acknowledging that people hear differently, we can build AI that is fairer and more accurate for everyone, not just one group.

In a nutshell: The researchers found that the "Average" score for voice quality is secretly biased toward men. They proved that computers trained on these averages learn this bias automatically. Their solution is to teach computers to understand that men and women have different "ears," which makes the computers smarter and the technology fairer for everyone.