Are you sure? Measuring models bias in content… — Plain-Language Explanation

Original authors: Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci

Published 2026-03-12. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you have a very strict, automated bouncer at the door of a giant digital party (social media). This bouncer's job is to decide who gets kicked out for being rude or hateful. Usually, we judge how good this bouncer is by counting how many times they correctly identified a rude person. If they get 90% right, we say, "Great job!"

But this paper asks a tricky question: What if the bouncer is actually very confident about the wrong people, and unsure about the right people?

Here is the story of the paper, broken down into simple concepts and analogies.

1. The Problem: The "Confident" Bouncer

The authors noticed that AI models (the bouncers) often make mistakes that look like bias. For example, a model might be very sure of its verdict when it is checked against the judgment of a white man, but very unsure when the same verdict is checked against the judgment of a non-white woman, even if both people are looking at the same comment.

Usually, we only look at the score (how many people they got right). This paper says that's not enough. We need to look at the bouncer's confidence.

The Analogy: Imagine two students taking a test.
- Student A gets 80% right, but they are 100% sure about every answer.
- Student B gets 80% right, but they are shaking with nervousness on the questions about history, even though they guessed correctly.

Student A is equally confident on every topic, so they feel "fair" and "reliable." Student B's confidence drops on one particular topic, so they feel "risky." This paper argues that in content moderation, we need to catch Student B before they start making real-world decisions.

2. The New Tool: The "Uncertainty Meter"

The researchers invented a new way to measure the bouncer. Instead of just asking, "Did you get it right?" they ask, "How sure are you?"

They used a mathematical tool called Conformal Prediction. Think of this as a special "confidence badge" the bouncer wears.

If the badge is Green, the bouncer is very sure.
If the badge is Red, the bouncer is confused or unsure.

The researchers checked if the bouncer's "Red" and "Green" badges were distributed fairly among different groups of people (White men, White women, Non-white men, Non-white women).

3. The Experiment: Testing 11 Different Bouncers

They tested 11 different AI models (some are older, some are brand new "Large Language Models" like the ones you might chat with). They used two huge datasets of comments that were labeled by real humans from different backgrounds.

The Big Discovery:

The Score vs. The Feeling: They found that a model could have a high score (get many answers right) but still wear a "Red Badge" (low confidence) when predicting how non-white people labeled a comment.
The Hidden Bias: The models were often confident when predicting the judgments of white men, but nervous and unsure when predicting the judgments of non-white people.
The Analogy: It's like a teacher who is very confident grading the essays of students from their own hometown, but second-guesses every single word written by a student from a different country, even if the student is writing perfectly. The teacher might still give the right grade, but that hesitation shows a hidden bias.

4. Grouping the Humans: The "Uncertainty Fingerprint"

The researchers also looked at the humans who labeled the data. They realized that different people see "hate speech" differently based on their background.

They created a "fingerprint" for each human annotator based on how uncertain the AI was about that person's labels.
They then grouped these humans into clusters.
The Result: Some AI models grouped all the humans together nicely (Fair). Other models created groups where the "Non-white women" were all stuck in a cluster where the AI was constantly confused and unsure (Unfair).

5. The Takeaway: Why This Matters

The paper concludes that measuring confidence is a better way to find bias than just measuring accuracy.

The Old Way: "Did the AI get the answer right?" (Sometimes yes, but it might have been a lucky guess or a biased guess).
The New Way: "Did the AI feel comfortable making that decision for everyone?"

If an AI is confident about the views of white men but nervous about the views of non-white women, it suggests the AI was trained on data that doesn't represent non-white women's perspectives well. Even if it gets the job done, it's a ticking time bomb for fairness.

Summary in One Sentence

This paper teaches us that to build truly fair AI bouncers for the internet, we shouldn't just count their correct decisions; we should listen to their nervousness, because that nervousness reveals who they are ignoring or misunderstanding.

1. Problem Statement

Automatic content moderation is essential for social media safety, yet Language Model (LM) classifiers often perpetuate racial and social biases against vulnerable groups (e.g., non-white people and women).

The Gap: Current fairness evaluation relies heavily on performance metrics like the F1 score. However, high accuracy does not guarantee fairness; models may achieve high F1 scores while remaining systematically uncertain or misaligned with the perspectives of minority groups.
The Challenge: Measuring model bias in subjective tasks (like hate speech detection) is difficult because "ground truth" is often an aggregation of diverse human opinions. Standard metrics fail to capture the confidence or uncertainty a model has when predicting labels provided by specific socio-demographic groups.

2. Methodology

The authors propose an unsupervised approach using Conformal Prediction to quantify model uncertainty as a proxy for bias. They evaluate 11 models (8 fine-tuned LMs and 3 zero-shot LLMs) on two disaggregated hate speech corpora: SBIC (Social Bias Inference Corpus) and CREHate.

Core Framework: Conformal Prediction

Instead of standard probability scores, the authors use Conformal Prediction to assess how well model predictions align with observed outcomes.

Brier Score ($b$): Used as a conformity score to measure the alignment between the model's predicted probability distribution and the true label. Lower scores indicate better conformity (less uncertainty).
Conformity Delta ($\Delta$): Measures the variability in model confidence when comparing an individual annotator's label against the aggregated (majority vote) gold standard.
- $\Delta = b(\text{individual label}) - b(\text{aggregated label})$
- A high $\Delta$ indicates the model is significantly more uncertain about an individual's perspective than the consensus.
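
The delta above can be illustrated with a minimal Python sketch. It assumes a binary task and the single-probability form of the Brier score, $(p - y)^2$, which keeps $\Delta$ within the $[-1, 1]$ range used for the bins described later; the paper's exact formulation may differ. The function names and the example probability are hypothetical.

```python
def brier_score(p_hate: float, label: int) -> float:
    """Squared gap between the predicted probability of 'hateful' and a 0/1 label."""
    return (p_hate - label) ** 2


def conformity_delta(p_hate: float, individual_label: int, aggregated_label: int) -> float:
    """Delta = b(individual label) - b(aggregated label); lies in [-1, 1]."""
    return brier_score(p_hate, individual_label) - brier_score(p_hate, aggregated_label)


# Hypothetical case: the model assigns 0.9 to "hateful" and the majority
# vote says hateful (1). One annotator disagrees (0), another agrees (1).
delta_disagree = conformity_delta(0.9, individual_label=0, aggregated_label=1)
delta_agree = conformity_delta(0.9, individual_label=1, aggregated_label=1)
```

Here `delta_disagree` comes out at roughly 0.8 (the model conforms far less to the dissenting annotator than to the consensus), while `delta_agree` is exactly zero because the two labels coincide.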

Proposed Metrics

To analyze bias across four socio-demographic groups (White Men, White Women, Non-White Men, Non-White Women), the authors introduce two metrics:

Uncertainty Divergence:
- Converts conformity deltas into a distribution (negative, zero, positive).
- Computes the Kullback-Leibler (KL) Divergence between the distribution of the total dataset and the distribution of a specific demographic group.
- Goal: To determine if a model exhibits systematically higher uncertainty for specific groups compared to the general population.
Demographic Divergence:
- Represents each annotator as a 40-dimensional vector based on the frequency of their uncertainty values (bins from -1 to 1).
- Clusters annotators using K-Means based on these uncertainty profiles.
- Computes the Jensen-Shannon Divergence (JSD) of demographic distributions across the resulting clusters.
- Goal: To assess if the model's uncertainty naturally segregates annotators by demographics. Low divergence implies fairness (uncertainty is not driven by demographics); high divergence implies bias.
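
Both metrics can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the KL direction (group against whole dataset), the smoothing constant, and the use of a size-weighted generalized JSD over per-cluster demographic distributions are all choices made here. The K-Means step is left out, so `demographic_divergence` takes cluster ids as given; all function names are hypothetical.

```python
import math
from collections import Counter


def sign_distribution(deltas, tol=1e-9, smoothing=1e-6):
    """Shares of negative, zero, and positive deltas (smoothed to avoid log(0))."""
    counts = [smoothing, smoothing, smoothing]
    for d in deltas:
        if d < -tol:
            counts[0] += 1
        elif d > tol:
            counts[2] += 1
        else:
            counts[1] += 1
    total = sum(counts)
    return [c / total for c in counts]


def uncertainty_divergence(group_deltas, all_deltas):
    """KL(group || whole dataset) in nats; the direction is an assumption."""
    p = sign_distribution(group_deltas)
    q = sign_distribution(all_deltas)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def fingerprint(deltas, n_bins=40):
    """Relative frequency of one annotator's deltas over equal bins on [-1, 1].

    Assumes at least one delta and that every delta lies in [-1, 1].
    """
    counts = [0] * n_bins
    for d in deltas:
        idx = min(int((d + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(deltas) for c in counts]


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def demographic_divergence(cluster_ids, groups):
    """Generalized JSD among per-cluster demographic distributions.

    Clusters are weighted by size, so the mixture equals the overall
    demographic distribution of the annotators.
    """
    names = sorted(set(groups))
    by_cluster = {}
    for c, g in zip(cluster_ids, groups):
        by_cluster.setdefault(c, Counter())[g] += 1
    mixture = [0.0] * len(names)
    weighted_entropy = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        dist = [counts[g] / size for g in names]
        weight = size / len(groups)
        mixture = [m + weight * d for m, d in zip(mixture, dist)]
        weighted_entropy += weight * entropy(dist)
    return entropy(mixture) - weighted_entropy


# Hypothetical toy inputs: two clusters, two demographic groups "A" and "B".
same = uncertainty_divergence([0.5, -0.5], [0.5, -0.5])
mixed = demographic_divergence([0, 0, 1, 1], ["A", "B", "A", "B"])
split = demographic_divergence([0, 0, 1, 1], ["A", "A", "B", "B"])
```

In the toy inputs, `same` is zero because the group matches the whole dataset, `mixed` is zero because each cluster mirrors the overall population, and `split` is one bit because the clusters coincide exactly with the demographic groups. In practice the cluster ids would come from running K-Means on the stacked `fingerprint` vectors.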

3. Key Contributions

Unsupervised Bias Detection: Introduced a method to benchmark model fairness using uncertainty rather than labeled ground truth, leveraging Conformal Prediction.
Novel Metrics: Defined Uncertainty Divergence and Demographic Divergence to quantify the alignment between model confidence and socio-demographic groups.
Comprehensive Benchmark: Evaluated 11 state-of-the-art models (including BERT-based fine-tuned models and LLMs such as Mistral, Olmo, and Bloom) on two major perspectivist datasets.
User Representation: Demonstrated that annotators can be effectively represented and clustered by their "uncertainty fingerprints," revealing hidden subgroups and biases.

4. Key Results

RQ1: Is uncertainty a predictor of bias?

Decoupling of Accuracy and Fairness: There is no statistically significant correlation between F1 scores and Uncertainty Divergence (p-values > 0.1). Models with high F1 scores can still exhibit high uncertainty (misalignment) with minority groups.
Systematic Bias: Most models show significantly higher uncertainty (lower conformity) when predicting labels from non-white people compared to white people.
LLM Behavior: Large Language Models (LLMs) generally exhibit higher average uncertainty than fine-tuned LMs, suggesting they are less calibrated for specific demographic perspectives despite their general capabilities.
Gender Patterns: In specific datasets, models often show lower uncertainty on labels provided by women than on labels provided by men, though race remains the dominant factor in divergence.

RQ2: Can fairness be assessed via user representation?

Clustering Insights: When annotators are clustered by uncertainty, the resulting groups often segregate by demographics.
Model Comparison:
- Mistral-7B: Achieved the best trade-off, showing low Uncertainty Divergence and low Demographic Divergence, indicating it is relatively fair across gender and ethnicity.
- MuRIL: Showed the lowest overall uncertainty but the highest Demographic Divergence, meaning it is very confident but systematically biased (high confidence in wrong predictions for specific groups).
- Olmo-7B & Bloom: Showed negative Demographic Divergence values (a plain JSD is non-negative, so these are presumably normalized scores; see the original paper), indicating uneven distribution of uncertainty across demographics.

5. Significance and Implications

Beyond F1 Scores: The paper argues that relying solely on performance metrics (F1) masks hidden biases. A model can be "accurate" on average while failing to respect the perspectives of vulnerable minorities.
Pre-training Biases: The systematic uncertainty regarding non-white annotators suggests that pre-training data lacks diverse perspectives, embedding biases that fine-tuning alone cannot fully remove.
Guiding Model Selection: Uncertainty metrics can serve as a "blueprint" for selecting content moderation models that are safer for diverse populations before deployment.
Limitations: The study is limited to binary gender/ethnicity categories (excluding non-binary identities) and relies on existing datasets that may have their own annotation biases.

Conclusion

The authors conclude that uncertainty is a powerful, unsupervised indicator of bias. By measuring how confident a model is when interpreting the views of different socio-demographic groups, researchers can identify and mitigate systemic discrimination in content moderation systems that traditional accuracy metrics fail to detect.

Are you sure? Measuring models bias in content moderation through uncertainty