Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

Imagine you have a group of very smart, super-fast robots (Large Language Models, or LLMs) that have read almost everything on the internet. They are great at writing stories, answering questions, and giving advice. But, just like humans, they can pick up bad habits and stereotypes from what they read.

This paper is like a detective story where the authors go to Nepal to see if these robots are fair when talking about Nepali culture, or if they are secretly carrying around old-fashioned, unfair ideas about gender, race, and social status.

Here is the breakdown of their investigation using simple analogies:

1. The Problem: The "Western Lens"

Most of these robots were trained on data from the US and Europe. Imagine trying to judge a Nepali village using a rulebook written for New York City. It doesn't fit. The authors noticed that while we know these robots are biased in English, we don't know how they act in Nepali, a place with 120+ languages and a complex social structure involving castes and ethnic groups.

2. The Tool: A New "Bias Test" (EquiText-Nepali)

To test the robots, the authors built a custom exam called EquiText-Nepali.

The Analogy: Imagine a "Spot the Difference" game. They created over 2,400 pairs of sentences.
- Pair A (The Stereotype): "Men are naturally better at farming than studying."
- Pair B (The Truth): "Many men excel in both farming and studying."
They asked the robots to read these pairs and decide which one they agree with.

3. The Two-Part Test: The "Dual-Metric" Approach

The authors realized that asking a robot "Do you agree?" isn't enough. People (and robots) might say the right thing but do the wrong thing. So, they used a two-part test:

Part 1: The "Yes/No" Interview (Explicit Agreement)
- The Analogy: You ask the robot, "Do you believe women make bad engineers?"
- What they measured: How often does the robot say "Yes"? This is Explicit Bias. It's what the robot says it believes.
Part 2: The "Finish the Sentence" Game (Implicit Completion)
- The Analogy: You give the robot the start of a sentence: "In Nepal, Dalits are..." and let it finish the story on its own.
- What they measured: Does the robot automatically finish the sentence with something negative or stereotypical, even if you didn't ask it to? This is Implicit Bias. It's what the robot does when it's not being watched.

4. The Big Discovery: The "Split Personality"

The results were surprising.

The Robots were "Polite" but "Prejudiced": When asked directly (Part 1), the robots didn't agree with stereotypes very often (about 36–43% of the time). They seemed fairly neutral.
But they "Slipped Up" constantly: When asked to write a story or finish a sentence (Part 2), they fell back into stereotypes 74–75% of the time.
The Metaphor: It's like a person who says, "I don't believe in racism," but when they tell a joke or write a story, they accidentally use racist tropes. The bias is hidden deep in their "muscle memory," not just in their stated opinions.

5. The "Temperature" Knob

The researchers also played with the robot's "creativity knob" (called Temperature).

Low Temperature (Strict): The robot is very logical and repetitive.
High Temperature (Chaotic): The robot is wild and creative.
The Finding: They found a weird "U-shape." The robots were actually most likely to be stereotypical when they were "moderately creative" (Temperature 0.3). When they were either very strict or very wild, the bias dropped slightly. This means you can't just "turn up the chaos" to fix the problem.

6. Why This Matters

The authors found that race and social caste biases were the hardest to fix in the robots' writing, even more than gender bias. This suggests that the data the robots learned from (the internet) has a lot of hidden prejudice against these specific groups in Nepal.

The Takeaway

This paper is a wake-up call. It tells us that:

We can't trust robots to be fair just because they say they are. We have to watch what they write, not just what they say.
One size does not fit all. A robot trained on American data might be fair in the US but very unfair in Nepal.
We need local experts. To fix this, we need datasets and tests built by people who actually live in those cultures, not just imported from the West.

In short, the authors built a mirror to show the robots their own reflection in a Nepali context, and the reflection showed that they still have a lot of work to do to be truly fair.

Here is a detailed technical summary of the paper "Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context."

1. Problem Statement

Large Language Models (LLMs) are increasingly deployed globally, yet their potential to perpetuate social and cultural biases is poorly understood in underrepresented, low-resource contexts, particularly in the Global South. Existing bias evaluation research predominantly focuses on Western contexts and English-language datasets, failing to capture the nuances of diverse sociocultural structures like caste, ethnicity, and regional religious practices found in countries such as Nepal.

The Gap: Current benchmarks (e.g., StereoSet, BOLD) lack cultural granularity for Nepali society. Furthermore, prior studies often examine bias in isolation (either explicit agreement or generative output) rather than jointly, and few studies have analyzed how decoding parameters (temperature, top-p) influence bias expression in non-Western contexts.
The Objective: To systematically evaluate gender, racial, and sociocultural biases in state-of-the-art LLMs within the Nepali cultural context using a novel, culturally grounded dataset and a dual-metric evaluation framework.

2. Methodology

A. Dataset Construction: EquiText-Nepali

The authors introduced EquiText-Nepali, a Croissant-compliant dataset containing over 2,400 sentence pairs.

Structure: Each pair consists of a stereotypical sentence and an anti-stereotypical counterpart.
Domains: The data spans three primary bias categories:
1. Gender: Professional, educational, and political roles.
2. Race/Ethnicity: Community stereotypes and ethnic dynamics.
3. Sociocultural: Caste-based discrimination, interfaith relations, and religious practices.
Annotation: The dataset was constructed using empirical research (e.g., Nepal's 2021 Census) and validated by a diverse team of Nepali-English bilingual annotators. It achieved a Label Validation Agreement (LVA) of ~92.1%, ensuring cultural accuracy and reliability.

B. Evaluation Framework: Dual-Metric Bias Assessment (DMBA)

The study proposes a framework that measures bias through two complementary mechanisms:

Explicit Agreement (Belief Bias):
- Models are prompted to rate their agreement with stereotypical vs. anti-stereotypical statements on a scale of 0–100.
- Metric: Bias_agreement calculates the proportion of pairs where the model prefers the stereotypical statement ( $A_{stereo} > A_{anti-stereo}$ ) and the magnitude of this preference.
Implicit Completion (Behavioral Bias):
- Models are given truncated prompts (first 6 tokens of a stereotypical sentence) and asked to generate open-ended continuations (up to 200 tokens).
- Metric: Bias_completion determines if the generated text aligns more closely with the stereotypical reference or the anti-stereotypical reference using Cosine Similarity on TF-IDF vector representations. A threshold ( $\tau=0.7$ ) classifies the output as stereotypical or not.

C. Experimental Setup

Models Evaluated: Seven state-of-the-art LLMs (Proprietary: GPT-4o-mini, Claude-3/4-Sonnet, Gemini-2.0-Flash/Lite; Open-source: Llama-3-70B, Mistral-Nemo).
Decoding Configurations: Experiments were run across three settings to test robustness:
1. Deterministic (Temperature $T=0$ , Top-p $=1.0$ ).
2. Temperature Sampling ( $T=0.7$ , Top-p $=1.0$ ).
3. Nucleus Sampling ( $T=0.7$ , Top-p $=0.85$ ).
Sensitivity Analysis: A stratified subset was used to analyze the impact of varying $T$ (0.0 to 1.0) and Top-p (0.3 to 1.0) on bias metrics.

3. Key Contributions

EquiText-Nepali Dataset: The first culturally grounded, Croissant-compliant benchmark specifically designed to evaluate intersectional biases (gender, race, caste) in the Nepali context.
Dual-Metric Framework (DMBA): A novel methodology that jointly quantifies explicit agreement and implicit generative tendencies, revealing that these two dimensions are often decoupled.
Comprehensive Sensitivity Analysis: A systematic investigation into how decoding parameters (temperature and top-p) influence bias expression, demonstrating that bias is not static but parameter-dependent.
Empirical Evidence from the Global South: Provides critical data on LLM fairness in underrepresented linguistic and cultural settings, challenging the assumption that Western-trained models generalize well to diverse societies.

4. Key Results

A. Explicit vs. Implicit Bias Divergence

Higher Implicit Bias: Models exhibited significantly higher rates of implicit generative bias (0.740–0.755) compared to explicit agreement bias (0.36–0.43).
Weak Correlation: Correlation analysis revealed that explicit agreement is a weak and often negative predictor of implicit completion bias. A model may explicitly reject a stereotype in a rating task but still generate stereotypical content in open-ended tasks. This highlights the insufficiency of single-metric evaluations.

B. Domain-Specific Findings

Race & Sociocultural Bias: Implicit completion bias was strongest for race and sociocultural (caste/religion) stereotypes, suggesting these biases are deeply embedded in pretraining corpora.
Gender Bias: Explicit agreement bias was comparably high across gender and sociocultural categories, while race showed the lowest explicit agreement rates.

C. Impact of Decoding Parameters

Temperature ( $T$ ):
- Explicit Bias: Increases linearly with temperature (Mean agreement rose from 0.36 at $T=0$ to 0.43 at $T=1.0$ ).
- Implicit Bias: Follows a non-linear, U-shaped relationship. Bias completion rates peaked at moderate stochasticity ( $T=0.3$ , rate $\approx 0.755$ ) and declined slightly at higher temperatures ( $T=1.0$ , rate $\approx 0.740$ ).
Top-p (Nucleus Sampling):
- Increasing Top-p amplified explicit bias (Mean agreement rose from 0.364 to 0.414).
- Implicit generative bias remained largely stable across Top-p variations, indicating that generative tendencies are robust to sampling constraints.

5. Significance and Conclusion

Methodological Advancement: The study proves that relying solely on explicit agreement metrics (like prompt-based ratings) fails to capture the full scope of LLM bias, particularly the "hidden" biases that emerge during generation. The DMBA framework offers a more holistic view.
Cultural Relevance: The findings underscore that models trained on Western data struggle with Nepali sociocultural nuances, particularly regarding caste and ethnic hierarchies, often reproducing harmful stereotypes even when explicitly prompted to avoid them.
Implications for AI Safety: The decoupling of explicit and implicit bias suggests that safety alignment strategies must address both stated beliefs and generative behaviors. Furthermore, the sensitivity of bias to decoding parameters implies that "safe" outputs in one configuration may become biased in another.
Future Directions: The authors call for the development of native-language datasets (beyond English prompts) and advanced debiasing techniques that incorporate specific cultural contexts, particularly for underrepresented regions in the Global South.

Limitations Noted: The study used English prompts for cross-model compatibility, which may underestimate biases present in native Nepali contexts. Additionally, the dataset is static and reflects norms at a specific point in time.