Imagine you have a group of very smart, well-read robots (Large Language Models, or LLMs) that have read almost everything on the internet. You might think they know everything about everyone. But, as this paper reveals, these robots have a blind spot: they know the "Global North" (like the US and Europe) very well, but they are often clueless about Latin America.
Here is a simple breakdown of what the researchers did, why it matters, and what they found, using some everyday analogies.
1. The Problem: The "Tourist Guide" vs. The "Local"
Think of these AI models as tourist guides. Most of them were trained on books and websites written in English or by people in North America and Europe.
- The Issue: If you ask a tourist guide about the local food in a small village in Chile or the slang used in a specific neighborhood in Mexico, they might guess wrong or give you a generic answer that sounds like it belongs in Spain, not Latin America.
- The Gap: There wasn't a good "test" to see how well these robots actually understood the specific cultures, traditions, and inside jokes of Latin American countries. Existing tests were either too broad (grouping all of Latin America into one big bucket) or didn't exist in Spanish or Portuguese.
2. The Solution: Building "LatamQA" (The Cultural Trivia Night)
To fix this, the researchers built a massive Cultural Trivia Night called LatamQA.
- Where did the questions come from? They didn't hire a thousand people to write questions from scratch (which is slow and expensive). Instead, they used Wikipedia like a giant library.
- The Recipe:
- The Library: They went into the "Culture" section of Wikipedia for 20 different Latin American countries.
- The Filter: They used a sociologist (a human expert on society) to act as a bouncer, making sure they only picked articles about real culture (like food, festivals, local slang, and famous characters) and not boring lists like "all the players on a soccer team."
- The Chef: They used an AI to cook up 23,000 multiple-choice questions based on those articles.
- The Translation: They made sure the questions were in Spanish, Portuguese, and English so they could test the robots in their "native" languages and in translation.
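The recipe above (library → filter → chef → translation) is really a small data pipeline. Here is a minimal, runnable sketch of its shape in Python. Everything in it is invented for illustration: the article titles, the category keywords standing in for the sociologist's curation, and the placeholder question writer are not the authors' actual code or data.

```python
from dataclasses import dataclass

# Hypothetical mini-pipeline mirroring the steps described above.
# Titles, categories, and question text are invented examples,
# not drawn from the real LatamQA dataset.

@dataclass
class Article:
    title: str
    country: str
    category: str

# Assumed stand-in for the sociologist's curation criteria (the "bouncer").
CULTURAL_CATEGORIES = {"food", "festivals", "slang", "folklore"}

def filter_articles(articles):
    """Keep only articles in culturally relevant categories; drop rote lists."""
    return [a for a in articles if a.category in CULTURAL_CATEGORIES]

def make_mcq(article):
    """Placeholder for the LLM 'chef' that writes a question from an article."""
    return {
        "question": f"Which country is associated with {article.title}?",
        "options": ["Chile", "Mexico", "Brazil", "Peru"],
        "answer": article.country,
        "languages": ["es", "pt", "en"],  # each question exists in all three
    }

articles = [
    Article("Fiesta de la Tirana", "Chile", "festivals"),
    Article("Club roster 2019", "Chile", "sports_lists"),  # rejected by the filter
]
questions = [make_mcq(a) for a in filter_articles(articles)]
print(len(questions))  # 1
```

In the real paper the "chef" is an LLM prompted with the article text and the scale is 23,000 questions across 20 countries; this sketch only shows how the filter-then-generate stages fit together.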
3. The Experiment: Testing the Robots
The researchers put various AI models (from small ones to huge, super-smart ones) through this trivia test. It was like a school exam where the subject is "Latin American Culture."
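Grading such an exam boils down to comparing each model's picked option against the gold answer and breaking accuracy down by question language. A toy scoring loop, with all question IDs, languages, and answers invented for illustration:

```python
from collections import defaultdict

# Toy scoring loop for the trivia exam: compare model picks to gold
# answers, broken down by question language. All data is invented.

gold = [
    {"id": 1, "lang": "es", "answer": "B"},
    {"id": 2, "lang": "es", "answer": "C"},
    {"id": 3, "lang": "en", "answer": "A"},
    {"id": 4, "lang": "pt", "answer": "D"},
]
model_picks = {1: "B", 2: "C", 3: "D", 4: "D"}  # the model misses the English item

def accuracy_by_language(gold, picks):
    correct, total = defaultdict(int), defaultdict(int)
    for q in gold:
        total[q["lang"]] += 1
        if picks.get(q["id"]) == q["answer"]:
            correct[q["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

print(accuracy_by_language(gold, model_picks))
# {'es': 1.0, 'en': 0.0, 'pt': 1.0}
```

A per-language breakdown like this is exactly what makes the "home court advantage" finding below visible: the same cultural fact can score differently depending on which language it is asked in.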
4. The Results: What the Robots Got Wrong
The test revealed three surprising things:
- The "Home Court" Advantage: The robots did much better when the questions were asked in the region's own language (Spanish or Portuguese) than when the same questions were asked in English. It's like how you might understand a joke better in your own language than in a translation.
- The "Spain vs. Latin America" Bias: Even though Spain and Latin America speak similar languages, the robots knew Spain much better.
- Analogy: Imagine a student who studied hard for a test on "British History" but barely opened the book on "American History." Even though they are both English-speaking, the student knows the British stuff perfectly and gets the American stuff wrong. The robots treat Latin American culture as if it were just a "distant cousin" of Spanish culture, rather than its own unique thing.
- Size Matters (But Not Always): Bigger, smarter robots generally got more questions right. However, even the biggest robots struggled with specific, deep cultural details (like obscure dialects or fictional characters from local novels).
5. Why This Matters
This paper is like holding up a mirror to the AI world.
- It shows that if we want AI to be fair and useful for everyone, we can't just train it on data from the US and Europe.
- It proves that "one size fits all" doesn't work for culture. A robot that knows everything about New York might know nothing about the Fiesta de la Tirana in Chile or the specific slang of a flaite in Santiago.
- The Takeaway: To build truly smart AI, we need to build better "libraries" (datasets) that respect and include the specific, rich, and diverse cultures of the Global South, not just the North.
In short: The researchers built a giant, culturally specific quiz to show that our current AI is a bit of a "cultural tourist"—it knows the big landmarks, but it misses the local flavor. Now that they have the quiz, we can finally teach the robots to be true locals.