GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models

The paper introduces GeoDiv, a framework that uses large language and vision-language models to systematically measure geographical diversity in text-to-image generation. It reveals significant geographical biases and socio-economic stereotypes, showing that current models disproportionately portray countries like India, Nigeria, and Colombia as impoverished.

Abhipsa Basu, Mohana Singh, Shashank Agnihotri, Margret Keuper, R. Venkatesh Babu

Published 2026-02-26

Imagine you have a magical camera that can take a picture of anything you describe. You say, "Take a photo of a house in Nigeria," and snap! It creates one. You say, "Take a photo of a house in Japan," and snap! It creates another.

This is how Text-to-Image (T2I) models work today. They are incredibly popular, but there's a problem: they are biased.

If you ask this magical camera to show you a house in Nigeria, it might always show you a crumbling, dusty shack. If you ask for a house in Japan, it might always show you a pristine, futuristic apartment. It's as if the camera has a broken lens that only sees the world through a very narrow, stereotypical filter. It forgets that Nigeria has modern skyscrapers and Japan has old, rustic villages, too.

This paper introduces a new tool called GeoDiv (Geographical Diversity) to fix this. Think of GeoDiv as a "World-Check Inspector."

The Two Main Tools in the Inspector's Kit

The researchers built GeoDiv to measure diversity in two specific ways, like checking a car for both its engine and its paint job.

1. The "Socio-Economic Visual Index" (SEVI) – The "Wealth & Condition" Check

This part of the inspector looks at the vibe of the image. It asks two big questions:

  • Affluence: Does this look rich or poor? (Is it a mansion or a shack?)
  • Maintenance: Does this look brand new and well-cared-for, or is it broken and worn out?

The Analogy: Imagine you are judging a neighborhood.

  • The Bias: If the camera always shows Nigeria as a "broken-down, poor neighborhood" and the USA as a "perfect, shiny suburb," the SEVI score will be terrible. It reveals that the AI is reinforcing the stereotype that some countries are always poor and others are always rich.
  • The Finding: The paper found that models like FLUX.1 are great at making things look "shiny and rich" (high maintenance), but they make every country look the same. Meanwhile, older models often make developing countries look "broken and poor" by default.
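To make the "Wealth & Condition" check concrete, here is a minimal sketch of how per-country affluence ratings could be summarized. This is illustrative only, not the paper's actual SEVI formula: the 1-5 rating scale, the `sevi_sketch` function, and the sample ratings are all assumptions made for the example. The intuition is that a country whose images cluster at one extreme with little spread is being portrayed stereotypically.

```python
from statistics import mean, stdev

def sevi_sketch(scores):
    """Illustrative stand-in for a SEVI-style check (not the paper's formula).

    `scores` maps a country to per-image affluence ratings on a 1-5 scale,
    as a VLM judge might assign them. Low spread at one extreme suggests
    a stereotyped portrayal."""
    report = {}
    for country, ratings in scores.items():
        report[country] = {
            "mean_affluence": mean(ratings),
            "spread": stdev(ratings) if len(ratings) > 1 else 0.0,
        }
    return report

# Hypothetical ratings for a handful of generated images per country.
ratings = {
    "Nigeria": [1, 1, 2, 1, 2],   # clustered low: the "poor country" trap
    "Japan":   [5, 4, 5, 5, 4],   # clustered high: the "rich country" filter
    "Brazil":  [1, 3, 5, 2, 4],   # spread out: more realistic variety
}
for country, stats in sevi_sketch(ratings).items():
    print(country, stats)
```

A real evaluation would feed thousands of VLM-judged images per country into a summary like this, rather than five hand-picked numbers.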

2. The "Visual Diversity Index" (VDI) – The "Variety" Check

This part of the inspector looks at the details. It asks: "Are all the houses the same color? Are all the roads the same type?"

  • Entity Appearance: What does the object look like? (Is the car a red sedan or a blue truck? Is the house made of brick or mud?)
  • Background Appearance: What's around it? (Is the road paved with asphalt, or is it a dirt path? Are there mountains or just flat fields?)

The Analogy: Imagine a box of crayons.

  • Low Diversity: If you ask for "a car in 10 different countries" and the AI gives you 10 identical red sedans on a paved road, that's like having a box with only one red crayon. It's boring and fake.
  • High Diversity: A good AI should give you a red sedan in the US, a tuk-tuk in India, a dirt bike in Kenya, and a vintage car in Italy. That's a full box of crayons!
  • The Finding: The paper found that while newer AI models are getting better at making things look "real," they are actually getting worse at showing variety. They are becoming too uniform.
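The "box of crayons" idea can be expressed as a standard diversity measure. The sketch below uses normalized Shannon entropy over attribute values extracted from generated images; this is a common way to quantify variety and is used here purely as an illustration, since the paper's exact VDI computation may differ.

```python
import math
from collections import Counter

def diversity_index(attributes):
    """Normalized Shannon entropy over observed attribute values.

    Returns 0.0 when every image shows the same attribute value (one red
    crayon) and 1.0 when values are spread evenly across all observed
    categories (a full box). Illustrative only; not the paper's exact VDI."""
    counts = Counter(attributes)
    if len(counts) <= 1:
        return 0.0
    total = len(attributes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by max possible entropy

# Ten generated "car" images, described by an attribute a VLM might extract.
uniform = ["red sedan"] * 10                      # the one-crayon box
varied  = ["red sedan", "tuk-tuk", "dirt bike",
           "vintage car", "blue truck"] * 2       # a fuller box of crayons
print(diversity_index(uniform))  # -> 0.0
print(diversity_index(varied))   # -> 1.0
```

The same index applies to background attributes (paved vs. dirt roads, mountains vs. plains) just by swapping in a different list of extracted values.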

How Does GeoDiv Actually Work?

Instead of a human looking at 160,000 pictures (which would take forever), GeoDiv uses AI assistants (Large Language Models and Vision-Language Models) to do the heavy lifting.

  1. The Interviewer: The AI acts like a reporter. It looks at a picture of a house in Nigeria and asks, "Is the roof flat or sloped? Is the road dirt or paved? Does this look wealthy or poor?"
  2. The Scorekeeper: It counts the answers. If 90% of the houses in Nigeria have dirt roads and 90% of the houses in the UK have paved roads, the AI knows there is a bias.
  3. The Report Card: It gives the AI model a score. A high score means the AI shows the world as it really is (diverse and varied). A low score means the AI is stuck in a stereotype.
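The three steps above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: `ask_vlm` is a stub standing in for a real vision-language model call, and the images are plain dictionaries of attributes so the example stays runnable.

```python
def ask_vlm(image, question):
    """Stub for the 'Interviewer' step. A real system would send the actual
    image and question to a vision-language model; here each image is just
    a dict of attributes, so the answer is a simple lookup."""
    return image.get(question, "unknown")

def report_card(images_by_country, question, stereotyped_answer):
    """The 'Scorekeeper' and 'Report Card' steps: count how often each
    country's images match the stereotyped answer. A fraction near 1.0
    flags a likely bias; lower values suggest more variety."""
    scores = {}
    for country, images in images_by_country.items():
        answers = [ask_vlm(img, question) for img in images]
        hits = sum(a == stereotyped_answer for a in answers)
        scores[country] = hits / len(answers)
    return scores

# Hypothetical attribute readouts for a few generated images.
images = {
    "Nigeria": [{"road": "dirt"}, {"road": "dirt"}, {"road": "paved"}],
    "UK":      [{"road": "paved"}, {"road": "paved"}, {"road": "paved"}],
}
print(report_card(images, "road", "dirt"))
```

In the actual framework the judging is done over tens of thousands of generated images across many attributes, but the shape of the computation (ask, count, score) is the same.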

What Did They Discover?

The "World-Check Inspector" found some shocking things:

  • The "Poor Country" Trap: When asked to generate images of countries like India, Nigeria, and Colombia, the AI almost always made them look impoverished and dilapidated. It rarely showed them as modern or wealthy.
  • The "Rich Country" Filter: When asked for USA, UK, or Japan, the AI almost always made them look affluent, clean, and perfect.
  • The "One-Size-Fits-All" Problem: Newer models (like FLUX.1) are so good at making things look "pretty" that they make every country look like a wealthy Western suburb. They lost the unique cultural flavors of different places.

Why Does This Matter?

If we let these AI models keep making these biased pictures, they will start to shape how we see the world. If an AI always shows Nigeria as a place of poverty, people might start to believe that's all there is to Nigeria.

GeoDiv is the first tool that gives us a "report card" for these models. It doesn't just say "this looks bad"; it tells us exactly where the bias is (e.g., "You are making all Nigerian roads look like dirt").

The Bottom Line

This paper is like a mirror held up to Artificial Intelligence. It shows us that while AI is amazing at creating art, it is currently a very bad traveler. It only knows the stereotypes.

GeoDiv is the compass that helps developers fix the map, ensuring that when we ask AI to show us the world, it shows us the real world—messy, diverse, beautiful, and full of surprises, not just a collection of stereotypes.
