Imagine you have a magical paintbrush that can turn any sentence you write into a beautiful picture. This is what Text-to-Image (T2I) models like DALL-E or Midjourney do. But here's the problem: if you ask this brush to paint "a family dinner in Iran," it might accidentally paint a scene that looks more like a generic movie set than a real Iranian home. It might get the food wrong, the seating arrangement wrong, or add things that don't belong there.
This paper, "Culture in Action," is about teaching these AI artists to understand the subtle, invisible rules of how different cultures actually live, interact, and celebrate.
Here is the breakdown of their work using simple analogies:
1. The Problem: The "Tourist" vs. The "Local"
Most previous tests for these AI models were like checking a tourist's photo album. They asked: "Does this picture have a red lantern?" or "Is there a taco?"
- The Issue: These tests only look at objects (static things). But culture isn't just about what objects are there; it's about how people use them.
- The Analogy: Imagine asking an AI to draw "eating dinner."
  - Object-focused test: Did it draw a plate? Yes. Good.
  - Real culture: In some countries, people sit on the floor around a cloth on the ground. In others, they sit at a tall table with chairs. In some, they eat with their hands; in others, with chopsticks.
- The Failure: Current AIs often get the "objects" right (a plate) but the "action" wrong (sitting at a table when they should be on the floor). They act like tourists who know the landmarks but don't understand the local customs.
2. The Solution: A New "Cultural Compass" (CULTIVate)
The authors built a new testing ground called CULTIVate.
- What it is: Instead of asking about objects, they asked the AI to paint 576 different social activities (like dancing, greeting, or celebrating) across 16 different countries.
- The Scale: They generated over 19,000 images and compared them against real photos and expert knowledge.
- The Goal: To see if the AI can capture the vibe and interaction of a culture, not just the props.
3. The New Scorecard: AHEaD
The biggest innovation is a new way to grade the AI, called AHEaD (Alignment, Hallucination, Exaggeration, and Diversity). Think of it as a four-part report card that replaces the old, confusing "Image-Text Alignment" scores (which were like a teacher grading a student just by checking if they used the right keywords).
AHEaD looks at four specific things (a code sketch follows this list):
- Alignment (The "Checklist"): Did the AI include the right cultural elements? (e.g., Did it show the right type of greeting?)
- Hallucination (The "Daydream"): Did the AI invent things that don't exist? (e.g., putting elephants in an Indonesian game that has nothing to do with elephants).
- Exaggeration (The "Hollywood Effect"): Did the AI overdo the stereotypes? (e.g., making every French person wear a beret and carry a baguette, or making every Indian scene look like a Bollywood movie).
- Diversity (The "Variety"): Did the AI show just one boring version of the culture, or did it show that people have different ways of doing things?
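To make the four parts concrete, here is a minimal Python sketch of how such a scorecard could be computed. This is illustrative only, not the paper's actual implementation: the function names and the set-based inputs (`expected`, `detected`, `invented`, `stereotyped`) are assumptions.

```python
def ahead_scorecard(expected, detected, invented, stereotyped):
    """Toy four-part scorecard for one generated image.

    All arguments are sets of scene elements; `expected` is the
    gold-standard checklist for the culture and activity.
    (Hypothetical names, not the paper's actual code.)
    """
    alignment = len(expected & detected) / max(len(expected), 1)   # checklist hits
    hallucination = len(invented) / max(len(detected), 1)          # invented share
    exaggeration = len(stereotyped) / max(len(detected), 1)        # cliche share
    return {"A": alignment, "H": hallucination, "E": exaggeration}

def diversity(batch_of_detected):
    """Variety across a batch of images for the same prompt:
    distinct elements seen vs. total element mentions."""
    mentions = [e for det in batch_of_detected for e in det]
    return len(set(mentions)) / max(len(mentions), 1)

# One stereotyped image: half the checklist hit, half the scene invented.
ahead_scorecard({"sofreh", "floor seating"}, {"sofreh", "elephant"},
                {"elephant"}, set())   # -> {"A": 0.5, "H": 0.5, "E": 0.0}
```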
The Magic Trick:
Old methods tried to guess if an image was "culturally correct" by comparing it to the text prompt. The authors realized this is like judging a book by its cover. Instead, they used a "Proposer-Refiner" system, sketched in code after the list below.
- Proposer: They asked a large language model (LLM) to brainstorm a list of what a real scene should look like (e.g., "people sitting on a sofreh," the traditional Iranian floor cloth).
- Refiner: Another AI checked that list to remove duplicates and errors, creating a "Gold Standard" checklist.
- The Grader: They then compared the AI's picture against this checklist to see exactly what was missing, what was made up, and what was exaggerated.
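Here is a rough sketch of that two-step pipeline, assuming a hypothetical `ask_llm(prompt)` helper that returns a list of short strings (the paper's actual prompts and models are not reproduced here):

```python
def build_gold_checklist(culture, activity, ask_llm):
    """Proposer-Refiner sketch: derive a gold-standard checklist of
    what a real scene should contain, independent of the T2I prompt.
    `ask_llm(prompt)` is a hypothetical helper returning a list of strings.
    """
    # Proposer: brainstorm what a real scene should look like.
    proposed = ask_llm(
        f"List the visual elements of a typical '{activity}' scene in "
        f"{culture}: people, objects, setting, and interactions."
    )
    # Refiner: clean the list, removing duplicates and errors.
    refined = ask_llm(
        "Remove duplicates and culturally inaccurate items from this "
        f"list, returning only what remains: {proposed}"
    )
    return set(refined)
```

The grader step then works like the `ahead_scorecard` sketch above: compare the elements detected in the image against this checklist to see what was missing, invented, and exaggerated.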
4. The Big Discoveries
When they ran the tests, they found some surprising things:
- The "North-South" Gap: The AI models were consistently better at drawing scenes from wealthy, Western countries (Global North) than from developing nations (Global South). It's like the AI has a "Western bias" in its brain.
- Old Scores Lie: The old way of grading (checking how well the image matches the text) was terrible. It actually rewarded the AI for being stereotypical! If the AI drew extra elephants for an Indonesian game, the old score went up, because exaggerated, stereotypical imagery looks even more like what the grader associates with the prompt's cultural keywords. The new AHEaD score correctly penalized this.
- The "FAITH" Formula: They found that to get a true score of cultural faithfulness, you can't just look at what the AI got right. You have to subtract what it got wrong (Hallucinations) and what it over-did (Exaggeration). When they combined these, the score matched human judgment much better.
5. Why This Matters
This isn't just about grading AI. It's about fixing it.
Because AHEaD gives specific feedback (e.g., "You hallucinated elephants," or "You exaggerated the size of the flags"), developers can use it to edit the images; a small sketch of that feedback loop follows the analogy below.
- The Analogy: Instead of just telling a student "You got a C," AHEaD says, "You forgot the salt, you added too much pepper, and you drew a cat instead of a dog. Here is the recipe to fix it."
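The paper's editing pipeline isn't reproduced here, but as a minimal sketch of the idea, AHEaD-style findings could be mapped to instructions for an image-editing model (the dict layout and wording are assumptions):

```python
def feedback_to_edits(findings):
    """Turn interpretable AHEaD-style findings into edit instructions.
    `findings` is a hypothetical dict, e.g.:
      {"missing": ["sofreh"], "hallucinated": ["elephants"],
       "exaggerated": ["oversized flags"]}
    """
    edits = []
    edits += [f"Add the missing element: {e}" for e in findings["missing"]]
    edits += [f"Remove the invented element: {e}" for e in findings["hallucinated"]]
    edits += [f"Tone down the overdone element: {e}" for e in findings["exaggerated"]]
    return edits
```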
Summary
The authors built a new way to test if AI understands human culture. They found that current AIs are like clumsy tourists who get the landmarks right but miss the local customs. Their new tool, AHEaD, acts like a cultural detective, spotting exactly where the AI is lying (hallucinating) or over-dramatizing (exaggerating), helping us build AI that respects and accurately portrays the rich diversity of human life.