Imagine you have a very smart robot that can look at a picture and tell you exactly what it sees. If you show it a photo of a cat, it says, "That's a cat." If you show it a sunset, it says, "That's orange and blue." This is what current AI models are great at: seeing the surface.
But what happens if you show that same robot a 500-year-old Chinese painting of a plum blossom?
- The Robot says: "I see a tree with flowers. It's painted with black ink."
- A Human Expert says: "This isn't just a tree. In Chinese culture, the plum blossom represents resilience because it blooms in winter. The way the ink is washed suggests the artist's spirit, and the empty space around it represents the concept of 'void' in philosophy."
The robot sees the object, but it misses the soul of the artwork.
This paper introduces VULCA-BENCH, a new "test" designed to see if AI can understand that soul. Here is how it works, explained simply:
1. The Problem: AI is "Superficially Smart"
Current AI tests are like a driver's license exam where you only have to recognize stop signs and traffic lights. They don't test if you understand why we drive, the history of the roads, or the philosophy of safety.
- Existing AI: Good at saying, "That is a red flower."
- Missing AI: Bad at saying, "That red flower symbolizes the struggle of life in a harsh winter."
2. The Solution: A 5-Story Building of Understanding
The authors built a new benchmark called VULCA-BENCH (Vision-Understanding-Language-Culture Assessment). They imagine cultural understanding as a 5-story building:
- Floor 1 (Visual Perception): "I see a red flower." (Easy for AI)
- Floor 2 (Technical Analysis): "It's painted with ink on paper." (AI is okay at this)
- Floor 3 (Cultural Symbolism): "In this culture, red flowers mean luck." (AI starts to struggle)
- Floor 4 (Historical Context): "This was painted 300 years ago by a specific artist who was exiled." (AI gets confused)
- Floor 5 (Philosophical Aesthetics): "This painting expresses the idea that beauty comes from imperfection and the passage of time." (AI usually fails completely)
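The five floors above can be pictured as a simple rubric. Below is a hypothetical sketch in Python: the level names follow the list above, but the guiding questions and the function are illustrative, not the paper's actual schema.

```python
# Hypothetical sketch of VULCA-BENCH's five-level hierarchy.
# Level names come from the paper's "building"; the questions are illustrative.

LEVELS = [
    (1, "Visual Perception",        "What objects and colors are depicted?"),
    (2, "Technical Analysis",       "What medium and technique were used?"),
    (3, "Cultural Symbolism",       "What do the motifs mean in this culture?"),
    (4, "Historical Context",       "Who made it, when, and under what circumstances?"),
    (5, "Philosophical Aesthetics", "What worldview or aesthetic ideal does it express?"),
]

def describe_levels():
    """Return one line per level, ordered from surface to depth."""
    return [f"Floor {n}: {name} -- {question}" for n, name, question in LEVELS]

for line in describe_levels():
    print(line)
```

The ordering matters: each floor presupposes the ones below it, which is why a model can ace floors 1–2 and still collapse on floor 3.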
The Big Discovery: When they tested top AI models on this "building," the models were great on the first two floors but fell off the building on the third floor. They can describe the what, but they can't explain the why.
3. The Test: A Global Art Gallery
To test this fairly, the researchers didn't just look at Western art (like paintings in the Louvre). They built a global gallery with 8 different cultural traditions:
- Chinese, Western, Japanese, Korean, Islamic, Indian, Mural (Buddhist), and Russian/European.
They collected 7,410 pairs of images and expert critiques.
- The "Symmetry" Rule: To make sure the test wasn't biased, they treated every culture with the same respect. Even if there were fewer Japanese paintings than Western ones, they analyzed them using the same deep, 5-step method. It's like grading a math test where everyone gets the same hard questions, regardless of how many students are in the class.
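The symmetry rule can be sketched as a loop that assigns one fixed analysis plan to every tradition, no matter how many works each contributes. In this sketch only the eight tradition names and the five-step idea come from the paper; the per-tradition counts and the helper function are invented for illustration.

```python
# Sketch of the "symmetry" principle: every cultural tradition gets the
# identical five analysis steps, regardless of how many works it contributes.
# Counts and the analysis_plan helper are hypothetical.

TRADITIONS = ["Chinese", "Western", "Japanese", "Korean",
              "Islamic", "Indian", "Mural (Buddhist)", "Russian/European"]
STEPS = ["visual", "technical", "symbolic", "historical", "philosophical"]

def analysis_plan(works_per_tradition):
    """Assign the same five-step analysis to every tradition."""
    return {tradition: list(STEPS) for tradition in works_per_tradition}

# Unequal sample sizes, identical treatment:
plan = analysis_plan({t: 100 if t == "Western" else 40 for t in TRADITIONS})
assert all(steps == STEPS for steps in plan.values())
```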
4. The Bilingual Twist
The test uses Chinese and English side-by-side.
- Why? Some concepts, like the Chinese idea of Qi Yun (spirit resonance), are hard to translate perfectly into English. By keeping the original Chinese terms alongside English explanations, the test ensures the AI isn't just guessing; it has to grasp the specific cultural flavor.
5. What Happened When They Tested the AI?
They ran 5 of the smartest AI models through this test.
- The Result: The AI models got high scores on the "Visual" floors (1 and 2), but their scores dropped by 30–40% on the "Deep Culture" floors (3, 4, and 5).
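That drop can be made concrete with a toy calculation. The per-level scores below are invented for illustration; only the pattern (strong on floors 1–2, a 30–40% relative drop on floors 3–5) comes from the paper.

```python
# Toy illustration of the reported pattern. The per-level scores are
# invented; only the 30-40% perceptual-to-cultural drop reflects the paper.

scores = {1: 0.85, 2: 0.82, 3: 0.55, 4: 0.52, 5: 0.48}

perceptual = sum(scores[lvl] for lvl in (1, 2)) / 2        # floors 1-2
cultural   = sum(scores[lvl] for lvl in (3, 4, 5)) / 3     # floors 3-5
drop_pct   = (perceptual - cultural) / perceptual * 100    # relative drop

print(f"Perceptual average: {perceptual:.2f}")
print(f"Cultural average:   {cultural:.2f}")
print(f"Relative drop:      {drop_pct:.0f}%")
```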
- The Mistakes: The AI made funny but telling errors:
  - Fake Expertise: It would use fancy words like "spirit resonance" but couldn't explain what it actually looked like in the painting.
  - Time Travel Errors: It would describe a 16th-century painting using rules from the 19th century.
  - Mixing Cultures: It would confuse a Persian painting with an Indian one because they looked "kind of similar" to the robot's eyes.
Why Does This Matter?
Think of AI as a new student in a global classroom. Right now, that student is very good at memorizing facts and recognizing objects, but terrible at understanding culture, history, and human emotion.
VULCA-BENCH is like a new report card that tells us: "Hey, your AI is smart, but it's culturally illiterate."
This is a crucial step forward. If we want AI to help us in museums, schools, or cross-cultural communication, it needs to stop just "seeing" pictures and start "understanding" the stories behind them. This paper gives us the tool to measure that progress and shows us exactly where the robots need to study harder.