Imagine you have a super-smart robot librarian named CLIP. This robot has read millions of books and looked at billions of pictures on the internet. It's incredibly fast at matching words to images. If you ask it, "Show me a picture of a sad clown," it can find one instantly.
But here's the problem: We don't know how it sees things.
To us, the robot is a "black box." It gives an answer, but it doesn't explain why it thinks that image is a sad clown. Is it looking at the tears? The red nose? Or maybe it just thinks "sad clown" means "a guy in a hat"?
This paper is like a detective story where the author, Stefanie Schneider, tries to open that black box, specifically when the robot is looking at art.
The Mission: Making the Robot "Speak"
The author wants to see if we can use special tools (explainable-AI, or "XAI," methods) to draw a "heat map" over an artwork. This heat map is like a glowing spotlight that shows the robot exactly which part of the painting it is looking at when it thinks, "Yes, this is a snake."
The author tested seven different spotlight tools to see which one shines the light in the right place.
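To make the "spotlight" idea concrete: one of the simplest ways to build such a heat map is occlusion — cover one patch of the image at a time and watch how much the model's word-image score drops. If hiding a patch tanks the score, that patch mattered. This is not one of the seven tools tested in the paper; it is a minimal sketch of the general principle, with a made-up `toy_score` function standing in for a real CLIP model.

```python
import numpy as np

def occlusion_heatmap(image, score_fn, patch=4):
    """Build a saliency ("spotlight") map by covering each patch of the
    image and measuring how much the model's score drops."""
    h, w = image.shape
    base = score_fn(image)                        # score with nothing hidden
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            covered = image.copy()
            covered[i:i + patch, j:j + patch] = 0.0   # "gray out" one patch
            # A big score drop means this patch mattered to the model.
            heat[i // patch, j // patch] = base - score_fn(covered)
    return heat

# Toy stand-in for CLIP's image-text score: pretend the "snake" lives in
# the top-left corner, so the score is just that region's brightness.
def toy_score(img):
    return img[:4, :4].sum()

img = np.random.rand(8, 8)
heat = occlusion_heatmap(img, toy_score, patch=4)
print(heat)  # only the top-left cell glows; the rest stay at zero
```

Real XAI tools like Grad-CAM or CLIP Surgery are far faster and finer-grained than this brute-force sliding patch, but they answer the same question: which pixels move the score?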
The Two Experiments
Experiment 1: The Math Test (The "Scorecard")
First, the author took two huge databases of famous paintings (like a digital museum) and asked the robot to find specific things: a "snake," a "saint," or a "bridge."
- The Challenge: Art is tricky. A "snake" in a painting might be tiny, hidden in the shadows, or made of gold. It's not like a photo of a real snake.
- The Result: The author found that one tool, called CLIP Surgery, was the best at finding the right spot. It was like a surgeon with a steady hand, pinpointing the object accurately.
- The Catch: Even the best tool struggled with things that were small or very abstract. If the concept was "sadness" or a specific religious symbol that only art experts know, the robot got confused. It was like asking a robot to find "melancholy" in a painting; it doesn't know what that looks like, only what "sad faces" look like.
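How do you grade a spotlight with math? A common scorecard for saliency maps is the "pointing game": a heat map counts as a hit if its single brightest spot lands inside the ground-truth object region, and accuracy is hits over total images. The sketch below illustrates that idea; it is a generic metric, not necessarily the paper's exact evaluation protocol.

```python
import numpy as np

def pointing_game_hit(heatmap, mask):
    """A heat map 'hits' if its single brightest pixel falls inside the
    ground-truth object mask (True where the object is)."""
    i, j = np.unravel_index(heatmap.argmax(), heatmap.shape)
    return bool(mask[i, j])

# Ground truth: a tiny "snake" occupying the bottom-right 2x2 corner --
# small objects like this are exactly where the tools struggled.
mask = np.zeros((6, 6), dtype=bool)
mask[4:, 4:] = True

good = np.zeros((6, 6)); good[5, 5] = 1.0   # spotlight on the snake
bad = np.zeros((6, 6));  bad[0, 0] = 1.0    # spotlight in the wrong corner

print(pointing_game_hit(good, mask), pointing_game_hit(bad, mask))
# -> True False
```

Note what this metric cannot do: it checks *where* the light falls, not *why*. A tool can ace the pointing game for "bridge" and still have no grip on "melancholy," which has no mask to point at.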
Experiment 2: The Human Test (The "Art Class")
Next, the author didn't just look at math scores. She asked real art students and experts to look at the same paintings and the robot's "heat maps."
- The Question: "Does the robot's spotlight match where you think the important part of the painting is?"
- The Result:
  - When the object was clear (like a "bridge" or a "flower"), the humans and the robot agreed. The spotlight was right.
  - When the object was vague or symbolic (like "lustful" or a specific "saint"), the humans and the robot disagreed. The robot's spotlight would glow on the wrong part, or spread out everywhere like a fog.
- The Surprise: The humans generally liked the CLIP Surgery tool the most, but they also liked LeGrad and ScoreCAM. The older, simpler tools (like Grad-CAM) were often like a flashlight with a broken lens—too blurry to be useful.
The Big Takeaways (The "So What?")
- The Robot Has a Narrow "Education": The robot was trained on the internet, which is full of photos of cats, dogs, and cars. It hasn't really "studied" art history. When it looks at a 500-year-old painting, it's trying to fit a square peg (art history) into a round hole (internet photos). It sees the visuals, but it misses the story.
- Spotlights Can Be Deceptive: Just because the robot lights up a part of the painting doesn't mean it understands it. It might be lighting up the background because that's what it learned to associate with the word. It's like a parrot repeating a phrase; it sounds smart, but it doesn't know the meaning.
- Art is Hard for AI: Art is full of hidden meanings, symbols, and context. A "snake" in a painting might be the devil, or it might be a symbol of healing. The robot sees the shape of the snake, but it doesn't know the story behind it.
The Final Verdict
The paper concludes that these "spotlight" tools are helpful, but they aren't magic. They can show us where the robot is looking, but they can't tell us what the robot is thinking.
Think of it this way:
If you ask a tourist to describe a famous cathedral, they might say, "It has a big pointy roof." That's accurate, but they miss the stained glass, the history, and the feeling of awe. The robot is that tourist. The "spotlight" tools just show us exactly which part of the roof the tourist is staring at.
The lesson for us: We can use these tools to help us study art, but we must never forget that the robot is just a machine. It needs human experts to interpret the story, because the robot only sees the pixels, not the poetry.