The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

This paper introduces the Cultural Reference Transformation (CRT) metric to evaluate how diffusion models navigate the tension between memorization and generalization in culturally iconic contexts. It reveals that model behavior depends on two distinct mechanisms, recognition and realization, which are shaped by factors such as data frequency, textual uniqueness, and reference popularity.

Maria-Teresa De Rosa Palmini, Eva Cetinic

Published 2026-03-09

Imagine a super-smart artist who has memorized every painting, movie poster, and album cover ever made. You ask them to draw Salvador Dalí's "The Persistence of Memory" (the one with the melting clocks).

If this artist just copies the original painting pixel-for-pixel, they are memorizing.
If they draw a clock melting on a tree in a forest, they are generalizing (using the idea but making something new).

But what if they draw a clock melting on a tree, but it looks exactly like the famous painting? That's the tricky middle ground. This is what the paper calls "Multimodal Iconicity." It's when a simple phrase (like a movie title) instantly triggers a specific, famous image in our collective cultural brain.

The authors of this paper are worried that current ways of testing AI are too simple. They ask: Is the AI just cheating by copying, or is it actually understanding the culture and reimagining it?

Here is the breakdown of their study using some everyday analogies:

1. The Problem: The "Copy vs. Create" Confusion

Think of the AI like a student taking a test.

  • Old Test: The teacher asks, "Is this answer the same as the textbook?" If it's even 90% similar, the teacher marks it as "cheating" (memorization).
  • The Reality: Sometimes, the answer should look like the textbook because it's a famous cultural reference. If you ask for "The Godfather," you expect specific imagery: a man in a suit, the infamous horse's head, the film's distinctive low-key lighting. If the AI draws a generic mobster, it failed to understand the culture. If it draws the exact movie poster, it might be stealing.

The old tests couldn't tell the difference between a student who understood the concept and one who just photocopied the answer key.

2. The Solution: The "Cultural Reference Transformation" (CRT)

The authors invented a new grading system called CRT. Instead of just asking "Is it a copy?", they ask two questions:

  • Question A: Recognition (Did you get the joke?)
    • Analogy: If I say "The Dark Side of the Moon," do you draw a literal moon in space, or do you draw a prism with a rainbow (the famous album cover)?
    • If you draw the prism, you passed the Recognition test. You "got" the cultural reference.
  • Question B: Realization (Did you make it your own?)
    • Analogy: Did you trace the album cover, or did you paint a new prism with a rainbow using your own style?
    • If you traced it, you failed the Realization test (too much copying). If you painted a new version, you passed.

The CRT Score is the final grade. The best AI is one that gets the joke (Recognition) but paints a new picture (Realization).
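One way to picture this grading scheme in code (a toy sketch of my own, not the paper's actual formula): suppose we already have two numbers per generated image, a semantic similarity to the iconic reference (for Recognition) and a copy similarity to the original pixels (for Realization). A harmonic mean then forces the model to pass both tests to score well.

```python
# Toy CRT-style score. This is an illustrative formulation, NOT the
# paper's formula; in practice the two inputs would come from real
# image comparisons (e.g. embedding similarity for recognition,
# perceptual similarity for copy detection), mocked here as numbers.

def crt_score(semantic_sim: float, copy_sim: float) -> float:
    """Combine recognition and realization into one grade.

    semantic_sim: closeness to the iconic reference in meaning (0..1);
                  high means the model "got the joke".
    copy_sim:     closeness to the original pixels (0..1);
                  high means it "traced the answer key".
    """
    recognition = semantic_sim
    realization = 1.0 - copy_sim
    if recognition + realization == 0:
        return 0.0
    # Harmonic mean: a failure on either test drags the grade down.
    return 2 * recognition * realization / (recognition + realization)

# The three archetypes from the results section:
photocopier = crt_score(semantic_sim=0.95, copy_sim=0.98)  # regurgitates
abstract    = crt_score(semantic_sim=0.20, copy_sim=0.05)  # misses the reference
balanced    = crt_score(semantic_sim=0.90, copy_sim=0.20)  # the sweet spot

print(f"photocopier={photocopier:.2f} abstract={abstract:.2f} balanced={balanced:.2f}")
```

Notice how the harmonic mean encodes the thesis: the photocopier scores worst of all, because perfect recognition cannot rescue zero realization.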

3. The Experiment: Testing the Artists

The researchers tested five different AI models (like Stable Diffusion, Flux, and Imagen) on 767 famous cultural references (movies, paintings, albums).

  • The Results:
    • The "Photocopiers": Some models were great at recognizing the reference but terrible at changing it. They just regurgitated the original image.
    • The "Abstract Artists": Some models tried to be too creative and missed the reference entirely (e.g., drawing a literal moon instead of the prism).
    • The "Balanced Masters": A few models (like Imagen 4 and SD3) found the sweet spot. They knew exactly what "The Godfather" looked like, but they drew it in a fresh, unique way every time.

4. The Twist: It's Not Just About Memory

The researchers also tested what happens if you change the words slightly.

  • Instead of "The Scream," they asked for "The Shriek."
  • Instead of a title, they gave a literal description: "A man screaming on a bridge."

The Surprise: Even when the words changed, the AI still often drew the famous image. This proves the AI isn't just matching words to pictures; it has built a deep "cultural map" in its brain. It knows that "Scream" and "Shriek" both point to that specific painting.

However, they also found that how unique the title is matters. If a title is very common (like "The Kiss"), the AI gets confused and might not draw the famous painting. If the title is unique and specific, the AI locks onto the image perfectly.
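The uniqueness effect can be pictured with a toy "recognition margin" (again my own construction, not the paper's method): treat each famous image as an embedding vector, and measure the gap between the best and second-best match for a generation. A unique title locks onto one reference and leaves a wide margin; a shared title like "The Kiss" pulls the generation between candidates and the margin collapses. All vectors below are random stand-ins for real image embeddings.

```python
# Toy illustration of why title uniqueness matters. Random vectors
# stand in for image embeddings of famous works; nothing here is real
# model output.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognition_margin(gen, candidates):
    """Gap between the best and second-best candidate similarity."""
    sims = sorted((cosine(gen, c) for c in candidates), reverse=True)
    return sims[0] - sims[1]

rng = np.random.default_rng(0)
scream = rng.normal(size=64)   # the single iconic "Scream" image
klimt  = rng.normal(size=64)   # Klimt's "The Kiss"
rodin  = rng.normal(size=64)   # Rodin's "The Kiss"

# Unique title: the generation lands squarely on one reference.
gen_unique = scream + 0.1 * rng.normal(size=64)
# Ambiguous title: the generation is pulled between two references.
gen_ambiguous = 0.5 * (klimt + rodin)

print(recognition_margin(gen_unique, [scream, klimt, rodin]))
print(recognition_margin(gen_ambiguous, [scream, klimt, rodin]))
```

The unique-title case yields a margin near 1 (one clear winner), while the ambiguous case yields a margin near 0, which is the toy version of the AI "getting confused" by a common title.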

5. Why This Matters

This paper is a wake-up call for how we judge AI.

  • Bad View: "AI is dangerous because it copies art."
  • New View: "AI is a cultural participant. It learns our shared stories and can retell them in new ways."

The authors argue that we shouldn't just try to make AI "forget" everything to avoid copyright issues. Instead, we need to understand how it remembers. We want an AI that can look at a cultural icon and say, "I know this story, and here is my own version of it," rather than one that just says, "Here is a photocopy of the original."

In short: The paper teaches us that being "smart" for an AI isn't just about not copying; it's about knowing the culture well enough to remix it creatively.