Imagine you have a magical, super-smart artist named "AI." You can tell this artist, "Draw me a scene of people cooking in the 18th century," or "Show me a group of people working in the 1950s." The artist is incredibly talented and can create beautiful pictures in seconds.
But here's the catch: Does this artist actually know history, or are they just guessing based on old movies and stereotypes?
This paper, titled "Synthetic History," is like a report card for these AI artists. The researchers wanted to see if AI models (specifically "Diffusion Models" like Stable Diffusion and FLUX) can accurately depict the past, or if they are just making up a "fake history" that looks cool but is factually wrong.
To test this, they created a massive "History Exam" called HistVis.
The Setup: The AI History Exam
The researchers didn't ask the AI to draw specific famous people (like Napoleon) or specific events (like the moon landing). Instead, they asked it to draw universal human activities (like "eating," "dancing," "working," or "praying") across 10 different time periods (from the 1600s to today).
They asked three different AI artists to draw each scene 10 times, producing 30,000 images in total. They then graded these images on three specific subjects:
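The headline number follows from simple combinatorics: 30,000 images divided by 3 models, 10 time periods, and 10 generations per prompt implies 100 activities. Here is a minimal sketch of how such a prompt grid could be built; the activity and period strings below are illustrative placeholders, not the benchmark's actual wording.

```python
from itertools import product

# Placeholder lists -- the real benchmark's exact activities, period labels,
# and prompt wording may differ. The counts are what matter here.
activities = [f"activity_{i}" for i in range(100)]  # 30,000 / (3 * 10 * 10) = 100
periods = ["17th century", "18th century", "19th century", "1900s", "1910s",
           "1930s", "1950s", "1970s", "1990s", "2020s"]
models = ["model_a", "model_b", "model_c"]  # hypothetical model names
images_per_prompt = 10

# One text prompt per (activity, period) pair.
prompts = [f"A person {a} in the {p}" for a, p in product(activities, periods)]

# Each model renders every prompt ten times.
total_images = len(prompts) * len(models) * images_per_prompt
print(total_images)  # 100 * 10 * 3 * 10 = 30000
```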
1. The "Style" Test: Does the AI think the past is black and white?
The Metaphor: Imagine if you asked a photographer to take a picture of a 17th-century village. If they automatically gave you a black-and-white sketch or an old-fashioned engraving, even though you didn't ask for that, they are relying on a stereotype.
The Finding: The AI artists have a "default setting" for every era.
- For the 17th and 18th centuries, the models almost always drew scenes as old engravings or paintings, even when explicitly asked for a realistic photo.
- For the 19th century, they switched to drawings.
- For the 20th century, they finally started producing photographs.
- The Problem: The AI isn't thinking about what the world actually looked like; it's just following a visual rulebook it learned from its training data. It assumes the past must look like an old painting.
2. The "Time Travel" Test: Did the AI bring a smartphone to the 1700s?
The Metaphor: This is the "Back to the Future" test. If you see Marty McFly wearing his sneakers in 1885, that's a mistake. The researchers looked for anachronisms—objects that didn't exist in that time period.
The Finding: The AI is a terrible time traveler.
- They found modern headphones in 18th-century music scenes.
- They found vacuum cleaners in 19th-century homes.
- They found smartphones in 1950s photos.
- The Problem: The AI focuses too much on the activity (e.g., "listening to music") and forgets the time (e.g., "1700s"). It thinks, "Music = Headphones," ignoring that headphones didn't exist yet. One model (SD3) was particularly bad at this, putting modern gadgets in almost 25% of its 1930s images.
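The anachronism check boils down to a date comparison: does any object in the image postdate the era the prompt asked for? A toy sketch of that logic (the invention years are rough, illustrative assumptions, and the paper's actual detection pipeline is not shown here):

```python
# Approximate, illustrative invention years -- assumptions for this sketch,
# not values taken from the paper.
INVENTION_YEAR = {
    "violin": 1550,
    "vacuum cleaner": 1901,
    "headphones": 1910,
    "smartphone": 2007,
}

def anachronisms(detected_objects, scene_year):
    """Return the objects that did not yet exist in the depicted year.

    Objects missing from the table are assumed safe (year 0).
    """
    return [obj for obj in detected_objects
            if INVENTION_YEAR.get(obj, 0) > scene_year]

# An 18th-century music scene containing a violin and headphones:
print(anachronisms(["violin", "headphones"], 1750))  # ['headphones']
```

In practice the hard part is the first step, reliably listing what objects appear in a generated image; the date comparison itself is trivial.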
3. The "Who Was There?" Test: Are the people realistic?
The Metaphor: Imagine a school play about the 17th century. If the director casts only white men for every single role, even in a scene where women and people of other races were historically present, the play feels wrong.
The Finding: The AI has a hard time guessing the right mix of people.
- Gender: The AI often over-casts men in roles like "cooking" or "dining," even though historical data suggests women did a lot of this work. Conversely, it sometimes under-casts women in education, even when they were present.
- Race: The AI tends to make almost everyone White in older time periods, only adding diversity as it gets closer to the present day. It struggles to imagine a diverse world in the past, often defaulting to a "Western-centric" view.
- The Problem: The AI is projecting today's biases or a simplified version of history onto the past, rather than reflecting the complex reality of who was actually there.
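Conceptually, this test compares the mix of people the AI draws against a historical baseline. A minimal sketch with made-up numbers (the 9-to-1 split and the 30% baseline below are hypothetical, not the paper's data):

```python
from collections import Counter

def group_shares(labels):
    """Fraction of each perceived-group label among the people depicted."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Toy example: 9 of 10 people in "cooking" images read as male,
# versus a hypothetical historical baseline of 30% male.
observed = group_shares(["male"] * 9 + ["female"])
baseline_male = 0.30
gap = round(observed["male"] - baseline_male, 2)
print(gap)  # 0.6 -- the over-representation gap
```

A large positive gap means the model over-casts that group relative to the baseline; a negative gap means under-casting.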
The Big Conclusion
The researchers tried to "fix" the AI by giving it better instructions (like saying, "Make it look like a real photo, not a drawing"). But the AI was stubborn. It kept falling back on its old habits.
In simple terms:
Current AI models are like tourists with a camera who have never actually visited the places they are photographing. They rely on postcards and movies to guess what a place looks like. They get the general vibe right, but they miss the details, bring the wrong props, and often get the people wrong.
Why Does This Matter?
If schools, museums, or news outlets start using these AI images to teach history or show the past, they might accidentally teach us fake history. We might start believing that the past was always black and white, that only men did certain jobs, or that modern technology has always existed.
This paper is a wake-up call: Before we trust AI to tell us stories about our past, we need to teach it to be a better historian, not just a better artist.