Imagine you are hiring a very strict, very powerful art critic to help you build a massive museum. This critic doesn't just look at art; they decide which paintings get to be in the museum, which ones get thrown in the trash, and which ones get the gold stars.
This paper investigates who this critic is, what they actually like, and why. The "critic" in this story is an AI tool called the LAION-Aesthetics Predictor (LAP): software used by the creators of famous AI art generators (like Stable Diffusion) to decide which images are "good" and which are "bad."
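If you're curious what this critic looks like under the hood: LAP is reported to be a small model stacked on top of CLIP image embeddings, squeezing each image down to a single "beauty" number. Here's a minimal sketch of that idea in Python; the open_clip model choice and the randomly initialized scoring head are illustrative assumptions, not the predictor's real weights or code.

```python
# A minimal sketch of how a CLIP-based aesthetics predictor works.
# Assumptions: LAP reportedly feeds CLIP embeddings into a small learned
# head; the model name, the 768-d size, and the untrained head below are
# stand-ins for illustration, not LAP's actual code.
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load a CLIP image encoder (ViT-L/14 is the variant LAP reportedly uses).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)

# Stand-in scoring head: maps one 768-d embedding to one scalar score.
aesthetic_head = nn.Linear(768, 1)

def aesthetic_score(path: str) -> float:
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embedding
        return aesthetic_head(emb).item()  # roughly a 1-10 "beauty" rating
```

Every image in a multi-billion-image dataset gets run through something like this, and that single number decides its fate.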
Here is the story of the paper, broken down simply:
1. The Problem: One Size Does Not Fit All
The authors started with a big question: Whose taste is this AI using?
Art is subjective. What one person finds beautiful, another might find boring. But AI models need a single, universal rule to decide what is "high quality." The LAP model acts as this rulebook. The authors suspected that this rulebook wasn't actually neutral; it was probably biased toward a specific type of person and a specific type of art.
2. The Investigation: The "Audit"
The researchers acted like detectives, checking what the LAP model kept and what it threw away across vast collections of images. They used three different "test cases" (a code sketch of the audit logic follows this list):
- The Big Data Test (LAION-Aesthetics Dataset): They looked at 1.2 billion images that the AI had already filtered.
- The Discovery: The AI was like a bouncer at a club who happily let in photos of women but kept men and LGBTQ+ people out. It also favored images whose captions mentioned Christians or Hindus while filtering out those mentioning Jews and Muslims.
- The Museum Test (The Met Museum): They fed the model images from the famous Metropolitan Museum of Art.
- The Discovery: The AI only gave high scores to Western and Japanese art (like realistic landscapes and portraits). It gave almost zero stars to African, Native American, Islamic, or Egyptian art. It was as if the AI thought, "If it's not a realistic painting from Europe or Japan, it's not art."
- The Style Test (WikiArt): They looked at different art styles.
- The Discovery: The AI loved realism. It gave high scores to photos and paintings that looked exactly like real life, and it hated abstract art, cubism, and anything weird or surreal. It was like a critic who only likes photorealistic portraits and thinks Picasso is a joke.
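Mechanically, an audit like this is simple to sketch: score every image, tag it with a group (religion mentioned in the caption, art culture, art style), and compare how each group fares under one shared cutoff. The CSV layout, column names, and the 6.5 threshold below are illustrative assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of the audit logic, assuming scores were precomputed
# and each image is tagged with a group label (culture, style, etc.).
import pandas as pd

scores = pd.read_csv("lap_scores.csv")  # assumed columns: image_id, group, score

# LAION-Aesthetics-style filtering keeps only images above a cutoff.
THRESHOLD = 6.5  # illustrative; real subsets use various cutoffs
scores["kept"] = scores["score"] >= THRESHOLD

# Compare how each group fares under the same "universal" rule.
report = (
    scores.groupby("group")
    .agg(mean_score=("score", "mean"), pass_rate=("kept", "mean"), n=("score", "size"))
    .sort_values("pass_rate", ascending=False)
)
print(report)  # big gaps in pass_rate between groups = a biased bouncer
```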
3. The Origin Story: The "Trace Ethnography"
The researchers asked: Why is this AI so biased? To find out, they didn't just look at the code; they looked at the people who made it. This is called a "trace ethnography"—basically, digging through the digital breadcrumbs of how the tool was built.
They found that the LAP model was built by one man (the founder of LAION) based on his own personal taste and the data he could easily find.
- The Data: The training data came mostly from English-speaking photographers on a photo-contest website called DPChallenge and a group of Western AI enthusiasts on Discord.
- The Result: The AI wasn't learning "universal beauty." It was learning what a specific group of Western, English-speaking, tech-savvy men thought was beautiful.
- The Flaw: The creator mixed up different types of ratings. He took ratings from photography contests (where you judge a photo against others in the same category) and mixed them with ratings from AI art bots (where you just rate an image 1-10). It's like mixing a score from a "Best Landscape" contest with a score from a "Best Abstract Painting" contest and calling it a single "Art Score."
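That analogy is easy to make concrete with a toy example (every number below is invented): a relative contest score and an absolute 1-10 bot rating can be numerically identical while meaning completely different things, so pooling them into one training target throws away the context that gave each number its meaning.

```python
# Toy illustration of the scale-mixing flaw; all numbers are made up.
# Contest scores are RELATIVE: a 5.2 may have topped its abstract category.
# Bot scores are ABSOLUTE: a 5.2 just means "middling" on a 1-10 scale.
contest_scores = {"abstract_contest_winner.jpg": 5.2, "landscape_entry.jpg": 6.1}
bot_scores = {"ai_render.png": 5.2}

# Naive pooling treats every number as the same kind of "Art Score".
training_targets = {**contest_scores, **bot_scores}

for image, score in training_targets.items():
    print(f"{image}: training target = {score}")
# The predictor now learns that a category-winning abstract photo is exactly
# as "beautiful" as a middling AI render, because the context was discarded.
```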
4. The Three "Gazes"
The authors describe the AI's bias using three powerful metaphors:
- The Imperial Gaze: The AI acts like a colonial ruler. It decides that Western and Japanese art is the "standard" of beauty and ignores everything else (African, Indigenous, Middle Eastern art). It reinforces the idea that Western culture is the only culture that matters.
- The Realist Gaze: The AI is obsessed with things looking "real." It rejects modern art, abstract ideas, and surrealism. This is dangerous because it limits the creativity of AI. If the AI only learns to make "realistic" things, it can't help artists create weird, dreamy, or abstract masterpieces.
- The Male Gaze: The AI seems to view the world through the eyes of a straight man. It loves images of women (often objectifying them) but ignores men and LGBTQ+ people. This is a huge problem because it means AI art generators are more likely to create images of women, potentially leading to more deepfakes and non-consensual sexual imagery, while ignoring other identities.
5. The Big Takeaway
The paper argues that we need to stop trying to find a single "perfect" definition of beauty for AI.
- Don't pretend there is one "best" way to see art. There isn't.
- Be honest about what the AI is doing. Instead of saying, "This AI is great at judging quality," we should say, "This AI is great at judging photorealistic Western landscapes."
- Diversify the critics. We need to build AI systems that understand many different cultures and styles, not just the taste of one group of people.
In short: The paper reveals that the "judge" deciding what AI art looks like is actually just a mirror reflecting the biases of a few Western men. If we want AI to create art for everyone, we need to break that mirror and let in more voices.