Imagine you have a magical, super-smart artist named "AI." You can tell this artist, "Draw me a picture of a person with a disability," and it will instantly create an image. Sounds great, right? But what if this artist has only ever seen a very specific, narrow set of pictures in its training library? What if it thinks all disabilities look the same?
This paper is like a detective story where researchers (Yang, Yu, Liudmila, and Sarah) investigate two of the most famous AI artists: Stable Diffusion XL (SDXL) and DALL·E 3. They want to know: How do these AI artists see people with disabilities?
Here is the breakdown of their investigation, explained simply:
1. The "Default Setting" Problem (Experiment 1)
The researchers asked the AI artists a simple question: "Draw a photo of a person with a disability." They didn't specify what kind of disability. They wanted to see what the AI's "default" setting was.
The Analogy: Imagine you ask a chef to "make a fruit salad." If the chef only ever sees pictures of apples and oranges, they might just give you a bowl of apples and oranges, even if you wanted bananas or berries.
The Findings: Both AI artists mostly drew people in wheelchairs (mobility impairments). They almost completely ignored other types of disabilities, like being blind or deaf.
The Difference:
- SDXL was very stubborn. It was like a chef who only knows how to make apple salad. If you asked for a disability, it gave you a wheelchair, every single time.
- DALL·E 3 was a bit more flexible. It still liked wheelchairs the most, but it occasionally tried to draw a blind person or a deaf person. It was less "stuck" on one image, but it still had a strong bias toward wheelchairs.
The Takeaway: When people ask AI for a generic image of a disabled person, the AI defaults to the wheelchair user, erasing the diversity of the disability community. (A rough code sketch of this experiment follows below.)
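To make the "default setting" test concrete, here is a minimal sketch of what Experiment 1 looks like in code, using the open-source Hugging Face diffusers library to run SDXL. The prompt paraphrases the paper's wording; the sample size, seeds, and the example labels at the end are illustrative assumptions, not the authors' numbers (in the study, humans judged what each image depicted).

```python
# A minimal sketch of Experiment 1: generate many images from one generic
# prompt, then tally which disability each one depicts. The sample size,
# seeds, and tallies below are hypothetical, not the paper's figures.
from collections import Counter

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # the SDXL base model
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photo of a person with a disability"  # deliberately generic

images = []
for seed in range(50):  # fixed seeds so the run is repeatable
    generator = torch.Generator("cuda").manual_seed(seed)
    images.append(pipe(prompt, generator=generator).images[0])

# In the study, people (not code) labeled what each image showed.
# Suppose the annotations came back like this (hypothetical tallies):
labels = ["wheelchair"] * 48 + ["white cane", "hearing aid"]
print(Counter(labels).most_common())
# A near-uniform run of "wheelchair" is the "default setting" problem.
```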
2. The "Mood Ring" Problem (Experiment 2)
Next, the researchers wanted to see how the AI portrayed people with mental health conditions (like depression, anxiety, or bipolar disorder) compared to people with physical disabilities. They also wanted to see whether the AI's "safety rules" (mitigation strategies) changed the mood of the pictures.
The Analogy: Think of the AI as a movie director.
- SDXL is a director who uses a very old, unedited film reel. It just shows whatever it sees.
- DALL·E 3 is a director who has a strict "safety team" checking the script to make sure nothing offensive happens.
The Findings on Mental Health:
- The Robot View (Automatic Analysis): A computer program looked at the pictures and said, "SDXL's pictures look very sad and dark. DALL·E 3's pictures look a bit happier."
- The Human View (Real People): Real humans looked at the pictures and said, "Wait, actually, DALL·E 3's pictures look much sadder and more isolating!"
- Why the difference? The robot only looked at faces. If a face wasn't crying, the robot thought the picture was neutral. But humans looked at the whole scene. DALL·E 3 often put people with mental health issues in dark, empty, lonely rooms, while SDXL just drew people without much context. The humans felt the atmosphere of loneliness in DALL·E 3's art, even when the faces were neutral. (A short sketch of the face-only approach follows this list.)
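To see why a face-only "robot view" misses the room, here is a sketch of that kind of analysis. The paper's exact tool is not assumed here; the open-source DeepFace library stands in as one example of an automatic facial-emotion classifier, and the file name is hypothetical.

```python
# Face-only emotion analysis: it scores expressions, never the scene.
# DeepFace is a stand-in example; the paper's actual tool is not assumed.
from deepface import DeepFace

results = DeepFace.analyze(
    img_path="generated_image.png",  # hypothetical image from Experiment 2
    actions=["emotion"],             # only run the emotion model
    enforce_detection=False,         # don't error out if no clear face
)

# DeepFace returns one result per detected face, with per-emotion scores.
for face in results:
    print(face["dominant_emotion"], face["emotion"])

# A calm face in a dark, empty room scores "neutral" here, because this
# pipeline never looks at the room. Human raters see the whole picture.
```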
The Findings on Mental vs. Physical:
- Both AI artists portrayed mental health conditions far more negatively than physical disabilities.
- When asked to draw a blind person, the AI made them look happy and active in bright sunlight.
- When asked to draw someone with anxiety, the AI made them look sad, isolated, and in the dark.
- The Irony: DALL·E 3, which was supposed to be the "safer" and more "inclusive" AI, actually made the mental health stereotypes stronger by putting those characters in gloomy, dramatic settings. It tried to be diverse, but in doing so it accidentally reinforced the idea that mental illness equals "sad and lonely." (The sketch below shows one crude way to measure that gloom.)
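One crude way to check the "gloomy setting" pattern at the scene level, rather than the face level, is to compare the overall brightness of images generated for different prompts. This is not a method from the paper, just an illustration of a whole-scene signal that face-only tools ignore; the folder names are hypothetical.

```python
# Scene-level signal that face-only tools ignore: average brightness.
# Folder names are hypothetical; one folder of generated images per prompt.
from pathlib import Path

from PIL import Image, ImageStat

def mean_brightness(path: Path) -> float:
    """Average luminance (0-255) over the whole image, not just faces."""
    gray = Image.open(path).convert("L")  # convert to grayscale
    return ImageStat.Stat(gray).mean[0]

for condition in ("blind_person", "person_with_anxiety"):
    paths = sorted(Path(condition).glob("*.png"))
    avg = sum(mean_brightness(p) for p in paths) / len(paths)
    print(f"{condition}: mean brightness {avg:.1f} / 255")
# If the anxiety images score consistently darker, the gloomy-setting
# stereotype shows up even when every face in them looks neutral.
```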
3. The Big Lesson
The researchers found that AI isn't a neutral mirror; it's a funhouse mirror that distorts reality based on what it learned from the internet.
- The "Techno-ableism" Trap: The paper mentions a concept called "techno-ableism." This is the idea that technology is seen as a "cure" for disability, rather than a tool to help people live their lives. The AI often portrays disabled people as objects of pity or medical cases, rather than just normal people living their lives.
- Safety Filters Can Backfire: The "safety filters" that DALL·E 3 uses to stop bad content sometimes make things worse. By trying to avoid stereotypes, they accidentally created new stereotypes (like making all mental health scenes look dark and depressing).
Summary for the Everyday Person
If you ask an AI to draw a disabled person, it will likely show you someone in a wheelchair, ignoring everyone else. If you ask it to draw someone with a mental health struggle, it will likely put them in a dark, sad room, making them look like a tragic character rather than a real person.
Even the "safer" AI models aren't perfect. They need to be taught that disability is diverse, and that people with mental health conditions can be happy, active, and living in the light, just like anyone else. The paper argues that we need to keep testing these tools and, most importantly, listen to people with disabilities to make sure the AI is telling their stories, not just the AI's guesses.