Imagine the world of Artificial Intelligence (AI) as a massive, bustling library where computers learn to read, write, and understand human language. For a long time, this library was built mostly by a specific group of people, using books written in a specific way. As a result, the library's "rules" often accidentally left out, misunderstood, or even insulted people who didn't fit the standard mold—specifically, the LGBTQIA+ community.
This paper, "Queer NLP: A Critical Survey," is like a group of librarians and community members walking through that library together to take a hard look at the shelves. They aren't just checking for typos; they are asking: Who is missing from these stories? Whose voices are being muffled? And how can we fix the library so everyone feels welcome?
Here is a breakdown of their findings using simple analogies:
1. The "Reactive" vs. "Proactive" Problem
The Analogy: Imagine a factory that makes toys. For years, the factory made toys that broke easily for certain kids. The engineers kept waiting for a kid to break a toy, then they would say, "Oh no! That toy is broken!" and try to patch it up. They rarely asked, "How do we design a toy that works for everyone from the start?"
The Finding: The paper finds that most AI research on queer topics is reactive. Researchers are mostly busy pointing out where the AI is being mean or biased (like a broken toy) rather than building new, better systems that are inclusive by design. They are often playing "Whac-A-Mole" with bias instead of fixing the machine.
2. The "English-Only" Blindfold
The Analogy: Imagine the library only has books in English. If you speak Spanish, German, or Hindi, the librarians might try to translate your story into English, but in the process, they lose the flavor, the culture, and the specific meaning of your words.
The Finding: The survey discovered that 76% to 80% of the research focuses only on English. Even when researchers study other languages, they usually just translate English resources rather than starting from how queer people actually speak in those languages. They are ignoring the rich, unique ways queer people express themselves in languages like Hindi, Swahili, or Arabic. It's like trying to understand a global party by only listening to the DJ in one corner.
3. The "Missing Guests" at the Dinner Party
The Analogy: Imagine planning a huge dinner party for a diverse group of people. You spend weeks cooking and setting the table, but you never actually invite the people you are cooking for to taste the food or tell you what they like. You just guess what they want.
The Finding: The paper highlights a massive gap: Stakeholder Involvement. Almost none of the studies actually invited LGBTQIA+ people to help design the AI or test it. Instead of asking the community, "How do you want to be described?", researchers often just guess or use computer metrics. It's like a chef cooking a meal without ever asking the diners if they are allergic to anything.
4. The "Rigid Boxes" vs. The "Fluid Rainbow"
The Analogy: Imagine trying to sort a box of colorful, shape-shifting jellybeans into rigid, pre-labeled jars: "Red," "Blue," and "Green." But these jellybeans can change color, mix colors, or be a color that doesn't exist in the jars yet. If you force them into the jars, you crush them or throw them away.
The Finding: AI systems often rely on binary thinking (Man vs. Woman, Straight vs. Gay). But queer identities are fluid and complex. The paper argues that AI is currently too rigid. It struggles with:
- Pronouns: It gets confused by singular "they/them" or neopronouns like "ze/zir."
- Context: It might think the word "gay" is an insult because it sees it in a hateful sentence, even if a queer person is using it happily in a different context.
- Stereotypes: It assumes all gay people like certain things, just like it assumes all women like cooking.
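The "rigid jars" problem can be made concrete with a tiny sketch. This is purely illustrative (the lookup table and function name are hypothetical, not from the paper): a pronoun-to-gender rule that only knows binary categories, so anything outside them simply falls through.

```python
# A minimal sketch of the "rigid boxes" problem: a pronoun lookup
# that was built with only two categories in mind.

PRONOUN_TO_GENDER = {
    "he": "male", "him": "male", "his": "male",
    "she": "female", "her": "female", "hers": "female",
}

def infer_gender(pronoun: str) -> str:
    """Return a binary gender label, or 'unknown' when the pronoun
    falls outside the system's rigid categories."""
    return PRONOUN_TO_GENDER.get(pronoun.lower(), "unknown")

print(infer_gender("she"))   # "female" - a binary pronoun fits the jars
print(infer_gender("they"))  # "unknown" - singular they falls through
print(infer_gender("ze"))    # "unknown" - neopronouns fall through too
```

Real NLP pipelines are far more sophisticated than a dictionary lookup, but the failure shape is the same: identities that don't fit the pre-labeled jars get mislabeled, lumped into "unknown," or dropped.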
5. The "Silent" Voices
The Analogy: Imagine a microphone that is very sensitive to loud, clear voices but cuts out anyone who speaks softly, uses slang, or speaks with an accent.
The Finding: The AI is bad at understanding queer speech.
- Hate Speech Detection: The AI often flags normal queer slang as "toxic" or "hate speech" because it doesn't understand the context (like how a community might reclaim a word that was once an insult).
- Voice Recognition: If a transgender person's voice doesn't match the AI's "male" or "female" training data, the system may misgender them or fail to transcribe their speech accurately.
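The keyword-list failure mode behind over-flagging can be sketched in a few lines. Everything here is illustrative (the word list and function are hypothetical, not drawn from any real moderation system): a filter that flags a sentence whenever a listed term appears, with no awareness of who is speaking or how the word is used.

```python
# A minimal sketch of context-free hate speech detection: any sentence
# containing a listed term is flagged, so reclaimed in-community usage
# gets flagged right alongside actual abuse.

FLAGGED_TERMS = {"queer"}  # once used as a slur, now widely reclaimed

def is_toxic(sentence: str) -> bool:
    """Flag a sentence if it contains any listed term, ignoring
    speaker, intent, and context entirely."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return not words.isdisjoint(FLAGGED_TERMS)

print(is_toxic("Proud to be queer!"))    # True - joyful self-description, still flagged
print(is_toxic("Have a nice day."))      # False
```

Production classifiers use learned models rather than literal word lists, but the paper's point carries over: when the training signal associates a reclaimed word mostly with hateful sentences, the model inherits exactly this context-blindness.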
6. The Call to Action: Building a Better Library
The authors aren't just complaining; they are handing the community a blueprint for a new library. They suggest:
- Invite the Guests: Let LGBTQIA+ people help build the AI. Don't just study them; work with them.
- Break the Boxes: Stop forcing people into "Man/Woman" or "Straight/Gay" boxes. Build systems that understand the messy, beautiful spectrum of human identity.
- Go Global: Stop focusing only on English. Build tools for Spanish, Hindi, Arabic, and every other language where queer people live.
- Embrace "Refusal": Sometimes, the most powerful thing a queer person can do is say, "I don't want to be categorized by your system." The paper suggests AI should respect that choice, too.
The Bottom Line
This paper is a wake-up call. It says that while AI is getting smarter, it is still "blind" to the full spectrum of human identity. To make technology truly useful and safe for everyone, we need to stop patching up the old, broken systems and start building new ones that are designed with queer joy, complexity, and community at the very center.