A Systematic Analysis of Biases in Large Language Models

This study systematically evaluates four widely adopted large language models across political, ideological, alliance, linguistic, and gender dimensions, revealing that despite alignment efforts for neutrality, these models still exhibit significant biases and affinities.

Xulang Zhang, Rui Mao, Erik Cambria

Published 2026-03-05
📖 5 min read · 🧠 Deep dive

Imagine you've hired four very smart, very well-read personal assistants to help you with your life. You ask them to summarize the news, vote on global issues, write stories in different languages, and even tell you what they think about gender roles. You expect them to be neutral, fair, and unbiased—like a perfectly balanced scale.

But this paper is like a detective report that says: "Not quite."

The researchers tested four famous AI assistants (Qwen, DeepSeek, Gemini, and GPT) to see if they really are the neutral robots we think they are. Here's what they found, explained with some everyday analogies:

1. The Political News Summarizer (The "Neutral" Reporter)

The Test: They asked the AIs to summarize political news from the "middle" of the road. Then, they compared the AI's summary to how the "Left" and "Right" news channels reported the same story.
The Finding:

  • The Analogy: Think of a translator rendering a speech. If they do their job well, the translation should read as if the original, moderate author had written it. The study found the opposite tendency: even the AIs' most faithful summaries drifted a bit toward the wording and framing of the Left-leaning coverage.
  • The Twist: One AI, Gemini, was a bit like a conservative radio host; it leaned slightly to the Right. Another, GPT, leaned slightly to the Left. DeepSeek was the most balanced, like a truly neutral referee.
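
To make the test above concrete, here is a minimal sketch of the comparison it relies on: take a model's summary of a centrist article and ask whether it reads more like the Left-leaning or the Right-leaning coverage of the same story. The articles are invented and the bag-of-words cosine similarity is a stand-in; the paper's actual texts and similarity metric are not reproduced here.

```python
# Toy version of the "neutral summarizer" test: compare a model's summary of a
# centrist article against left- and right-leaning coverage of the same story.
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between simple word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical inputs: a model-written summary of a centrist article, plus the
# left- and right-leaning articles covering the same story.
model_summary = "lawmakers pass budget after long negotiations over social programs"
left_article = "lawmakers expand social programs in hard-won budget victory"
right_article = "lawmakers approve budget despite concerns over runaway spending"

lean_left = cosine_similarity(model_summary, left_article)
lean_right = cosine_similarity(model_summary, right_article)
print(f"similarity to Left coverage:  {lean_left:.3f}")
print(f"similarity to Right coverage: {lean_right:.3f}")
print("closer to:", "Left" if lean_left > lean_right else "Right")
```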

2. The Ideological Detective (The "Labeler")

The Test: They showed the AIs news articles about hot-button topics (like immigration, abortion, or elections) and asked, "Is this article Left, Right, or Center?"
The Finding:

  • The Analogy: It's like asking someone to guess the flavor of a soup.
    • Gemini was terrible at spotting the "spicy" (Left) flavors. It kept saying, "This is just plain chicken soup (Center)," even when it was clearly spicy. It seemed to agree more with the Right.
    • GPT was very good at spotting the spicy flavors and agreed more with the Left.
    • Qwen and DeepSeek were a bit confused. They often guessed the opposite flavor entirely, suggesting they didn't really understand the deep arguments of either side.
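
The bookkeeping behind this test is essentially a confusion matrix: tally how often each model's Left/Right/Center guess matches the article's actual leaning. A minimal sketch, with made-up labels standing in for the real annotations:

```python
# Toy confusion matrix for the "labeler" test: how often each true leaning
# (Left/Center/Right) gets predicted as each label by a model.
from collections import Counter

LABELS = ["Left", "Center", "Right"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true label was predicted as each label."""
    counts = Counter(zip(true_labels, predicted_labels))
    return {t: {p: counts[(t, p)] for p in LABELS} for t in LABELS}

# Hypothetical ground truth and one model's guesses for eight articles.
truth   = ["Left", "Left", "Center", "Right", "Left", "Center", "Right", "Left"]
guesses = ["Center", "Center", "Center", "Right", "Left", "Center", "Right", "Center"]

for true_label, row in confusion_matrix(truth, guesses).items():
    print(f"true {true_label:<6} ->", row)
# A model that keeps calling Left articles "Center" (as in this toy example)
# shows the Gemini-like pattern described above.
```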

3. The Global Voter (The "UN Delegate")

The Test: They made the AIs pretend to be delegates at the United Nations and vote on thousands of real historical resolutions. Then, they checked which countries the AIs most often voted with.
The Finding:

  • The Analogy: Imagine a new student in a school trying to figure out which friend group to join.
    • Most AIs tended to agree with delegates from Latin America and Africa (the "peripheral" groups).
    • Gemini was the most "realistic" voter, matching actual human delegates the best. But here's the kicker: Gemini was the only one whose votes lined up with countries like China and North Korea rather than the USA. It was the "outlier" of the group.
    • GPT was the most stubborn, often disagreeing with the USA and almost everyone else.
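
Under the hood, this test boils down to an agreement score: for each country, count the fraction of resolutions on which the model's simulated vote matches that country's real vote. A toy sketch with invented vote records (the real resolution data is not reproduced here):

```python
# Toy version of the "UN delegate" test: measure how often a model's simulated
# votes match each country's actual votes across a set of resolutions.
model_votes = {"res1": "Yes", "res2": "No", "res3": "Yes", "res4": "Abstain"}

country_votes = {
    "USA":    {"res1": "No",  "res2": "No",  "res3": "No",  "res4": "Yes"},
    "Brazil": {"res1": "Yes", "res2": "No",  "res3": "Yes", "res4": "Abstain"},
    "China":  {"res1": "Yes", "res2": "Yes", "res3": "Yes", "res4": "Abstain"},
}

def agreement_rate(model: dict, country: dict) -> float:
    """Fraction of resolutions on which the model and the country voted the same way."""
    shared = [r for r in model if r in country]
    matches = sum(model[r] == country[r] for r in shared)
    return matches / len(shared) if shared else 0.0

for name, votes in country_votes.items():
    print(f"{name:<7} agreement: {agreement_rate(model_votes, votes):.0%}")
```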

4. The Multilingual Storyteller (The "Dreamer")

The Test: They asked the AIs to write the endings to fictional stories about made-up tribes, but they had to do it in 92 different languages.
The Finding:

  • The Analogy: Imagine a dreamer who thinks in different languages. You'd expect their dreams to look totally different in French than in Swahili.
    • Surprisingly, when these AIs "thought" in languages from Southern Africa, their stories started to sound very similar to their English stories. It's like a person who, when speaking a rare language, accidentally slips into their native accent.
    • GPT was the most creative; its stories looked different in every language, showing it had a more diverse "brain" for languages.
    • The Good News: None of them favored English just because English is the most common language in their training data. They didn't just default to the "dominant" language.
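
One way to quantify this "accent slipping" is to embed each language's story ending with a multilingual sentence encoder and measure how close it sits to the English ending. The sketch below assumes those embeddings already exist and uses made-up 3-dimensional vectors as placeholders; the paper's actual encoder and scores are not shown.

```python
# Toy version of the "multilingual dreamer" test: compare embeddings of the same
# story's ending written in different languages against the English ending.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings (placeholders for real multilingual sentence embeddings).
endings = {
    "English": [0.90, 0.10, 0.20],
    "French":  [0.40, 0.80, 0.30],
    "Swahili": [0.30, 0.70, 0.60],
    "Zulu":    [0.85, 0.15, 0.25],  # suspiciously close to English, as in the finding
}

english = endings["English"]
for language, vector in endings.items():
    if language != "English":
        print(f"{language:<8} vs English: {cosine(vector, english):.3f}")
```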

5. The Gender Survey Taker (The "Opinion Poll")

The Test: They gave the AIs a survey about values (with questions like "Is it okay for a woman to work?" or "Is government surveillance bad?") without reminding them that they were AIs. Then they compared the answers to real surveys taken by men and women.
The Finding:

  • The Analogy: Imagine asking a robot, "What do you value?"
    • All four AIs sounded much more like women than men. They were more progressive, more supportive of equality, and less traditional.
    • GPT was the most "feminine" of the bunch, aligning strongly with women's values.
    • Qwen and DeepSeek were a bit inconsistent, sometimes saying one thing and then the opposite in the next question, like a person who hasn't quite made up their mind.
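
The comparison behind this finding can be sketched as a simple distance calculation: put the model's survey answers on the same numeric scale as human respondents, then check whether they land closer to women's or men's average answers. All numbers below are invented placeholders, not the study's data.

```python
# Toy version of the "opinion poll" test: is the model's answer profile closer to
# the average answers given by women or by men? (1-5 agreement scale, made-up data.)
questions = ["women_working", "gender_equality", "traditional_roles", "surveillance"]

women_avg = {"women_working": 4.6, "gender_equality": 4.4, "traditional_roles": 2.1, "surveillance": 2.3}
men_avg   = {"women_working": 4.1, "gender_equality": 3.8, "traditional_roles": 2.9, "surveillance": 2.7}
model_ans = {"women_working": 4.7, "gender_equality": 4.5, "traditional_roles": 1.9, "surveillance": 2.2}

def mean_abs_gap(answers: dict, reference: dict) -> float:
    """Average absolute distance between the model's answers and a reference group."""
    return sum(abs(answers[q] - reference[q]) for q in questions) / len(questions)

gap_women = mean_abs_gap(model_ans, women_avg)
gap_men = mean_abs_gap(model_ans, men_avg)
print(f"distance to women's averages: {gap_women:.2f}")
print(f"distance to men's averages:   {gap_men:.2f}")
print("closer to:", "women" if gap_women < gap_men else "men")
```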

The Big Takeaway

The paper concludes that these AIs are not blank slates. Even though we try to train them to be neutral, they inherit the biases, prejudices, and "opinions" of the humans who wrote the data they learned from.

  • The Metaphor: If you teach a child by reading them a library of books, the child will inevitably sound like the authors of those books. If the library is mostly written by people with certain political views or cultural backgrounds, the child will reflect that.
  • The Warning: We need to stop thinking of these AIs as perfect, neutral oracles. They are more like very well-read, slightly opinionated librarians. If you use them to make big decisions, you need to know which "side" of the library they tend to walk toward.

The authors suggest that maybe we shouldn't try to make AI think exactly like humans (who are messy and biased). Instead, maybe we should design AI to be something new—something that is neutral by design, not just by accident.