A Systematic Analysis of Biases in Large Language Models

This study systematically evaluates four widely adopted large language models across political, ideological, alliance, linguistic, and gender dimensions, revealing that despite alignment efforts for neutrality, these models still exhibit significant biases and affinities.

Xulang Zhang, Rui Mao, Erik Cambria

Published 2026-03-05
📖 5 min read · 🧠 Deep dive

Imagine you've hired four very smart, very well-read personal assistants to help you with your life. You ask them to summarize the news, vote on global issues, write stories in different languages, and even tell you what they think about gender roles. You expect them to be neutral, fair, and unbiased—like a perfectly balanced scale.

But this paper is like a detective report that says: "Not quite."

The researchers tested four famous AI assistants (Qwen, DeepSeek, Gemini, and GPT) to see if they really are the neutral robots we think they are. Here's what they found, explained with some everyday analogies:

1. The Political News Summarizer (The "Neutral" Reporter)

The Test: They asked the AIs to summarize political news from the "middle" of the road. Then, they compared the AI's summary to how the "Left" and "Right" news channels reported the same story.
The Finding:

  • The Analogy: Think of a translator rendering a speech. If they do their job well, the translation should read as if the original, moderate author had written it. The study found the opposite tendency: even the AIs' most faithful summaries drifted a bit toward the wording and framing of the Left-leaning coverage.
  • The Twist: One AI, Gemini, was a bit like a conservative radio host; it leaned slightly to the Right. Another, GPT, leaned slightly to the Left. DeepSeek was the most balanced, like a truly neutral referee.
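
To make the test above concrete, here is a minimal sketch of the comparison it relies on: take a model's summary of a centrist article and ask whether it reads more like the Left-leaning or the Right-leaning coverage of the same story. The articles are invented and the bag-of-words cosine similarity is a stand-in; the paper's actual texts and similarity metric are not reproduced here.

```python
# Toy version of the "neutral summarizer" test: compare a model's summary of a
# centrist article against left- and right-leaning coverage of the same story.
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between simple word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical inputs: a model-written summary of a centrist article, plus the
# left- and right-leaning articles covering the same story.
model_summary = "lawmakers pass budget after long negotiations over social programs"
left_article = "lawmakers expand social programs in hard-won budget victory"
right_article = "lawmakers approve budget despite concerns over runaway spending"

lean_left = cosine_similarity(model_summary, left_article)
lean_right = cosine_similarity(model_summary, right_article)
print(f"similarity to Left coverage:  {lean_left:.3f}")
print(f"similarity to Right coverage: {lean_right:.3f}")
print("closer to:", "Left" if lean_left > lean_right else "Right")
```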

2. The Ideological Detective (The "Labeler")

The Test: They showed the AIs news articles about hot-button topics (like immigration, abortion, or elections) and asked, "Is this article Left, Right, or Center?"
The Finding:

  • The Analogy: It's like asking someone to guess the flavor of a soup.
    • Gemini was terrible at spotting the "spicy" (Left) flavors. It kept saying, "This is just plain chicken soup (Center)," even when it was clearly spicy. It seemed to agree more with the Right.
    • GPT was very good at spotting the spicy flavors and agreed more with the Left.
    • Qwen and DeepSeek were a bit confused. They often guessed the opposite flavor entirely, suggesting they didn't really understand the deep arguments of either side.
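
The bookkeeping behind this test is essentially a confusion matrix: tally how often each model's Left/Right/Center guess matches the article's actual leaning. A minimal sketch, with made-up labels standing in for the real annotations:

```python
# Toy confusion matrix for the "labeler" test: how often each true leaning
# (Left/Center/Right) gets predicted as each label by a model.
from collections import Counter

LABELS = ["Left", "Center", "Right"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true label was predicted as each label."""
    counts = Counter(zip(true_labels, predicted_labels))
    return {t: {p: counts[(t, p)] for p in LABELS} for t in LABELS}

# Hypothetical ground truth and one model's guesses for eight articles.
truth   = ["Left", "Left", "Center", "Right", "Left", "Center", "Right", "Left"]
guesses = ["Center", "Center", "Center", "Right", "Left", "Center", "Right", "Center"]

for true_label, row in confusion_matrix(truth, guesses).items():
    print(f"true {true_label:<6} ->", row)
# A model that keeps calling Left articles "Center" (as in this toy example)
# shows the Gemini-like pattern described above.
```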

3. The Global Voter (The "UN Delegate")

The Test: They made the AIs pretend to be delegates at the United Nations and vote on thousands of real historical resolutions. Then, they checked which countries the AIs most often voted with.
The Finding:

  • The Analogy: Imagine a new student in a school trying to figure out which friend group to join.
    • Most AIs tended to agree with delegates from Latin America and Africa (the "peripheral" groups).
    • Gemini was the most "realistic" voter, matching actual human delegates the best. But here's the kicker: Gemini was the only one whose votes lined up with countries like China and North Korea rather than the USA. It was the "outlier" of the group.
    • GPT was the most stubborn, often disagreeing with the USA and almost everyone else.
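
Under the hood, this test boils down to an agreement score: for each country, count the fraction of resolutions on which the model's simulated vote matches that country's real vote. A toy sketch with invented vote records (the real resolution data is not reproduced here):

```python
# Toy version of the "UN delegate" test: measure how often a model's simulated
# votes match each country's actual votes across a set of resolutions.
model_votes = {"res1": "Yes", "res2": "No", "res3": "Yes", "res4": "Abstain"}

country_votes = {
    "USA":    {"res1": "No",  "res2": "No",  "res3": "No",  "res4": "Yes"},
    "Brazil": {"res1": "Yes", "res2": "No",  "res3": "Yes", "res4": "Abstain"},
    "China":  {"res1": "Yes", "res2": "Yes", "res3": "Yes", "res4": "Abstain"},
}

def agreement_rate(model: dict, country: dict) -> float:
    """Fraction of resolutions on which the model and the country voted the same way."""
    shared = [r for r in model if r in country]
    matches = sum(model[r] == country[r] for r in shared)
    return matches / len(shared) if shared else 0.0

for name, votes in country_votes.items():
    print(f"{name:<7} agreement: {agreement_rate(model_votes, votes):.0%}")
```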

4. The Multilingual Storyteller (The "Dreamer")

The Test: They asked the AIs to write the endings to fictional stories about made-up tribes, but they had to do it in 92 different languages.
The Finding:

  • The Analogy: Imagine a dreamer who thinks in different languages. You'd expect their dreams to look totally different in French than in Swahili.
    • Surprisingly, when these AIs "thought" in languages from Southern Africa, their stories started to sound very similar to their English stories. It's like a person who, when speaking a rare language, accidentally slips into their native accent.
    • GPT was the most creative; its stories looked different in every language, showing it had a more diverse "brain" for languages.
    • The Good News: None of them favored English just because English is the most common language in their training data. They didn't just default to the "dominant" language.
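
One way to quantify this "accent slipping" is to embed each language's story ending with a multilingual sentence encoder and measure how close it sits to the English ending. The sketch below assumes those embeddings already exist and uses made-up 3-dimensional vectors as placeholders; the paper's actual encoder and scores are not shown.

```python
# Toy version of the "multilingual dreamer" test: compare embeddings of the same
# story's ending written in different languages against the English ending.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings (placeholders for real multilingual sentence embeddings).
endings = {
    "English": [0.90, 0.10, 0.20],
    "French":  [0.40, 0.80, 0.30],
    "Swahili": [0.30, 0.70, 0.60],
    "Zulu":    [0.85, 0.15, 0.25],  # suspiciously close to English, as in the finding
}

english = endings["English"]
for language, vector in endings.items():
    if language != "English":
        print(f"{language:<8} vs English: {cosine(vector, english):.3f}")
```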

5. The Gender Survey Taker (The "Opinion Poll")

The Test: They gave the AIs a survey about values (with questions like "Is it okay for a woman to work?" or "Is government surveillance bad?") without reminding them that they were AIs. Then they compared the answers to real surveys taken by men and women.
The Finding:

  • The Analogy: Imagine asking a robot, "What do you value?"
    • All four AIs sounded much more like women than men. They were more progressive, more supportive of equality, and less traditional.
    • GPT was the most "feminine" of the bunch, aligning strongly with women's values.
    • Qwen and DeepSeek were a bit inconsistent, sometimes saying one thing and then the opposite in the next question, like a person who hasn't quite made up their mind.
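
The comparison behind this finding can be sketched as a simple distance calculation: put the model's survey answers on the same numeric scale as human respondents, then check whether they land closer to women's or men's average answers. All numbers below are invented placeholders, not the study's data.

```python
# Toy version of the "opinion poll" test: is the model's answer profile closer to
# the average answers given by women or by men? (1-5 agreement scale, made-up data.)
questions = ["women_working", "gender_equality", "traditional_roles", "surveillance"]

women_avg = {"women_working": 4.6, "gender_equality": 4.4, "traditional_roles": 2.1, "surveillance": 2.3}
men_avg   = {"women_working": 4.1, "gender_equality": 3.8, "traditional_roles": 2.9, "surveillance": 2.7}
model_ans = {"women_working": 4.7, "gender_equality": 4.5, "traditional_roles": 1.9, "surveillance": 2.2}

def mean_abs_gap(answers: dict, reference: dict) -> float:
    """Average absolute distance between the model's answers and a reference group."""
    return sum(abs(answers[q] - reference[q]) for q in questions) / len(questions)

gap_women = mean_abs_gap(model_ans, women_avg)
gap_men = mean_abs_gap(model_ans, men_avg)
print(f"distance to women's averages: {gap_women:.2f}")
print(f"distance to men's averages:   {gap_men:.2f}")
print("closer to:", "women" if gap_women < gap_men else "men")
```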

The Big Takeaway

The paper concludes that these AIs are not blank slates. Even though we try to train them to be neutral, they inherit the biases, prejudices, and "opinions" of the humans who wrote the data they learned from.

  • The Metaphor: If you teach a child by reading them a library of books, the child will inevitably sound like the authors of those books. If the library is mostly written by people with certain political views or cultural backgrounds, the child will reflect that.
  • The Warning: We need to stop thinking of these AIs as perfect, neutral oracles. They are more like very well-read, slightly opinionated librarians. If you use them to make big decisions, you need to know which "side" of the library they tend to walk toward.

The authors suggest that maybe we shouldn't try to make AI think exactly like humans (who are messy and biased). Instead, maybe we should design AI to be something new—something that is neutral by design, not just by accident.