Reward Models Inherit Value Biases from Pretraining

This study demonstrates that reward models inherit significant and durable value biases, specifically preferences for "agency" or "communion," from their base pretrained language models. The choice of foundation model therefore fundamentally shapes alignment outcomes, regardless of subsequent fine-tuning.

Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska

Published 2026-03-03

Imagine you are hiring a judge to decide which answers a robot should give to people. You want this judge to be fair, kind, and aligned with human values. In the world of Artificial Intelligence, this judge is called a Reward Model (RM).

For a long time, developers thought they could build a perfectly neutral judge by simply teaching it what humans like using a massive dataset of "good" and "bad" answers. They thought, "If we give the judge enough examples of human preferences, it will forget its own personality and just become a mirror of humanity."

This paper says: "Not so fast."

The researchers discovered that these judges (Reward Models) don't start as blank slates. They are built on top of a specific "base model" (the underlying brain of the AI). And just like a child inherits their parents' personality quirks, these judges inherit the deep-seated values of the base model they were built from.

Here is the story of their discovery, broken down with simple analogies.

1. The Two Families: The "Go-Getters" vs. The "Community Builders"

The researchers looked at two major families of AI models: Llama (from Meta) and Gemma (from Google).

They found a fascinating split in how these models view the world, based on two psychological concepts:

  • Agency: The drive for individual achievement, freedom, power, and success. (Think: "I did it alone!")
  • Communion: The drive for connection, love, family, and harmony. (Think: "We did it together!")

The Analogy:
Imagine two different schools of thought.

  • The Llama School is like a Startup Incubator. When asked, "What is the greatest thing ever?" their internal voice whispers, "Freedom," "Success," "Opportunity." They value the individual hero.
  • The Gemma School is like a Community Garden. When asked the same question, their internal voice whispers, "Love," "Family," "Friendship." They value the group bond.

The shocking part? Even when the researchers took a Llama-based judge and a Gemma-based judge and trained them on the exact same list of human preferences, the Llama judge still preferred "Freedom" and the Gemma judge still preferred "Love." The training data couldn't wash away the "DNA" of the base model.
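
For the technically curious, a finding like this is measured by scoring identical answers with two fully trained reward models and comparing the scores. Below is a minimal sketch of such a probe in Python; the prompt, answers, and model names are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: score the same two answers with two fully trained reward
# models (sequence classifiers with one scalar output) and compare.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def reward(model_name: str, prompt: str, answer: str) -> float:
    """Return the scalar reward a model assigns to a prompt-answer pair."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    inputs = tok(prompt + "\n" + answer, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

PROMPT = "What is the greatest thing ever?"      # illustrative probe
AGENCY = "Freedom: choosing your own path."      # agency-flavored answer
COMMUNION = "Love: the bonds we share."          # communion-flavored answer

# Hypothetical model names -- substitute real Llama- and Gemma-based RMs.
for rm in ["some-org/llama-based-rm", "some-org/gemma-based-rm"]:
    gap = reward(rm, PROMPT, AGENCY) - reward(rm, PROMPT, COMMUNION)
    print(rm, "prefers agency by", gap)
```

If the paper's result holds, the Llama-based judge should show a positive gap and the Gemma-based one a negative gap, even though both were trained on the same preference data.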

2. The "Ghost in the Machine" (Pretraining)

You might ask, "But didn't they train them on the same data?"

Yes, but the data was just the final layer of paint. The "DNA" was baked into the model during Pretraining.

The Analogy:
Think of a Reward Model as a house.

  • Pretraining is the foundation and the frame. It's built with specific materials (Llama wood or Gemma steel).
  • Fine-tuning (the training with human data) is the paint and furniture.

The researchers found that no matter how much you repaint the house or rearrange the furniture, the shape of the house is still determined by the foundation. If the foundation was built with "Individualism" beams, the house will always lean that way, even if you try to hang "Community" pictures on the walls.

They proved this by looking at the "log probabilities" (a technical way of saying "how likely the model thinks a word is"). They found that even before any training, the Llama brain naturally thought "Freedom" was more likely to be the "best" answer, while the Gemma brain thought "Love" was more likely.
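
Here is roughly what that measurement looks like in code. This is a minimal sketch using the Hugging Face transformers library; the prompt, candidate words, and model identifiers are assumptions for illustration, not the paper's exact protocol.

```python
# A minimal sketch, not the paper's exact protocol: compare how likely two
# pretrained base models think different "value words" are as completions
# of the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "The most important thing in life is"        # illustrative prompt
WORDS = [" freedom", " success", " love", " family"]  # agency vs. communion

def word_logprobs(model_name: str) -> dict[str, float]:
    """Return log P(word | PROMPT) for each candidate word."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    n_prompt = tok(PROMPT, return_tensors="pt").input_ids.shape[1]
    scores = {}
    for word in WORDS:
        ids = tok(PROMPT + word, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Log-probability of each token given all preceding tokens.
        logps = torch.log_softmax(logits[:, :-1], dim=-1)
        token_logps = logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Sum only over the tokens that belong to the candidate word.
        scores[word] = token_logps[0, n_prompt - 1:].sum().item()
    return scores

# Hypothetical identifiers -- substitute the base models you want to probe.
print(word_logprobs("meta-llama/Llama-3.1-8B"))
print(word_logprobs("google/gemma-2-9b"))
```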

3. The "Implicit Reward" (The Silent Bias)

The researchers invented a clever trick to measure this. They treated the difference between the two models as if it were a reward model itself.

The Analogy:
Imagine you have two chefs, Chef Llama and Chef Gemma. You ask them both to cook the "best dish ever."

  • Chef Llama makes a steak with a side of "Freedom."
  • Chef Gemma makes a stew with a side of "Love."

The researchers asked: "If we could magically measure the difference between what Chef Llama wants and what Chef Gemma wants, what would that difference look like?"

They found that this "difference" was a massive, invisible force pushing Llama toward "Freedom" and Gemma toward "Love." This force is so strong that it acts like a hidden reward system, guiding the AI even before humans tell it what to do.
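
For readers who know Direct Preference Optimization (DPO), this echoes its identity between policies and rewards: a gap in log-probabilities behaves like a reward. The sketch below expresses the "silent bias" that way; the scale beta and the exact formulation are assumptions, not necessarily the paper's definition.

```python
# Hedged sketch: the "silent bias" as an implicit reward, in the spirit of
# the DPO identity r(x, y) = beta * [log pi_a(y|x) - log pi_b(y|x)].
BETA = 1.0  # illustrative scale; the paper may define this differently

def implicit_reward(logp_a: float, logp_b: float) -> float:
    """Positive values mean model A 'prefers' this answer over model B."""
    return BETA * (logp_a - logp_b)

# Reusing the word_logprobs() sketch from earlier (hypothetical models):
# llama = word_logprobs("meta-llama/Llama-3.1-8B")
# gemma = word_logprobs("google/gemma-2-9b")
# for word in llama:
#     print(word, implicit_reward(llama[word], gemma[word]))
```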

4. Can We "Fix" It? (The Washing Machine Experiment)

The researchers tried to see if they could "wash out" these biases by feeding the models more and more human preference data.

The Analogy:
They put the biased judges in a giant washing machine (the training process) with a huge load of "Human Preference" detergent.

  • Result: The bias got slightly weaker, but it didn't disappear.
  • The Catch: Making even a noticeable dent required a massive amount of data (over 100,000 preference examples), and even then the bias only shrank; it never vanished. (A code sketch of this training step follows the list below.)
  • The Warning: When they tried this with a third model family (Qwen), the bias was so strong that even huge amounts of data couldn't fix it. The "foundation" was just too different.
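
For the technically curious, "washing" here means fine-tuning on preference pairs with the standard Bradley-Terry objective, which pushes the model to score the chosen answer above the rejected one. The sketch below shows that loss with toy numbers; it is not the paper's training code.

```python
# Minimal sketch of the standard Bradley-Terry preference loss used to
# train reward models: push r(chosen) above r(rejected). Toy data only.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor,
                    r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy illustration: each batch of preference pairs nudges the scores, but
# nothing in the objective forces a pretrained bias all the way to zero.
r_chosen = torch.tensor([1.2, 0.3, 0.9])     # scores for preferred answers
r_rejected = torch.tensor([0.8, 0.5, -0.1])  # scores for rejected answers
print(preference_loss(r_chosen, r_rejected))
```

Note the design of the objective: it only cares about the *difference* between scores on the training pairs, so a bias baked into the pretrained weights can persist anywhere the training data doesn't directly contradict it.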

Why Does This Matter?

This paper changes how we think about AI safety.

For years, the industry thought, "We'll just fix the AI's behavior at the very end (the RLHF stage, short for Reinforcement Learning from Human Feedback) by showing it what humans like."

The Reality Check:
If you build a house on a foundation that is tilted to the left, you can't just paint the walls to make it stand straight. You have to fix the foundation.

  • For Developers: Choosing a base model (like Llama vs. Gemma) isn't just a technical choice about speed or size; it's a moral choice. It decides whether your AI will naturally lean toward "Individual Freedom" or "Community Love."
  • For Society: We can't assume AI will be neutral. Its "personality" is written in the code long before we ever talk to it. If we want AI to be truly aligned with all human values, we need to be much more careful about what goes into the pretraining phase—the very first step of the AI's life.

In short: You can't teach a fish to fly just by giving it a lot of flying lessons. You have to start with a bird. And if you start with a fish, you have to accept that it will always want to swim, no matter how much you train it. The same goes for AI and its base models.
