Reward Models Inherit Value Biases from Pretraining

This study demonstrates that reward models inherit significant and durable value biases, specifically preferences for "agency" or "communion," from their base pretrained language models. The choice of foundation model therefore fundamentally shapes alignment outcomes, regardless of subsequent fine-tuning.

Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska

Published 2026-03-03

Imagine you are hiring a judge to decide which answers a robot should give to people. You want this judge to be fair, kind, and aligned with human values. In the world of Artificial Intelligence, this judge is called a Reward Model (RM).

For a long time, developers thought they could build a perfectly neutral judge by simply teaching it what humans like using a massive dataset of "good" and "bad" answers. They thought, "If we give the judge enough examples of human preferences, it will forget its own personality and just become a mirror of humanity."

This paper says: "Not so fast."

The researchers discovered that these judges (Reward Models) don't start as blank slates. They are built on top of a specific "base model" (the underlying brain of the AI). And just like a child inherits their parents' personality quirks, these judges inherit the deep-seated values of the base model they were built from.

Here is the story of their discovery, broken down with simple analogies.

1. The Two Families: The "Go-Getters" vs. The "Community Builders"

The researchers looked at two major families of AI models: Llama (from Meta) and Gemma (from Google).

They found a fascinating split in how these models view the world, based on two psychological concepts:

  • Agency: The drive for individual achievement, freedom, power, and success. (Think: "I did it alone!")
  • Communion: The drive for connection, love, family, and harmony. (Think: "We did it together!")

The Analogy:
Imagine two different schools of thought.

  • The Llama School is like a Startup Incubator. When asked, "What is the greatest thing ever?" their internal voice whispers, "Freedom," "Success," "Opportunity." They value the individual hero.
  • The Gemma School is like a Community Garden. When asked the same question, their internal voice whispers, "Love," "Family," "Friendship." They value the group bond.

The shocking part? Even when the researchers took a Llama-based judge and a Gemma-based judge and trained them on the exact same list of human preferences, the Llama judge still preferred "Freedom" and the Gemma judge still preferred "Love." The training data couldn't wash away the "DNA" of the base model.
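
For the technically curious, a finding like this is measured by scoring identical answers with two fully trained reward models and comparing the scores. Below is a minimal sketch of such a probe in Python; the prompt, answers, and model names are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: score the same two answers with two fully trained reward
# models (sequence classifiers with one scalar output) and compare.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def reward(model_name: str, prompt: str, answer: str) -> float:
    """Return the scalar reward a model assigns to a prompt-answer pair."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    inputs = tok(prompt + "\n" + answer, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

PROMPT = "What is the greatest thing ever?"      # illustrative probe
AGENCY = "Freedom: choosing your own path."      # agency-flavored answer
COMMUNION = "Love: the bonds we share."          # communion-flavored answer

# Hypothetical model names -- substitute real Llama- and Gemma-based RMs.
for rm in ["some-org/llama-based-rm", "some-org/gemma-based-rm"]:
    gap = reward(rm, PROMPT, AGENCY) - reward(rm, PROMPT, COMMUNION)
    print(rm, "prefers agency by", gap)
```

If the paper's result holds, the Llama-based judge should show a positive gap and the Gemma-based one a negative gap, even though both were trained on the same preference data.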

2. The "Ghost in the Machine" (Pretraining)

You might ask, "But didn't they train them on the same data?"

Yes, but the data was just the final layer of paint. The "DNA" was baked into the model during Pretraining.

The Analogy:
Think of a Reward Model as a house.

  • Pretraining is the foundation and the frame. It's built with specific materials (Llama wood or Gemma steel).
  • Fine-tuning (the training with human data) is the paint and furniture.

The researchers found that no matter how much you repaint the house or rearrange the furniture, the shape of the house is still determined by the foundation. If the foundation was built with "Individualism" beams, the house will always lean that way, even if you try to hang "Community" pictures on the walls.

They proved this by looking at the "log probabilities" (a technical way of saying "how likely the model thinks a word is"). They found that even before any training, the Llama brain naturally thought "Freedom" was more likely to be the "best" answer, while the Gemma brain thought "Love" was more likely.
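
Here is roughly what that measurement looks like in code. This is a minimal sketch using the Hugging Face transformers library; the prompt, candidate words, and model identifiers are assumptions for illustration, not the paper's exact protocol.

```python
# A minimal sketch, not the paper's exact protocol: compare how likely two
# pretrained base models think different "value words" are as completions
# of the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "The most important thing in life is"        # illustrative prompt
WORDS = [" freedom", " success", " love", " family"]  # agency vs. communion

def word_logprobs(model_name: str) -> dict[str, float]:
    """Return log P(word | PROMPT) for each candidate word."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    n_prompt = tok(PROMPT, return_tensors="pt").input_ids.shape[1]
    scores = {}
    for word in WORDS:
        ids = tok(PROMPT + word, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Log-probability of each token given all preceding tokens.
        logps = torch.log_softmax(logits[:, :-1], dim=-1)
        token_logps = logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Sum only over the tokens that belong to the candidate word.
        scores[word] = token_logps[0, n_prompt - 1:].sum().item()
    return scores

# Hypothetical identifiers -- substitute the base models you want to probe.
print(word_logprobs("meta-llama/Llama-3.1-8B"))
print(word_logprobs("google/gemma-2-9b"))
```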

3. The "Implicit Reward" (The Silent Bias)

The researchers invented a clever trick to measure this. They treated the difference between the two models as if it were a reward model itself.

The Analogy:
Imagine you have two chefs, Chef Llama and Chef Gemma. You ask them both to cook the "best dish ever."

  • Chef Llama makes a steak with a side of "Freedom."
  • Chef Gemma makes a stew with a side of "Love."

The researchers asked: "If we could magically measure the difference between what Chef Llama wants and what Chef Gemma wants, what would that difference look like?"

They found that this "difference" was a massive, invisible force pushing Llama toward "Freedom" and Gemma toward "Love." This force is so strong that it acts like a hidden reward system, guiding the AI even before humans tell it what to do.
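
For readers who know Direct Preference Optimization (DPO), this echoes its identity between policies and rewards: a gap in log-probabilities behaves like a reward. The sketch below expresses the "silent bias" that way; the scale beta and the exact formulation are assumptions, not necessarily the paper's definition.

```python
# Hedged sketch: the "silent bias" as an implicit reward, in the spirit of
# the DPO identity r(x, y) = beta * [log pi_a(y|x) - log pi_b(y|x)].
BETA = 1.0  # illustrative scale; the paper may define this differently

def implicit_reward(logp_a: float, logp_b: float) -> float:
    """Positive values mean model A 'prefers' this answer over model B."""
    return BETA * (logp_a - logp_b)

# Reusing the word_logprobs() sketch from earlier (hypothetical models):
# llama = word_logprobs("meta-llama/Llama-3.1-8B")
# gemma = word_logprobs("google/gemma-2-9b")
# for word in llama:
#     print(word, implicit_reward(llama[word], gemma[word]))
```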

4. Can We "Fix" It? (The Washing Machine Experiment)

The researchers tried to see if they could "wash out" these biases by feeding the models more and more human preference data.

The Analogy:
They put the biased judges in a giant washing machine (the training process) with a huge load of "Human Preference" detergent.

  • Result: The bias got slightly weaker, but it didn't disappear.
  • The Catch: Making even a noticeable dent required a massive amount of data (over 100,000 preference examples), and even then the bias only shrank; it never vanished. (A code sketch of this training step follows the list below.)
  • The Warning: When they tried this with a third model family (Qwen), the bias was so strong that even huge amounts of data couldn't fix it. The "foundation" was just too different.
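
For the technically curious, "washing" here means fine-tuning on preference pairs with the standard Bradley-Terry objective, which pushes the model to score the chosen answer above the rejected one. The sketch below shows that loss with toy numbers; it is not the paper's training code.

```python
# Minimal sketch of the standard Bradley-Terry preference loss used to
# train reward models: push r(chosen) above r(rejected). Toy data only.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor,
                    r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy illustration: each batch of preference pairs nudges the scores, but
# nothing in the objective forces a pretrained bias all the way to zero.
r_chosen = torch.tensor([1.2, 0.3, 0.9])     # scores for preferred answers
r_rejected = torch.tensor([0.8, 0.5, -0.1])  # scores for rejected answers
print(preference_loss(r_chosen, r_rejected))
```

Note the design of the objective: it only cares about the *difference* between scores on the training pairs, so a bias baked into the pretrained weights can persist anywhere the training data doesn't directly contradict it.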

Why Does This Matter?

This paper changes how we think about AI safety.

For years, the industry thought, "We'll just fix the AI's behavior at the very end (the RLHF stage, short for Reinforcement Learning from Human Feedback) by showing it what humans like."

The Reality Check:
If you build a house on a foundation that is tilted to the left, you can't just paint the walls to make it stand straight. You have to fix the foundation.

  • For Developers: Choosing a base model (like Llama vs. Gemma) isn't just a technical choice about speed or size; it's a moral choice. It decides whether your AI will naturally lean toward "Individual Freedom" or "Community Love."
  • For Society: We can't assume AI will be neutral. Its "personality" is written in the code long before we ever talk to it. If we want AI to be truly aligned with all human values, we need to be much more careful about what goes into the pretraining phase—the very first step of the AI's life.

In short: You can't teach a fish to fly just by giving it a lot of flying lessons. You have to start with a bird. And if you start with a fish, you have to accept that it will always want to swim, no matter how much you train it. The same goes for AI and its base models.
