Diverging Preferences: When do Annotators Disagree and do Models Know?

This paper challenges the assumption that annotator disagreements in preference datasets are mere noise by categorizing their diverse sources, demonstrating how standard reward modeling and evaluation methods fail to account for these divergences, and proposing new techniques to identify and mitigate their impact on LLM training and assessment.

Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin

Published 2026-03-04
📖 4 min read · ☕ Coffee break read

Imagine you are a head chef trying to create the perfect menu for a restaurant. You ask a panel of 10 food critics to taste two different dishes (Dish A and Dish B) and tell you which one is better.

In the world of Artificial Intelligence (AI), this is exactly what happens when we train Large Language Models (LLMs). We ask human "critics" (annotators) to judge AI responses so the AI can learn what humans like.

The Problem: The Critics Can't Agree
The paper "Diverging Preferences" identifies a major issue: the critics often disagree, and not just because they made mistakes.

Sometimes, one critic loves a dish because it's spicy and bold, while another hates it because they prefer something mild. Both critics are "right" based on their own taste. However, for a long time, AI developers assumed that if critics disagreed, it was just "noise" or confusion. They thought, "If 6 people say A is better and 4 say B is better, we just pick A and ignore the 4."

The authors of this paper say: "Wait a minute! That 40% isn't noise. That's a real difference in human taste!"

The "Why" Behind the Disagreement

The researchers created a "menu" of reasons why critics disagree. Here are the main flavors of disagreement they found:

  1. The Prompt Was Vague (Task Underspecification): Imagine the chef asks, "Make me a sandwich." One critic wants a turkey club; another wants a peanut butter and jelly. Neither is wrong; the request was just too open-ended.
  2. Style Wars (Response Style): One critic loves a 5-page essay with fancy formatting; another prefers a quick, 3-sentence answer. It's not about what was said, but how it was said.
  3. Safety vs. Helpfulness: One critic says, "Don't answer that, it's dangerous!" Another says, "Answer it, but explain why it's dangerous." They both want safety, but they disagree on the method.
  4. Personal Taste (Aesthetic): One person likes poetry written in a silly, rhyming style; another finds it annoying.

The AI's Mistake: The "Tyranny of the Majority"

Because AI developers treated disagreement as "noise," they built AI models that act like a tyrannical majority.

  • The Old Way: If 60% of people prefer a long, detailed answer, the AI learns to always be long and detailed, even if 40% of people wanted a short answer. The AI becomes a "one-size-fits-all" robot that ignores the minority.
  • The Result: The AI becomes bad at handling tricky situations. If a user asks a vague question, the AI might confidently give a wrong answer (because it learned to guess) instead of asking, "Could you clarify?" (which some humans prefer).
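The "Old Way" above can be made concrete with a tiny sketch. This is an illustrative example (the vote counts and variable names are hypothetical, not from the paper's dataset): majority-vote labeling collapses ten annotator votes into one hard label, while keeping the distribution preserves the 40% minority signal.

```python
# Hypothetical example: annotator votes on one (Dish A vs. Dish B) comparison.
votes = ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]  # 6 vs. 4

# The "old way": collapse to a single hard label. The 40% minority vanishes.
majority_label = max(set(votes), key=votes.count)

# Keeping the distribution instead preserves the disagreement as a signal.
p_a = votes.count("A") / len(votes)
preference_distribution = {"A": p_a, "B": 1 - p_a}

print(majority_label)           # "A"
print(preference_distribution)  # {"A": 0.6, "B": 0.4}
```

Trained only on `majority_label`, a reward model has no way to tell a 6-4 split from a 10-0 consensus; the distribution makes that difference visible.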

The Solution: Teaching AI to "Read the Room"

The paper proposes a new way to train AI, which they call Distributional Rewards.

Think of it like this:

  • Old AI: "I will give you a single score: 8/10. That's the final grade."
  • New AI: "I will give you a range. Some people will love this (10/10), some will hate it (2/10), and the average is 6/10. I know this is a polarizing answer."
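The "range instead of a single score" idea can be sketched as a reward head that outputs a probability distribution over rating levels rather than one scalar. This is a minimal illustration of the concept, not the paper's actual implementation; the function name, rating scale, and example distributions are all assumptions.

```python
import math

def summarize(rating_probs):
    """Given a probability distribution over ratings 1..10,
    return the mean rating and its spread (standard deviation)."""
    ratings = range(1, 11)
    mean = sum(r * p for r, p in zip(ratings, rating_probs))
    var = sum(p * (r - mean) ** 2 for r, p in zip(ratings, rating_probs))
    return mean, math.sqrt(var)

# A consensus answer: nearly everyone rates it around 8/10.
consensus = [0, 0, 0, 0, 0, 0.05, 0.15, 0.6, 0.15, 0.05]
# A polarizing answer: some rate it 2/10, others 10/10 (as in the text above).
polarizing = [0, 0.45, 0, 0, 0, 0, 0, 0, 0, 0.55]

mean_c, spread_c = summarize(consensus)    # mean 8.0, small spread
mean_p, spread_p = summarize(polarizing)   # mean 6.4, large spread
```

Two answers can have similar average scores while one is consensual and the other divisive; the spread is exactly the signal a single scalar reward throws away.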

By teaching the AI to understand that disagreement is a feature, not a bug, the AI learns to:

  1. Recognize when a topic is divisive.
  2. Know when to ask for clarification instead of guessing.
  3. Understand that there isn't always one "perfect" answer, but rather many valid ones depending on who is asking.

The "Judge" Problem

The paper also looked at how we test these AI models. Currently, we use other AIs (called "LLM-as-Judge") to grade the responses.

The researchers found that these "Judge AIs" are biased: they favor long, compliant answers that confidently pick a side. If a model decides to say, "I can't answer that safely," or asks for clarification, the Judge AI often marks it as a failure. This punishes AI models that are trying to be safe and pluralistic (respecting different views).

The Big Takeaway

This paper is a wake-up call for the AI world. It tells us:

  • Humans are messy. We don't all agree, and that's okay.
  • AI shouldn't force a single opinion. It should learn to handle the fact that different people want different things.
  • We need better tools. We need AI that can say, "I see that some people like this and others don't," rather than just picking a winner and ignoring the losers.

In short, the paper argues that to build AI that truly serves everyone, we have to stop pretending that everyone wants the same thing. We need to teach AI to embrace the chaos of human disagreement.