Diverging Preferences: When do Annotators Disagree and do Models Know?

This paper challenges the assumption that annotator disagreements in preference datasets are mere noise by categorizing their diverse sources, demonstrating how standard reward modeling and evaluation methods fail to account for these divergences, and proposing new techniques to identify and mitigate their impact on LLM training and assessment.

Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin

Published 2026-03-04
📖 4 min read · ☕ Coffee break read

Imagine you are a head chef trying to create the perfect menu for a restaurant. You ask a panel of 10 food critics to taste two different dishes (Dish A and Dish B) and tell you which one is better.

In the world of Artificial Intelligence (AI), this is exactly what happens when we train Large Language Models (LLMs). We ask human "critics" (annotators) to judge AI responses so the AI can learn what humans like.

The Problem: The Critics Can't Agree
The paper "Diverging Preferences" identifies a major issue: the critics often disagree, and not just because they made mistakes.

Sometimes, one critic loves a dish because it's spicy and bold, while another hates it because they prefer something mild. Both critics are "right" based on their own taste. However, for a long time, AI developers assumed that if critics disagreed, it was just "noise" or confusion. They thought, "If 6 people say A is better and 4 say B is better, we just pick A and ignore the 4."

The authors of this paper say: "Wait a minute! That 40% isn't noise. That's a real difference in human taste!"

The "Why" Behind the Disagreement

The researchers created a "menu" of reasons why critics disagree. Here are the main flavors of disagreement they found:

  1. The Prompt Was Vague (Task Underspecification): Imagine the chef asks, "Make me a sandwich." One critic wants a turkey club; another wants a peanut butter and jelly. Neither is wrong; the request was just too open-ended.
  2. Style Wars (Response Style): One critic loves a 5-page essay with fancy formatting; another prefers a quick, 3-sentence answer. It's not about what was said, but how it was said.
  3. Safety vs. Helpfulness: One critic says, "Don't answer that, it's dangerous!" Another says, "Answer it, but explain why it's dangerous." They both want safety, but they disagree on the method.
  4. Personal Taste (Aesthetic): One person likes poetry written in a silly, rhyming style; another finds it annoying.

The AI's Mistake: The "Tyranny of the Majority"

Because AI developers treated disagreement as "noise," they built AI models that act like a tyrannical majority.

  • The Old Way: If 60% of people prefer a long, detailed answer, the AI learns to always be long and detailed, even if 40% of people wanted a short answer. The AI becomes a "one-size-fits-all" robot that ignores the minority.
  • The Result: The AI becomes bad at handling tricky situations. If a user asks a vague question, the AI might confidently give a wrong answer (because it learned to guess) instead of asking, "Could you clarify?" (which some humans prefer).
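The "Old Way" above can be made concrete with a tiny sketch. This is an illustrative example (the vote counts and variable names are hypothetical, not from the paper's dataset): majority-vote labeling collapses ten annotator votes into one hard label, while keeping the distribution preserves the 40% minority signal.

```python
# Hypothetical example: annotator votes on one (Dish A vs. Dish B) comparison.
votes = ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]  # 6 vs. 4

# The "old way": collapse to a single hard label. The 40% minority vanishes.
majority_label = max(set(votes), key=votes.count)

# Keeping the distribution instead preserves the disagreement as a signal.
p_a = votes.count("A") / len(votes)
preference_distribution = {"A": p_a, "B": 1 - p_a}

print(majority_label)           # "A"
print(preference_distribution)  # {"A": 0.6, "B": 0.4}
```

Trained only on `majority_label`, a reward model has no way to tell a 6-4 split from a 10-0 consensus; the distribution makes that difference visible.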

The Solution: Teaching AI to "Read the Room"

The paper proposes a new way to train AI, which they call Distributional Rewards.

Think of it like this:

  • Old AI: "I will give you a single score: 8/10. That's the final grade."
  • New AI: "I will give you a range. Some people will love this (10/10), some will hate it (2/10), and the average is 6/10. I know this is a polarizing answer."
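The "range instead of a single score" idea can be sketched as a reward head that outputs a probability distribution over rating levels rather than one scalar. This is a minimal illustration of the concept, not the paper's actual implementation; the function name, rating scale, and example distributions are all assumptions.

```python
import math

def summarize(rating_probs):
    """Given a probability distribution over ratings 1..10,
    return the mean rating and its spread (standard deviation)."""
    ratings = range(1, 11)
    mean = sum(r * p for r, p in zip(ratings, rating_probs))
    var = sum(p * (r - mean) ** 2 for r, p in zip(ratings, rating_probs))
    return mean, math.sqrt(var)

# A consensus answer: nearly everyone rates it around 8/10.
consensus = [0, 0, 0, 0, 0, 0.05, 0.15, 0.6, 0.15, 0.05]
# A polarizing answer: some rate it 2/10, others 10/10 (as in the text above).
polarizing = [0, 0.45, 0, 0, 0, 0, 0, 0, 0, 0.55]

mean_c, spread_c = summarize(consensus)    # mean 8.0, small spread
mean_p, spread_p = summarize(polarizing)   # mean 6.4, large spread
```

Two answers can have similar average scores while one is consensual and the other divisive; the spread is exactly the signal a single scalar reward throws away.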

By teaching the AI to understand that disagreement is a feature, not a bug, the AI learns to:

  1. Recognize when a topic is divisive.
  2. Know when to ask for clarification instead of guessing.
  3. Understand that there isn't always one "perfect" answer, but rather many valid ones depending on who is asking.

The "Judge" Problem

The paper also looked at how we test these AI models. Currently, we use other AIs (called "LLM-as-Judge") to grade the responses.

The researchers found that these "Judge AIs" are biased: they favor long, compliant answers that confidently pick a side. If a model decides to say, "I can't answer that safely," or asks for clarification, the Judge AI often marks it as a failure. This punishes AI models that are trying to be safe and pluralistic (respecting different views).

The Big Takeaway

This paper is a wake-up call for the AI world. It tells us:

  • Humans are messy. We don't all agree, and that's okay.
  • AI shouldn't force a single opinion. It should learn to handle the fact that different people want different things.
  • We need better tools. We need AI that can say, "I see that some people like this and others don't," rather than just picking a winner and ignoring the losers.

In short, the paper argues that to build AI that truly serves everyone, we have to stop pretending that everyone wants the same thing. We need to teach AI to embrace the chaos of human disagreement.