DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

The paper introduces DARC, a retraining-free, inference-time method that mitigates the brittleness of standard preference alignment. It frames response selection as a distributionally robust, risk-sensitive decision problem, explicitly managing annotator disagreement and tail risk without sacrificing average quality.

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Published 2026-03-10

Imagine you are a chef running a very popular restaurant. You have a team of food critics (the "annotators") who taste your dishes and give them a score from 0 to 10.

In the past, when training your AI chef (the Large Language Model), the goal was simple: Make the dish that gets the highest average score.

If 50 critics taste a spicy curry, and 49 love it (score 10) but 1 hates it because they can't eat spice (score 1), the average is 9.8. The AI thinks, "Great! I'll make this curry every time!"

The Problem:
The real world isn't that simple. Sometimes, the critics are deeply divided.

  • Scenario A: Everyone agrees the soup is delicious (Average: 8, Disagreement: 0).
  • Scenario B: Half the critics think the soup is a masterpiece (10), and the other half think it's inedible (2). The average is still 6.

If the AI compares dishes only by their averages, Scenario B looks deceptively acceptable: the 6 hides the fact that serving that soup to a random customer is a coin flip, with a 50% chance they will hate it. This is called proxy over-optimization: the AI learns to game the system by picking polarizing answers that look good on paper but fail in reality.
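A few lines of Python make the contrast concrete. Using the numbers from the two scenarios above, the average alone cannot tell a safe dish from a gamble; only the spread of the scores reveals the difference:

```python
import statistics

# Per-critic scores for the two soups (numbers from the scenarios above)
soup_a = [8] * 10              # everyone agrees: average 8, zero disagreement
soup_b = [10] * 5 + [2] * 5    # critics split down the middle: average 6

for name, scores in [("A", soup_a), ("B", soup_b)]:
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population std dev as "disagreement"
    print(f"Soup {name}: mean={mean:.1f}, disagreement={spread:.1f}")
# Soup A: mean=8.0, disagreement=0.0
# Soup B: mean=6.0, disagreement=4.0
```

Two single numbers summarize each dish, but only the second one tells you whether the first is trustworthy.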

Enter DARC: The "Risk-Aware" Sommelier

The paper introduces DARC (Disagreement-Aware Alignment via Risk-Constrained Decoding). Think of DARC not as a new chef, but as a smart sommelier who steps in right before the dish is served to the customer.

DARC doesn't retrain the chef. Instead, it looks at the list of dishes the chef has already prepared (the "candidates") and uses a new rule to pick the winner.

How DARC Works (The Analogy)

Imagine the chef has prepared 10 different versions of a response to a tricky question. DARC looks at them through two lenses:

  1. The "Average Taste" Lens: How good is it on average?
  2. The "Disagreement" Lens: How much do the critics fight about it?

The DARC Rule:

"I don't just want the dish with the highest average score. I want the dish that is consistently good and doesn't make people angry."

If Dish A has an average score of 8.5 but a few critics gave it a 1 (high disagreement), DARC says, "Too risky! That's a gamble."
If Dish B has an average score of 8.2 but everyone gave it an 8 or 9 (low disagreement), DARC says, "Safe bet! Let's serve this one."
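One simple way to encode this rule is a mean-minus-penalty score: subtract some multiple of the critics' disagreement from the average before comparing dishes. The penalty form below (mean − λ·std) is an illustrative sketch, not the paper's exact objective, and the dish scores are made up:

```python
import statistics

def darc_style_pick(candidates, lam=1.0):
    """Pick the candidate with the best average score minus a
    disagreement penalty (lam * std). Illustrative assumption:
    the paper's actual risk-sensitive objective may differ."""
    def penalized(scores):
        return statistics.mean(scores) - lam * statistics.pstdev(scores)
    return max(candidates, key=lambda item: penalized(item[1]))

dishes = {
    "A": [10, 10, 10, 1, 10, 10],  # mean 8.5, but one furious critic
    "B": [8, 9, 8, 8, 9, 8],       # mean ~8.3, everyone content
}
best = darc_style_pick(dishes.items())
print(best[0])  # → "B": the stable dish wins once disagreement is penalized
```

With λ = 0 this collapses back to "chase the average"; raising λ makes the sommelier increasingly conservative.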

The Secret Sauce: "Risk Budgets"

The paper introduces a concept called Risk-Constrained Decoding. Imagine you have a "Risk Budget."

  • The Old Way: "Pick the highest score, no matter what." (Like betting your whole savings on a coin flip).
  • The DARC Way: "Pick the highest score, but you are only allowed to pick a dish if the critics' disagreement is below a certain limit."

If the critics are screaming at each other about a dish (high disagreement), DARC treats that dish as "expensive" in terms of risk. It forces the AI to choose a slightly less "perfect" but much more "stable" answer.
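The budget version can be sketched as a hard filter: discard every dish whose disagreement exceeds the budget, then serve the best average among what remains. The "std ≤ budget" constraint and the fallback rule are illustrative stand-ins for the paper's exact risk constraint:

```python
import statistics

def risk_constrained_pick(candidates, risk_budget=1.0):
    """Among candidates whose disagreement (std of critic scores)
    fits within the risk budget, serve the one with the best average.
    Sketch only: the paper's actual constraint may be formulated
    differently."""
    feasible = [c for c in candidates
                if statistics.pstdev(c[1]) <= risk_budget]
    if not feasible:
        # Nothing fits the budget: fall back to the least risky dish.
        return min(candidates, key=lambda c: statistics.pstdev(c[1]))
    return max(feasible, key=lambda c: statistics.mean(c[1]))

menu = [
    ("risky curry", [10, 10, 10, 1, 10, 10]),  # higher mean, critics divided
    ("steady soup", [8, 9, 8, 8, 9, 8]),       # slightly lower mean, calm room
]
print(risk_constrained_pick(menu, risk_budget=1.0)[0])  # → "steady soup"
```

Unlike the penalty version, the budget is interpretable on its own: "I will never serve a dish the critics fight about this much," regardless of how tasty it is on average.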

Why This Matters in Real Life

The authors tested this on real AI models. Here is what they found:

  1. Less Polarization: When people ask controversial questions (like politics), standard AI models often pick an answer that sounds confident but makes half the readers furious. DARC picks a calmer, more balanced answer that satisfies almost everyone.
    • Analogy: Instead of a politician shouting a slogan that makes half the crowd cheer and the other half boo, DARC picks the moderate policy that keeps the whole room happy.
  2. Fewer "Hallucinations": Sometimes AI makes things up. If the AI makes up a song lyric, some critics might think it's creative, while others think it's a lie. DARC sees this high disagreement and avoids the risky, made-up answer, choosing a truthful "I don't know" instead.
  3. No Retraining Needed: This is the best part. You don't need to teach the AI a new way of thinking (which takes months and millions of dollars). You just change the selection rule at the very end. It's like changing the menu board, not the kitchen.

The "Proxy" Trick

You might ask: "How does DARC know the critics disagree if we haven't asked 50 people yet?"

DARC uses a clever trick called a Proxy. Instead of waiting for 50 humans to taste the soup, the AI asks itself: "If I slightly change the wording of this answer (like adding a comma or changing a word), does the 'taste score' change wildly?"

  • If a tiny change makes the score jump from 2 to 10, the AI knows the answer is fragile and controversial.
  • If the score stays steady, the answer is robust.

DARC uses this "fragility" as a stand-in for human disagreement. It's like a sommelier shaking a bottle of wine; if it foams and splatters everywhere, they know it's unstable and won't serve it.
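The bottle-shaking test can be sketched as follows: score a response, score a few small rewordings of it, and use the spread of those scores as the fragility estimate. Everything here is a toy stand-in; the real reward model, the specific perturbations, and how the paper combines the scores are assumptions for illustration:

```python
import statistics

def fragility(response, reward_fn, perturb_fns):
    """Proxy for annotator disagreement: score small rewordings of a
    response and measure how much the score swings. reward_fn and
    perturb_fns are toy stand-ins for a real reward model and the
    paper's actual perturbation scheme."""
    scores = [reward_fn(response)]
    scores += [reward_fn(p(response)) for p in perturb_fns]
    return statistics.pstdev(scores)

# Toy reward model: a brittle scorer that keys on one exact word.
def toy_reward(text):
    return 10.0 if "definitely" in text else 4.0

perturbs = [
    lambda t: t.replace("definitely", "probably"),  # soften one claim
    lambda t: t + ".",                              # cosmetic tweak
]

print(fragility("It is definitely true", toy_reward, perturbs))  # high: score swings
print(fragility("I am not sure", toy_reward, perturbs))          # 0.0: score is steady
```

A confident overclaim scores wildly differently once one word is softened, so it is flagged as fragile; the hedged answer's score never moves, so it is treated as robust.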

Summary

DARC is a safety net for AI. It stops the AI from chasing "average" scores that hide dangerous disagreements. It forces the AI to choose answers that are reliable and safe, ensuring that when you ask a question, you get an answer that won't accidentally offend half the people reading it.

It's the difference between a gambler hoping for a jackpot and a prudent investor building a stable portfolio.