DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

The paper introduces DARC, a retraining-free, inference-time method that mitigates the brittleness of standard preference alignment. It frames response selection as a distributionally robust, risk-sensitive decision problem, explicitly managing annotator disagreement and tail risk without sacrificing average quality.

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Published 2026-03-10

Imagine you are a chef running a very popular restaurant. You have a team of food critics (the "annotators") who taste your dishes and give them a score from 0 to 10.

In the past, when training your AI chef (the Large Language Model), the goal was simple: Make the dish that gets the highest average score.

If 50 critics taste a spicy curry, and 49 love it (score 10) but 1 hates it because they can't eat spice (score 1), the average is 9.8. The AI thinks, "Great! I'll make this curry every time!"

The Problem:
The real world isn't that simple. Sometimes, the critics are deeply divided.

  • Scenario A: Everyone agrees the soup is delicious (Average: 8, Disagreement: 0).
  • Scenario B: Half the critics think the soup is a masterpiece (10), and the other half think it's inedible (2). The average is still 6.

If the AI compares dishes only by their averages, Scenario B looks deceptively acceptable: the 6 hides the fact that serving that soup to a random customer is a coin flip, with a 50% chance they will hate it. This is called proxy over-optimization: the AI learns to game the system by picking polarizing answers that look good on paper but fail in reality.
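A few lines of Python make the contrast concrete. Using the numbers from the two scenarios above, the average alone cannot tell a safe dish from a gamble; only the spread of the scores reveals the difference:

```python
import statistics

# Per-critic scores for the two soups (numbers from the scenarios above)
soup_a = [8] * 10              # everyone agrees: average 8, zero disagreement
soup_b = [10] * 5 + [2] * 5    # critics split down the middle: average 6

for name, scores in [("A", soup_a), ("B", soup_b)]:
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population std dev as "disagreement"
    print(f"Soup {name}: mean={mean:.1f}, disagreement={spread:.1f}")
# Soup A: mean=8.0, disagreement=0.0
# Soup B: mean=6.0, disagreement=4.0
```

Two single numbers summarize each dish, but only the second one tells you whether the first is trustworthy.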

Enter DARC: The "Risk-Aware" Sommelier

The paper introduces DARC (Disagreement-Aware Alignment via Risk-Constrained Decoding). Think of DARC not as a new chef, but as a smart sommelier who steps in right before the dish is served to the customer.

DARC doesn't retrain the chef. Instead, it looks at the list of dishes the chef has already prepared (the "candidates") and uses a new rule to pick the winner.

How DARC Works (The Analogy)

Imagine the chef has prepared 10 different versions of a response to a tricky question. DARC looks at them through two lenses:

  1. The "Average Taste" Lens: How good is it on average?
  2. The "Disagreement" Lens: How much do the critics fight about it?

The DARC Rule:

"I don't just want the dish with the highest average score. I want the dish that is consistently good and doesn't make people angry."

If Dish A has an average score of 8.5 but a few critics gave it a 1 (high disagreement), DARC says, "Too risky! That's a gamble."
If Dish B has an average score of 8.2 but everyone gave it an 8 or 9 (low disagreement), DARC says, "Safe bet! Let's serve this one."
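One simple way to encode this rule is a mean-minus-penalty score: subtract some multiple of the critics' disagreement from the average before comparing dishes. The penalty form below (mean − λ·std) is an illustrative sketch, not the paper's exact objective, and the dish scores are made up:

```python
import statistics

def darc_style_pick(candidates, lam=1.0):
    """Pick the candidate with the best average score minus a
    disagreement penalty (lam * std). Illustrative assumption:
    the paper's actual risk-sensitive objective may differ."""
    def penalized(scores):
        return statistics.mean(scores) - lam * statistics.pstdev(scores)
    return max(candidates, key=lambda item: penalized(item[1]))

dishes = {
    "A": [10, 10, 10, 1, 10, 10],  # mean 8.5, but one furious critic
    "B": [8, 9, 8, 8, 9, 8],       # mean ~8.3, everyone content
}
best = darc_style_pick(dishes.items())
print(best[0])  # → "B": the stable dish wins once disagreement is penalized
```

With λ = 0 this collapses back to "chase the average"; raising λ makes the sommelier increasingly conservative.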

The Secret Sauce: "Risk Budgets"

The paper introduces a concept called Risk-Constrained Decoding. Imagine you have a "Risk Budget."

  • The Old Way: "Pick the highest score, no matter what." (Like betting your whole savings on a coin flip).
  • The DARC Way: "Pick the highest score, but you are only allowed to pick a dish if the critics' disagreement is below a certain limit."

If the critics are screaming at each other about a dish (high disagreement), DARC treats that dish as "expensive" in terms of risk. It forces the AI to choose a slightly less "perfect" but much more "stable" answer.
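The budget version can be sketched as a hard filter: discard every dish whose disagreement exceeds the budget, then serve the best average among what remains. The "std ≤ budget" constraint and the fallback rule are illustrative stand-ins for the paper's exact risk constraint:

```python
import statistics

def risk_constrained_pick(candidates, risk_budget=1.0):
    """Among candidates whose disagreement (std of critic scores)
    fits within the risk budget, serve the one with the best average.
    Sketch only: the paper's actual constraint may be formulated
    differently."""
    feasible = [c for c in candidates
                if statistics.pstdev(c[1]) <= risk_budget]
    if not feasible:
        # Nothing fits the budget: fall back to the least risky dish.
        return min(candidates, key=lambda c: statistics.pstdev(c[1]))
    return max(feasible, key=lambda c: statistics.mean(c[1]))

menu = [
    ("risky curry", [10, 10, 10, 1, 10, 10]),  # higher mean, critics divided
    ("steady soup", [8, 9, 8, 8, 9, 8]),       # slightly lower mean, calm room
]
print(risk_constrained_pick(menu, risk_budget=1.0)[0])  # → "steady soup"
```

Unlike the penalty version, the budget is interpretable on its own: "I will never serve a dish the critics fight about this much," regardless of how tasty it is on average.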

Why This Matters in Real Life

The authors tested this on real AI models. Here is what they found:

  1. Less Polarization: When people ask controversial questions (like politics), standard AI models often pick an answer that sounds confident but makes half the readers furious. DARC picks a calmer, more balanced answer that satisfies almost everyone.
    • Analogy: Instead of a politician shouting a slogan that makes half the crowd cheer and the other half boo, DARC picks the moderate policy that keeps the whole room happy.
  2. Fewer "Hallucinations": Sometimes AI makes things up. If the AI makes up a song lyric, some critics might think it's creative, while others think it's a lie. DARC sees this high disagreement and avoids the risky, made-up answer, choosing a truthful "I don't know" instead.
  3. No Retraining Needed: This is the best part. You don't need to teach the AI a new way of thinking (which takes months and millions of dollars). You just change the selection rule at the very end. It's like changing the menu board, not the kitchen.

The "Proxy" Trick

You might ask: "How does DARC know the critics disagree if we haven't asked 50 people yet?"

DARC uses a clever trick called a Proxy. Instead of waiting for 50 humans to taste the soup, the AI asks itself: "If I slightly change the wording of this answer (like adding a comma or changing a word), does the 'taste score' change wildly?"

  • If a tiny change makes the score jump from 2 to 10, the AI knows the answer is fragile and controversial.
  • If the score stays steady, the answer is robust.

DARC uses this "fragility" as a stand-in for human disagreement. It's like a sommelier shaking a bottle of wine; if it foams and splatters everywhere, they know it's unstable and won't serve it.
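The bottle-shaking test can be sketched as follows: score a response, score a few small rewordings of it, and use the spread of those scores as the fragility estimate. Everything here is a toy stand-in; the real reward model, the specific perturbations, and how the paper combines the scores are assumptions for illustration:

```python
import statistics

def fragility(response, reward_fn, perturb_fns):
    """Proxy for annotator disagreement: score small rewordings of a
    response and measure how much the score swings. reward_fn and
    perturb_fns are toy stand-ins for a real reward model and the
    paper's actual perturbation scheme."""
    scores = [reward_fn(response)]
    scores += [reward_fn(p(response)) for p in perturb_fns]
    return statistics.pstdev(scores)

# Toy reward model: a brittle scorer that keys on one exact word.
def toy_reward(text):
    return 10.0 if "definitely" in text else 4.0

perturbs = [
    lambda t: t.replace("definitely", "probably"),  # soften one claim
    lambda t: t + ".",                              # cosmetic tweak
]

print(fragility("It is definitely true", toy_reward, perturbs))  # high: score swings
print(fragility("I am not sure", toy_reward, perturbs))          # 0.0: score is steady
```

A confident overclaim scores wildly differently once one word is softened, so it is flagged as fragile; the hedged answer's score never moves, so it is treated as robust.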

Summary

DARC is a safety net for AI. It stops the AI from chasing "average" scores that hide dangerous disagreements. It forces the AI to choose answers that are reliable and safe, ensuring that when you ask a question, you get an answer that won't accidentally offend half the people reading it.

It's the difference between a gambler hoping for a jackpot and a prudent investor building a stable portfolio.