Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback

This paper introduces a principled mathematical framework that reformulates reward modeling with Likert-scale ordinal feedback as a discrete ordinal regression problem. The resulting learnable loss functions capture the ordinal structure of human preferences natively, outperforming existing heuristic methods without ad hoc modifications.

Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh

Published 2026-03-04

Imagine you are training a robot chef to cook the perfect meal. You don't just want to know if the food is "good" or "bad." You want to know how much better one dish is than another. Is the soup "slightly better," "much better," or "absolutely divine"?

For a long time, the way we taught AI to understand human taste was very blunt. We only gave it binary choices: "This burger is better than that burger." The AI learned to pick a winner, but it didn't understand the nuance of the victory. It was like a judge in a talent show who only says "Pass" or "Fail," never giving a score of 1 to 10.

This paper, "Beyond Binary Preferences," proposes a smarter, more mathematical way to teach AI using those detailed "1-to-10" scores (called Likert scales).

Here is the breakdown of their idea using simple analogies:

1. The Problem: The "Binary Blindfold"

Current AI training methods (like the famous Bradley-Terry model) are like a binary blindfold. They force the AI to see the world in black and white.

  • The Old Way: If you tell the AI, "This answer is significantly better," the AI treats it exactly the same as "This answer is slightly better." It just learns "Winner vs. Loser."
  • The Heuristic Hacks: To fix this, researchers tried to "patch" the system with manual rules. They'd say, "If the human says 'significantly better,' multiply the reward by 3." But this is like trying to tune a radio by guessing the numbers on the dial. It's messy, requires constant manual tweaking, and doesn't actually understand why the signal is stronger.

2. The Solution: The "Ordinal Regression" Ladder

The authors suggest treating human feedback not as a simple "Win/Loss" coin flip, but as a ladder.

Imagine a ladder with rungs labeled:

  • Negligibly Better
  • Slightly Better
  • Moderately Better
  • Significantly Better

The AI's job is to learn where to place the "rungs" (thresholds) on a continuous scale of quality.

  • The Magic: Instead of us humans guessing where to put the rungs (e.g., "Is 'Significantly Better' at 8.5 or 9.0?"), the paper's framework lets the AI learn the rungs automatically from the data.
  • The Result: The AI learns that a "Significantly Better" answer needs to be much higher on the quality ladder than a "Slightly Better" one. It discovers the natural gaps between these feelings.
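To make the "ladder" concrete, here is a minimal sketch of how learnable rungs could be parameterized so they always stay in order. The paper's exact parameterization may differ; the function names and the softplus trick here are illustrative assumptions, not the authors' implementation.

```python
import math

def softplus(x):
    # smooth, always-positive function: log(1 + e^x) > 0 for any x
    return math.log1p(math.exp(x))

def cutpoints(raw_gaps, base):
    """Build strictly increasing thresholds ("rungs") from
    unconstrained parameters: each rung sits a positive gap
    above the previous one, so their order can never flip
    during training."""
    thetas = []
    t = base
    for g in raw_gaps:
        t += softplus(g)  # gap is positive no matter what g is
        thetas.append(t)
    return thetas

# three unconstrained gap parameters -> three ordered rungs
print(cutpoints([0.0, -1.0, 2.0], base=-1.0))
```

Because the raw gaps are unconstrained real numbers, a gradient-based optimizer can move them freely while the rungs themselves remain sorted, which is exactly what "learning where the rungs go" requires.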

3. Two Ways to Climb the Ladder

The paper proposes two mathematical "climbing strategies" to teach the AI how to place these rungs:

  • Strategy A: The Probabilistic Map (NLL)
    Think of this as drawing a map of probability. The AI asks, "What is the chance that this answer falls into the 'Significantly Better' bucket?" It learns to adjust the map so the probabilities match what humans actually said. It's like learning the weather patterns to predict if it will rain "a little" or "a lot."
  • Strategy B: The Margin Guard (All-Threshold)
    Think of this as a security guard at a building with different floors. The guard ensures that a "Slightly Better" answer doesn't accidentally get on the "Significantly Better" floor. If the answer is too low, the guard pushes it up; if it's too high, the guard pushes it down. The AI learns exactly how hard to push to get the answer to the right floor.
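The two strategies above can be sketched in code. This is a simplified illustration under common ordinal-regression conventions (a logistic link for Strategy A, unit-margin hinges for Strategy B), not the paper's exact loss functions; `d` stands for the reward difference between two answers, and the label indexing is an assumption for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nll_loss(d, label, thetas):
    """Strategy A (probabilistic map): negative log-likelihood of the
    observed ordinal label. Label k means the reward difference d
    falls between rung k-1 and rung k (with -inf and +inf at the ends)."""
    upper = sigmoid(thetas[label] - d) if label < len(thetas) else 1.0
    lower = sigmoid(thetas[label - 1] - d) if label > 0 else 0.0
    return -math.log(upper - lower)

def all_threshold_loss(d, label, thetas):
    """Strategy B (margin guard): a hinge penalty at every rung.
    Rungs below the true bucket must sit under d (push d up if not);
    rungs at or above it must sit over d (push d down if not)."""
    loss = 0.0
    for k, t in enumerate(thetas):
        if k < label:      # d should clear this rung by a margin of 1
            loss += max(0.0, 1.0 - (d - t))
        else:              # d should stay below this rung by a margin of 1
            loss += max(0.0, 1.0 + (d - t))
    return loss

rungs = [-1.0, 0.0, 1.5]                 # 3 rungs -> 4 ordinal buckets
print(nll_loss(2.5, 3, rungs))           # top bucket: small loss
print(all_threshold_loss(2.5, 3, rungs)) # d clears every rung: zero loss
```

Note the division of labor: the NLL loss shapes a full probability distribution over buckets, while the all-threshold loss only cares about which side of each rung the answer lands on, with a safety margin.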

4. Why This Matters: The "Error Severity" Test

The most exciting part of the paper is what happens when the AI makes a mistake.

  • Old AI (Binary): When it gets it wrong, it might be confidently wrong. It might say, "This terrible answer is definitely the winner!" with a huge gap in its confidence. This is dangerous because it can lead the AI to make bad decisions later.
  • New AI (Ordinal): When this AI gets it wrong, it is hesitant. If it picks the wrong answer, the gap between the "winner" and "loser" is tiny. It's like a judge who says, "I'm not sure, but I picked A," rather than "A is clearly the best!"
    • Analogy: The old AI is a loud, overconfident driver who swerves into the wrong lane. The new AI is a cautious driver who barely nudges the steering wheel if they are unsure.

5. The "Symmetry" Discovery

The researchers also found something beautiful: Human preferences are often symmetric.

  • If you think Answer A is "Much Better" than Answer B, you should feel the same way when you look at it the other way around: Answer B is "Much Worse" than Answer A.
  • The paper's framework discovered that by forcing this symmetry (making the ladder look the same going up and down), the AI actually learned faster and performed better. It's like realizing that a mirror image helps you understand the shape of an object better.
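The symmetry idea can be checked numerically: if the rungs mirror around zero, then swapping the two answers simply reverses the bucket probabilities. The sketch below assumes a logistic link, as in the probabilistic strategy above; the specific rung values are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bucket_probs(d, thetas):
    """Probability of each ordinal bucket for reward difference d,
    via differences of cumulative probabilities at each rung."""
    cums = [sigmoid(t - d) for t in thetas] + [1.0]
    probs, prev = [], 0.0
    for c in cums:
        probs.append(c - prev)
        prev = c
    return probs

# symmetric rungs: theta_{-k} = -theta_k, mirrored around zero
rungs = [-1.5, -0.5, 0.5, 1.5]

p_forward = bucket_probs(0.8, rungs)    # A vs. B: difference +0.8
p_reverse = bucket_probs(-0.8, rungs)   # B vs. A: difference -0.8
# mirroring the comparison exactly reverses the bucket probabilities
assert all(abs(a - b) < 1e-12
           for a, b in zip(p_forward, reversed(p_reverse)))
```

Enforcing the mirrored rungs halves the number of threshold parameters to learn, which is one plausible reason a symmetric ladder can train faster.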

The Bottom Line

This paper moves us away from "guessing" how to handle detailed human feedback and gives us a principled, mathematical foundation for it.

Instead of forcing human feelings into a rigid "Yes/No" box, we are finally giving the AI a thermometer instead of a light switch. It can now feel the temperature of our preferences, leading to smarter, safer, and more aligned AI systems that understand the difference between "good" and "great."
