Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

This paper introduces a preference learning framework grounded in social choice theory. It aligns policies with the true population distribution of evaluator preferences by inferring the set of feasible preference distributions from pairwise comparison data, satisfies axioms of monotonicity, Pareto efficiency, population-proportional alignment, and bounded manipulability, and offers a soft-max relaxation that balances proportional alignment with Condorcet winner selection.

Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

Published 2026-03-03

Imagine you are the mayor of a bustling city, and you need to decide on the city's new "vibe." You ask 1,000 residents what they think.

  • Group A (49%) loves jazz.
  • Group B (49%) loves heavy metal.
  • Group C (2%) loves opera.

In the past, if you asked an AI aligned with standard methods (like RLHF or NLHF) to figure out the best music policy, it would look at the data and say, "Jazz and Metal are almost tied, but Jazz has a tiny edge. So, we will ban Metal and Opera and play only Jazz."

This feels wrong, right? 51% of the city (the metal and opera fans) is unhappy. The AI ignored the nuance and the minority groups because it was trying to find a single "winner."

This paper, "Beyond RLHF and NLHF," proposes a smarter way to listen to the crowd. It introduces a new framework called Population-Proportional Alignment (PPA).

Here is the simple breakdown of how it works, using some everyday analogies:

1. The Problem: The "Tyranny of the Tiny Margin"

Current AI alignment methods (RLHF and NLHF) are like a referee in a soccer game who only cares about the final score. If Team A wins 1-0, the referee declares Team A the only winner and ignores the fact that the game was incredibly close and that 49% of the fans were screaming for Team B.

In the AI world, this leads to policies that are biased toward the largest group or the group that happens to have a slight statistical advantage, completely silencing smaller but significant viewpoints. It also makes the AI fragile; if a few people lie about their preferences, the AI might flip-flop wildly.

2. The Solution: The "Fair Pie" Approach

The authors suggest that instead of picking one "winner," the AI should act like a fair baker.

If 49% of people want Jazz, 49% want Metal, and 2% want Opera, the AI shouldn't pick one. It should bake a pie where:

  • 49% of the pie is Jazz.
  • 49% of the pie is Metal.
  • 2% of the pie is Opera.

This is Population-Proportional Alignment. The AI's output (the policy) should reflect the true distribution of the population's desires, not just the "winning" desire.
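The idea can be sketched in a few lines. This is an illustrative toy, not the paper's construction: it assumes the group proportions are already known, and the function name is invented for the example.

```python
# Toy sketch of a population-proportional policy (illustrative, not the
# paper's exact construction): given each group's preferred option and the
# fraction of the population in that group, the policy samples options in
# proportion to group sizes instead of always playing the "winner."
import random

def proportional_policy(group_shares):
    """group_shares: dict mapping option -> population fraction (sums to 1)."""
    options = list(group_shares)
    weights = [group_shares[o] for o in options]
    return lambda: random.choices(options, weights=weights, k=1)[0]

shares = {"jazz": 0.49, "metal": 0.49, "opera": 0.02}
policy = proportional_policy(shares)

# Sampling many actions recovers the population mix (approximately).
counts = {o: 0 for o in shares}
for _ in range(10_000):
    counts[policy()] += 1
print({o: round(c / 10_000, 2) for o, c in counts.items()})
```

Over many samples the policy's output frequencies match the pie: roughly 49% jazz, 49% metal, 2% opera.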

3. The Magic Trick: Reading Minds from Whispers

Here is the tricky part: In the real world, we rarely get to ask people, "What is your exact ranking of all options?" (That's too much work). We usually only get pairwise comparisons: "Do you prefer Jazz or Metal?" "Do you prefer Metal or Opera?"

The paper's biggest breakthrough is a mathematical trick. The authors realized that even if you only have these "whispers" (pairwise comparisons), you can mathematically deduce the range of possible population groups.

Think of it like a detective looking at footprints. You don't see the people, but you see the size of the footprints. You can't know exactly who made them, but you can figure out: "There must be at least 20% of people wearing size 10 shoes, and no more than 40%."

The AI uses this logic to estimate the feasible set of population groups. It doesn't guess the exact numbers; it calculates the safest possible distribution that fits the data.
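Here is a deliberately simplified version of that deduction. It assumes (hypothetically, for this toy only) that there are three groups whose full rankings are known: A ranks jazz > metal > opera, B ranks metal > jazz > opera, and C ranks opera > jazz > metal. Under that assumption every pairwise win rate is a linear function of the group shares, so observed win rates pin the shares down; in the paper's general setting the rankings are unknown, so you get a feasible *set* of distributions rather than one exact answer.

```python
# Toy inference of group shares from pairwise win rates (illustrative).
# Assumed hypothetical rankings:
#   A: jazz > metal > opera,  B: metal > jazz > opera,  C: opera > jazz > metal
# Then:  P(jazz beats metal) = pA + pC   (A and C prefer jazz to metal)
#        P(jazz beats opera) = pA + pB   (A and B prefer jazz to opera)
#        pA + pB + pC        = 1
def infer_shares(win_jazz_vs_metal, win_jazz_vs_opera):
    pC = 1.0 - win_jazz_vs_opera       # only group C prefers opera to jazz
    pA = win_jazz_vs_metal - pC        # A and C prefer jazz to metal
    pB = 1.0 - win_jazz_vs_metal       # only group B prefers metal to jazz
    return {"A": round(pA, 4), "B": round(pB, 4), "C": round(pC, 4)}

# Win rates generated by the true mix (49% / 49% / 2%):
shares = infer_shares(win_jazz_vs_metal=0.51, win_jazz_vs_opera=0.98)
print(shares)
```

Two win rates plus the "shares sum to 1" constraint are enough to recover (0.49, 0.49, 0.02) exactly in this toy; with unknown rankings the same linear reasoning yields upper and lower bounds, which is the "footprints" logic above.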

4. The "Anti-Cheating" Shield

The paper also introduces a rule called Population-Bounded Manipulability (PBM).

Imagine a loud group of 10% of the population tries to cheat. They all lie and say, "We actually love Opera! We are 90% of the city!"

  • Old AI: Might get tricked and switch to playing only Opera.
  • New AI: The math acts as a shield. It knows, "Even if you lie, you can't make the data look like you are 90% of the population if your footprints don't match." The AI limits how much influence a lying group can have. It ensures that a group can only get what they deserve based on their true size, not their fake size.
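A small numerical demonstration makes the shield concrete. This is an illustrative toy, not the paper's proof: it assumes three groups with known hypothetical rankings (A: jazz > metal > opera, B: metal > jazz > opera, C: opera > jazz > metal), under which win rates are linear in group shares. A lying group can shift each observed win rate by at most its own true size, so its inferred share moves by a bounded amount.

```python
# Toy demonstration of bounded manipulability (illustrative, not the
# paper's theorem). Rankings assumed as described in the lead-in.
def infer_shares(w_jazz_metal, w_jazz_opera):
    pC = 1.0 - w_jazz_opera        # only group C prefers opera to jazz
    pA = w_jazz_metal - pC         # groups A and C prefer jazz to metal
    pB = 1.0 - w_jazz_metal        # only group B prefers metal to jazz
    return {"A": round(pA, 4), "B": round(pB, 4), "C": round(pC, 4)}

# Honest reports from the true mix (A=49%, B=49%, C=2%):
honest = infer_shares(w_jazz_metal=0.51, w_jazz_opera=0.98)

# Now 10% of the population (metal fans) lie and answer every comparison
# as if they were opera fans (opera > jazz > metal):
liars = 0.10
w_jm = 0.49 + 0.02 + liars         # liars still rank jazz above metal
w_jo = 0.49 + (0.49 - liars)       # liars now rank opera above jazz
rigged = infer_shares(w_jm, w_jo)

print(honest)   # close to the true mix
print(rigged)   # opera's inferred share rises only by the liars' true size
assert rigged["C"] <= honest["C"] + liars + 1e-9
```

The 10% of liars push opera's inferred share from 2% to 12%, never anywhere near the 90% they claim: their influence is capped by their true footprint in the data.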

5. The "Slider" for Control

The authors created a "slider" (a parameter called β) that lets you decide how strict you want to be.

  • Slider at one end: The AI is super fair and strictly proportional (giving everyone a slice of the pie), even if it means the "winner" doesn't get 100% of the attention.
  • Slider at the other end: The AI behaves like a traditional winner-take-all method, picking only the single "Condorcet winner" (the option that beats every other option in head-to-head matchups).
  • In the middle: You get a smooth mix. You can tune the AI to be 80% fair to the population and 20% focused on the clear winner.
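One plausible form of such a slider, sketched below, is a soft-max blend: weight each option by its population share times an exponential of a head-to-head score. This is an illustration of the interpolation idea, not necessarily the paper's exact relaxation, and the score values are made-up numbers consistent with the running example.

```python
# Illustrative beta-slider between proportional mixing and winner-take-all.
# beta = 0: exactly proportional; large beta: probability mass concentrates
# on the highest-scoring option (the head-to-head winner).
import math

def blended_policy(shares, scores, beta):
    w = {a: shares[a] * math.exp(beta * scores[a]) for a in shares}
    z = sum(w.values())
    return {a: round(w[a] / z, 4) for a in w}

shares = {"jazz": 0.49, "metal": 0.49, "opera": 0.02}
# score(a) = average pairwise win rate (a Borda-style proxy), using the
# running example's win rates: jazz (0.51 + 0.98)/2, metal (0.49 + 0.98)/2,
# opera (0.02 + 0.02)/2.
scores = {"jazz": 0.745, "metal": 0.735, "opera": 0.02}

print(blended_policy(shares, scores, beta=0))     # proportional mix
print(blended_policy(shares, scores, beta=500))   # nearly all jazz
```

At β = 0 the policy is the fair pie (49/49/2); as β grows, mass flows toward jazz, the option that narrowly wins every head-to-head.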

Why Does This Matter?

  • For LLMs (Chatbots): Instead of a chatbot that only sounds like the "average" user (ignoring niche experts or minority cultures), it can be tuned to reflect the diversity of its users.
  • For Recommendation Systems: Instead of showing everyone the same "most popular" movie, it can show a mix that respects the different tastes of different user groups.
  • For Democracy: It offers a mathematical way to ensure that minority voices aren't just "noise" but are represented in the final decision proportionally.

The Bottom Line

This paper is about moving AI from being a Dictator (who picks one winner based on a tiny margin) to being a Fair Mediator (who ensures the final decision reflects the true mix of the crowd). It uses advanced math to listen to the "whispers" of pairwise comparisons and reconstructs a fair picture of who is actually in the room, ensuring that no group gets silenced just because they are slightly smaller than the rest.
