Imagine you are a chef running a massive, high-tech restaurant. You have thousands of customers, and your goal is to cook dishes that everyone loves.
The Problem: The "One-Size-Fits-All" Menu
In the past, your restaurant used a standard recipe book (training methods called RLHF and GRPO in the tech world). The rule was simple: "Cook what the majority of customers seem to like."
- The Scenario: You have a group of 100 people sitting at a table. 90 of them love spicy food, and 10 of them hate it and prefer mild, bland food.
- The Mistake: Your standard recipe book looks at the whole table, averages their feedback, and decides: "Okay, the average person likes it slightly spicy." So, you cook medium-spicy food for everyone.
- The Result: The 90 spicy-lovers are happy, but the 10 mild-lovers are miserable. Worse, because the 90 spicy-lovers are so loud, the 10 mild-lovers' complaints get drowned out. Over time, your kitchen stops trying to cook mild food entirely because the "average" feedback says it's not popular.
In the world of AI, this means the AI learns to be great at what the majority of users want, but it becomes terrible at understanding the unique, quiet, or minority preferences of individual users. It creates a "one-size-fits-all" personality that feels generic and sometimes frustrating.
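The averaging failure can be made concrete with a toy calculation. This is purely illustrative and not from the paper: the head counts come from the restaurant analogy above, and the reward function (preference satisfaction falls off with squared distance from each diner's ideal spiciness) is an invented assumption.

```python
# Toy model of majority-averaged feedback. 90 diners prefer spiciness 1.0,
# 10 prefer 0.0; each diner's reward drops with the squared distance
# between the dish and their ideal. All numbers are hypothetical.

def average_reward(dish_spiciness: float) -> float:
    """Mean reward over 100 diners with conflicting preferences."""
    spicy_fans, mild_fans = 90, 10
    r_spicy = 1.0 - (dish_spiciness - 1.0) ** 2  # reward for a spicy-lover
    r_mild = 1.0 - (dish_spiciness - 0.0) ** 2   # reward for a mild-lover
    return (spicy_fans * r_spicy + mild_fans * r_mild) / 100

# Searching a grid of spiciness levels, the average-reward optimum lands
# near the majority's taste, leaving the mild-lovers far from what they want.
best = max([i / 10 for i in range(11)], key=average_reward)
```

Running this, the best "average" dish sits at spiciness 0.9: close to the majority's ideal, while a mild-lover's reward at that dish is only 0.19. Optimizing one shared average systematically sacrifices the minority.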
The Solution: Personalized GRPO (P-GRPO)
The authors of this paper, Jialu Wang and his team at Apple, realized that treating everyone the same is a flaw. They invented a new way to train AI called Personalized Group Relative Policy Optimization (P-GRPO).
Think of P-GRPO as hiring a personal sommelier (wine expert) for every single customer instead of just one head chef for the whole room.
How It Works (The Analogy)
- Grouping the Guests: Instead of looking at the whole table, the AI first figures out which "club" or "group" a user belongs to. Maybe User A is a "Jazz Lover" and User B is a "Metalhead."
- The Old Way (Standard GRPO): The AI asks, "How good was this song compared to the other songs we just played for this whole group?"
- Problem: If the Jazz group is small and the Metal group is huge, the Jazz songs get judged against the loud Metal songs. The Jazz songs look "bad" by comparison, even if the Jazz lover loved them.
- The New Way (P-GRPO): The AI asks, "How good was this song compared to other songs this specific Jazz lover has heard before?"
- The Magic: It keeps a private scorecard for every single user group.
- If a Jazz lover gets a song they love, the AI says, "Great! This is a 10/10 for you," even if the Metalheads would hate it.
- If a Metalhead gets a song they love, the AI says, "Great! This is a 10/10 for you," even if the Jazz lovers would hate it.
By comparing a user's experience only against their own history, the AI stops trying to please everyone at once. It learns that "good" is different for different people.
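A minimal sketch of that normalization difference, under stated assumptions: this is not the paper's exact algorithm, and the batch of (user group, reward) pairs and the reward values are invented. It only shows how pooling rewards across groups versus computing a separate baseline per group changes which responses look "good."

```python
# Sketch: group-relative advantages with a pooled baseline vs. a
# per-user-group baseline. Rewards and group labels are made up.
from statistics import mean, pstdev

def advantages(rewards):
    """Advantage = how far each reward sits above/below its group's mean,
    scaled by the group's standard deviation (guarding against zero spread)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# One sampled batch: metal fans rate their songs on a loud 7-9 scale;
# the jazz fan's favorite song earns a 3.0 on a much quieter scale.
batch = [("metal", 8.0), ("metal", 9.0), ("metal", 7.0),
         ("jazz", 3.0), ("jazz", 1.0)]

# Old way: one shared baseline. The jazz fan's favorite (3.0) lands below
# the pooled mean, so it gets a negative advantage and is trained away.
pooled = advantages([r for _, r in batch])

# New way: a separate baseline per user group, so each song is judged
# only against that group's own reward scale.
per_group = {}
for group in {g for g, _ in batch}:
    per_group[group] = advantages([r for g, r in batch if g == group])
```

With the pooled baseline, the jazz fan's favorite song is penalized; with per-group baselines, the same song gets a positive advantage within the jazz group, so the model is rewarded for serving that minority taste.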
Why This Matters
The paper shows that this new method is like giving the AI a pair of specialized glasses for every type of user.
- Faster Learning: The AI learns what specific users want much quicker because it's not confused by conflicting signals from other groups.
- Fairness: The "quiet" minority groups (like the mild-food lovers) finally get heard. The AI doesn't just ignore them to please the loud majority.
- No Loss of Smarts: The authors tested this and found that the AI didn't get "dumber" at general tasks (like math or logic) just because it learned to be more personal. It kept its general brain power while gaining a personal touch.
The Bottom Line
Current AI is like a generic radio station playing the same top 40 hits for everyone. P-GRPO turns that radio into a smart streaming service that knows exactly what you like, remembers your taste, and curates a playlist just for you, without losing the ability to play a great song for anyone else.
It's a shift from "What does the crowd want?" to "What do you want?", done in a way that makes the AI smarter and fairer for everyone.