The Big Problem: The "Fickle Chef"
Imagine you go to a restaurant and order the exact same dish, "The Special," but you ask the waiter two slightly different ways:
- "Can I get the Special?"
- "I'd like to order the Special, please."
In a perfect world, you get the exact same plate of food both times. But in the current world of AI (Large Language Models), the AI is like a fickle chef. If you ask the first way, you might get a steak. If you ask the second way, you might get a salad. Even though you asked for the same thing, the AI changes its mind based on tiny differences in how you phrased the question.
This is a huge problem for businesses. Imagine a bank chatbot telling one customer, "You can't get a loan," and telling another customer (who asked the exact same question but used different words), "Yes, you can!" This creates confusion, breaks trust, and can even get the company sued.
The Old Solutions: Why They Didn't Work
Scientists tried a few things to fix this:
- The "Retrieval" Method (RAG): This is like giving the chef a cookbook and saying, "Only cook what's in this book." It helps, but the chef can still interpret the instructions differently depending on how you ask.
- The "Temperature" Method: This is like telling the chef, "Stop being creative and just follow the recipe exactly." It makes the chef less random, but it doesn't guarantee they will give you the same dish if you ask a slightly different question.
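To see why low temperature alone doesn't solve the problem, here is a toy sketch. The logits (raw model scores) are entirely made up for illustration; the point is that temperature sharpens each distribution but can't make two different phrasings agree if they produce different scores in the first place.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw model scores into probabilities. Lower temperature
    sharpens the distribution toward the single highest-scoring option."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate answers, under two
# slightly different phrasings of the SAME question.
logits_phrasing_a = [2.0, 1.0, 0.5]
logits_phrasing_b = [1.0, 2.2, 0.5]  # tiny wording change shifts the scores

cold_a = softmax_with_temperature(logits_phrasing_a, 0.1)
cold_b = softmax_with_temperature(logits_phrasing_b, 0.1)

# At temperature 0.1 each distribution is near-deterministic,
# but the two phrasings still pick DIFFERENT top answers --
# which is why turning down temperature can't fix inconsistency.
```

In other words, temperature controls randomness *within* one question, not agreement *across* rephrasings of the same question.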
The New Solution: The "Group Training" (GRPO)
The authors of this paper apply a training method called Group Relative Policy Optimization (GRPO), adapting it to reward consistency.
Think of GRPO as a new kind of coach for the AI. Instead of teaching the AI one question at a time, the coach brings in a whole group of people asking the same question in different accents and dialects.
Here is how the training works:
- The Group: The coach gathers 6 people. One says, "I'm a boy looking for a job." Another says, "I'm a girl looking for a job." Another says, "I'm a man seeking employment." They are all asking for the same thing.
- The Test: The AI answers all 6 of them.
- The Critique: The coach looks at the answers.
- Bad AI: Gives the boy a list of construction jobs and the girl a list of nursing jobs. The coach says, "No! You are being inconsistent! You are letting the gender change the advice."
- Good AI: Gives everyone the exact same list of high-paying, skilled jobs. The coach says, "Perfect! You treated the group fairly."
- The Reward: The AI gets a "gold star" (a reward) only if all 6 answers are consistent with each other and actually helpful. If the answers drift apart, the AI loses points.
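The "group" scoring idea above can be sketched in a few lines. The reward values here are invented for illustration (the paper's actual reward function is not shown in this summary); what the sketch demonstrates is GRPO's core mechanic of scoring each answer relative to its own group average.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core trick: normalize each answer's reward against its
    own group. Better-than-average answers get a positive advantage
    (reinforced); worse-than-average answers get a negative one."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 6 paraphrases of the same question:
# high when the answer is helpful AND agrees with the rest of the group.
rewards = [0.9, 0.9, 0.85, 0.9, 0.3, 0.9]  # answer #5 drifted apart

advantages = group_relative_advantages(rewards)
# The drifted answer (index 4) ends up with a strongly negative
# advantage, so training pushes the model away from that behavior,
# while the consistent answers are reinforced.
```

Because the advantages are computed within each group, the model is never just rewarded for "a good answer" in isolation; it is rewarded for answers that hold up across all the rephrasings at once.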
The Secret Sauce: "Entropy"
The paper uses a math concept called Entropy to measure this. Think of entropy as a measure of "information richness" or "how much detail is in the answer."
The AI is trained to do two things at once:
- Be Helpful: Don't give short, boring answers. Give rich, detailed advice.
- Be Stable: Make sure the amount and type of detail is the same for everyone in the group.
If the AI gives a long, detailed answer to the "boy" but a short, vague answer to the "girl," it fails the stability test. The goal is to make the AI realize that the core information shouldn't change just because the words changed.
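The stability test can be illustrated with a toy entropy measure. The paper's exact formulation isn't given in this summary, so this sketch uses simple word-level Shannon entropy as a stand-in for "information richness," and the example answers are made up.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy of the word distribution in an answer: a rough proxy
    for 'information richness'. Short, repetitive answers score low;
    varied, detailed answers score high."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_gap(answers):
    """Stability check: how far apart are the richness scores within
    a group of answers to paraphrases of the same question?"""
    ents = [shannon_entropy(a) for a in answers]
    return max(ents) - min(ents)

detailed = ("consider software engineering data analysis nursing "
            "teaching and skilled trades with strong demand")
vague = "maybe look for a job"

# A detailed answer paired with a vague one has a large entropy gap,
# so this group would fail the stability test and lose reward.
# Identical detailed answers have a gap of zero and pass.
```

A low entropy gap across the group signals the model is giving everyone the same depth of information, regardless of how the question was phrased.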
The Results: What Happened?
The researchers tested this on a small AI model (Llama-3) using questions about jobs and investments.
- Before Training: When they asked, "What jobs should a woman look for?" vs. "What jobs should a man look for?", the AI gave very different lists. It was biased and inconsistent.
- After Training (GRPO): The AI started giving the exact same high-quality, detailed advice to both men and women. The "gap" between the answers disappeared.
Why Does This Matter?
This paper is important because it treats inconsistency not as a cool feature of AI (like "creative diversity"), but as a bug that needs to be fixed.
In the real world, we don't want our AI to be "creative" when it comes to rules, laws, or financial advice. We want it to be a reliable robot. If you ask a human lawyer the same legal question in two different ways, they should give you the same answer. This paper teaches AI to do the same thing.
In a nutshell: The authors taught AI to stop being a mood ring that changes its answer based on how you ask, and start being a reliable encyclopedia that gives the same truth, no matter who is asking or how they phrase it.