Partial Policy Gradients for RL in LLMs

The Big Problem: The "Forgetful Actor"

Imagine you hire an actor to play a specific character (a "persona") in a long movie. Let's say the character is a grumpy old baker who loves cats and hates coffee.

In the first scene, the actor is perfect. They grumble about the coffee and pet the cat. But as the movie gets longer (say, 60 scenes long), the actor starts to drift.

Scene 30: They accidentally say, "I love a good espresso!"
Scene 40: They forget they have a cat and say, "I'm allergic to fur!"
Scene 50: They are suddenly a cheerful barista who loves coffee.

This is called Persona Drift. Large Language Models (LLMs) are great at acting, but they tend to forget their character's backstory as the conversation gets longer.

The Current Solution: The "All-or-Nothing" Approach

Currently, the most popular way to train these actors is like a director who only gives feedback at the very end of the movie.

The Director says: "Great job on the whole movie! But you forgot you hated coffee in Scene 40, so the whole performance gets a low score."
The Actor's reaction: "Oh no! I have to remember every single thing I said from Scene 1 to Scene 60 to get a good score."

This is overwhelming. The actor gets confused, tries to fix one mistake, creates a new one, and ends up oscillating (swinging back and forth) between being the grumpy baker and the cheerful barista. They can't handle the pressure of remembering the entire history at once.

The Paper's Solution: "Partial" Feedback

The authors of this paper propose a smarter way to direct the actor. Instead of judging the whole movie at once, they break the feedback down into smaller chunks.

They call this Partial Policy Gradients. Think of it as a director who gives feedback based on how far ahead the actor can see:

1. The "Greedy" Actor (Looking 1 Step Ahead)

The Strategy: "Just make sure this line is perfect. Don't worry about what happens next."
The Result: The actor is very consistent in the moment but gets confused quickly. They might say, "I hate coffee," then immediately say, "But I love a latte," because they are only thinking about the next sentence. They keep flipping back and forth (oscillating).
Analogy: Like a person who only looks at the road directly in front of their car. They avoid the pothole right now but might drive off a cliff in 10 seconds.

2. The "Full Planner" (Looking All the Way to the End)

The Strategy: "Remember every single line from the beginning to the end. The whole movie must make sense."
The Result: This works great for complex stories (like a math tutoring session) where you need to build a long argument. But for casual chatting, it's too much pressure. The actor gets so stressed trying to remember everything that they freeze up or make huge, unrealistic jumps in personality.
Analogy: Like a chess grandmaster trying to calculate every possible move for the next 50 turns. It's too much data, and they make mistakes because they are overwhelmed.

3. The "K-Step Lookahead" (The Sweet Spot)

The Strategy: "Look about 2 to 3 steps ahead. Make sure your next few lines fit together, but don't stress about the whole movie yet."
The Result: This is the Goldilocks zone.
- If you are Chatting (casual), looking 2 steps ahead is perfect. It keeps the conversation flowing naturally without getting bogged down in deep planning.
- If you are in Therapy, looking 3 steps ahead works best. It allows the actor to handle emotional nuances without over-planning a fake "happy ending."
- If you are Teaching, looking all the way ahead (Full Planning) is best because you need to build a long lesson plan.

Why This Matters: The "Data Budget" Analogy

The paper also found something cool: the best strategy depends on how much data you have to train the actor.

Low Data (You only have 50 practice scripts): You need the Greedy actor. They are simple and easy to learn quickly. If you try to teach the "Full Planner" with only 50 scripts, they will get confused and fail completely.
High Data (You have 5,000 practice scripts): Now you can teach the Full Planner. With enough examples, they can learn the complex rules of the whole movie.

The Lesson: Don't try to teach a complex strategy to a student who hasn't seen enough examples yet. Start simple, then get more complex as they learn more.

Summary: The "Lookahead" Dial

The paper introduces a simple "dial" (called K) that controls how far into the future the AI looks when it learns.

Turn the dial to 1 (Greedy): Good for quick, simple tasks with little data.
Turn the dial to 2 or 3 (K-Step): The sweet spot for keeping a character consistent in long conversations (like therapy or chatting).
Turn the dial to Max (Full Planning): Best for complex, long-term goals (like education) if you have lots of data.

The Bottom Line: By telling the AI to only worry about the next few steps instead of the entire future, we can train it to be a much more consistent, reliable, and human-like character, especially in long conversations. It's the difference between an actor who forgets their lines and one who stays in character until the final curtain call.

1. Problem Statement

Reinforcement Learning (RL) is widely used to align Large Language Models (LLMs) with specific behaviors, such as maintaining a consistent human persona in role-playing dialogues. However, current approaches face significant challenges:

Persona Drift: LLMs often lose consistency over long conversations (20–60+ turns), contradicting their assigned background or previous statements.
Statistical Inefficiency: Standard policy gradient methods (like PPO or GRPO) attribute the total trajectory reward equally to all tokens. In long-horizon tasks, this creates high-variance gradient estimates, making learning unstable and data-inefficient.
Complexity vs. Data Trade-off: Full planning (considering the entire future trajectory) is theoretically optimal but requires massive amounts of data to learn reliably. Conversely, greedy approaches (optimizing only the immediate step) are data-efficient but fail to maintain long-term consistency.

The paper asks: Can we design a policy gradient framework that balances the complexity of the policy with the statistical efficiency of learning, specifically for maintaining consistent personas in LLMs?

2. Methodology: Partial Policy Gradients (PPG)

The authors propose a general framework called Partial Policy Gradients (PPG). The core idea is to optimize for a subset of future rewards rather than the total cumulative reward of the entire trajectory.

Core Concepts

Reward Decomposition: The total reward $r(x, \tau_n)$ is decomposed additively over time steps: $r(x, \tau_n) = \sum_{t=1}^n r_t$.
Partial Attribution: Instead of attributing the full future reward to every action, PPG defines a set of reward indices $R_t$ that are affected by the action at step $t$.
- The gradient update for step $t$ uses only the sum of rewards in $R_t$: $\sum_{\ell \in R_t} r_\ell$.
- This creates a "lookahead horizon" $K$, where an action at step $t$ is only responsible for the $K$ rewards in steps $t$ to $t+K-1$.
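
To make the partial attribution concrete, one way to write the resulting gradient estimate is sketched below. This is a REINFORCE-style form under assumed notation, not necessarily the paper's exact statement: $\pi_\theta$ (the policy), $a_t$ (the action at step $t$), $\tau_{t-1}$ (the trajectory prefix before step $t$), and $N$ (the number of sampled trajectories, indexed by $i$) are symbols introduced here for illustration.

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid x^{(i)}, \tau_{t-1}^{(i)}\big) \sum_{\ell \in R_t} r_\ell^{(i)}$$

The only difference from a standard reward-to-go policy gradient is the inner sum: it runs over $R_t$ rather than over all steps from $t$ to $n$.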

Policy Classes Derived from PPG

The framework unifies several policy types as instances of different $R_t$ definitions:

Full Planning (PG): $R_t = [n] \setminus [t-1]$. The action affects all future rewards. (Standard Policy Gradient).
Greedy Policy (GreedyPG): $R_t = \{t\}$. The action affects only the immediate reward.
K-Step Lookahead (K-Step-PG): $R_t = [t, t+K-1]$. The action affects the next $K$ rewards. This is the novel contribution of the paper.
Segment Policies: Rewards are attributed to specific segments of the trajectory.
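
The index sets above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names `reward_indices` and `partial_returns` are hypothetical, steps are 1-indexed to match the notation, and clipping the K-step window at the end of the trajectory is an assumption the summary does not spell out. Segment policies are omitted because their exact definition is not given here.

```python
def reward_indices(policy, t, n, k=1):
    """Return R_t, the 1-indexed reward steps credited to the action at step t.

    `policy` is one of "full", "greedy", "k_step" (hypothetical labels).
    """
    if policy == "full":      # R_t = [n] \ [t-1] = {t, ..., n}
        return range(t, n + 1)
    if policy == "greedy":    # R_t = {t}
        return range(t, t + 1)
    if policy == "k_step":    # R_t = [t, t+K-1], clipped at n (assumption)
        return range(t, min(t + k - 1, n) + 1)
    raise ValueError(f"unknown policy: {policy}")


def partial_returns(rewards, policy, k=1):
    """For each step t, sum the rewards r_l with l in R_t."""
    n = len(rewards)
    return [
        sum(rewards[l - 1] for l in reward_indices(policy, t, n, k))
        for t in range(1, n + 1)
    ]


# Made-up per-step rewards for a 5-step trajectory.
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]
print(partial_returns(rewards, "full"))         # reward-to-go for every step
print(partial_returns(rewards, "greedy"))       # just the immediate reward
print(partial_returns(rewards, "k_step", k=2))  # current reward plus the next one
```

Note that `k_step` with `k=1` reproduces the greedy policy and `k_step` with `k` at least `n` reproduces full planning, which is the sense in which $K$ acts as a single dial between the two.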

Theoretical Foundation

Statistical Efficiency: The authors prove (via Hoeffding's inequality) that optimizing for smaller subsets of rewards leads to faster concentration of the empirical gradient estimator.
Variance Reduction: By reducing the scope of the reward attribution, the variance of the gradient estimate decreases. This allows simpler policies (smaller $K$) to be learned more reliably with less data, while complex policies (large $K$) require more data to overcome higher variance.
Offline & Online: The framework supports both online learning (sampling from the current policy) and offline learning (using a fixed dataset), with specific adjustments for importance weighting in the offline setting.
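
The variance argument can be illustrated with a small simulation. This is a toy sketch under strong assumptions, not a reproduction of the paper's analysis: per-step rewards are drawn i.i.d. from a uniform distribution on [0, 1], and only the variance of the credited reward sum is measured, not that of the full gradient estimator. The function name `k_step_sum_variance` is hypothetical.

```python
import random
import statistics


def k_step_sum_variance(k, n=20, trials=5000, seed=0):
    """Empirical variance of the sum of the first k rewards in an n-step trajectory.

    Rewards are i.i.d. Uniform(0, 1), which is an assumption made for illustration.
    """
    rng = random.Random(seed)
    sums = []
    for _ in range(trials):
        rewards = [rng.random() for _ in range(n)]
        sums.append(sum(rewards[:k]))
    return statistics.pvariance(sums)


# Greedy (k=1), a short lookahead (k=5), and full planning (k=n=20).
variances = {k: k_step_sum_variance(k) for k in (1, 5, 20)}
for k, v in variances.items():
    print(f"K={k:2d}  variance of credited reward sum = {v:.3f}")
```

Under these assumptions the variance grows roughly linearly in $K$ (about $K/12$ for uniform rewards), and the range of the credited sum grows from $[0, 1]$ to $[0, K]$, which is the quantity a Hoeffding-style bound depends on. A wider range means more trajectories are needed for the same concentration.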

3. Key Contributions

General Framework: A unified formulation for policy gradients that optimizes subsets of future rewards, generalizing existing methods like greedy optimization and segment-level credit assignment.
K-Step Lookahead Policies: The first proposal and empirical evaluation of K-step lookahead policies specifically for LLMs. This allows tuning the planning horizon ($K$) to match the problem complexity.
Theoretical Analysis: Formal proof that partial policy gradients concentrate faster than full policy gradients, establishing a theoretical trade-off between policy complexity and sample efficiency.
Comprehensive Evaluation: Extensive experiments on persona-alignment tasks across four domains (Education, Therapy, Chatting, Generic) using three LLM architectures (Llama, Qwen, Gemma).

4. Experimental Results

The authors evaluated their policies on the Consistent-LLMs benchmark, measuring Persona Consistency (PC): alignment with the assigned persona and absence of contradictions.

Key Findings

Superiority over Baselines: All PPG variants outperformed the unmodified Base model and the standard PPO baseline across all domains.
Domain-Specific Optimality: The optimal lookahead horizon $K$ depends heavily on the domain:
- Education: Full Planning (PG) performed best. Educational dialogues require long-term pedagogical strategies where early learning states must connect to later skill development.
- Therapy: 3-Step Lookahead was optimal. Therapy requires balancing immediate empathy with short-term therapeutic goals; full planning led to unrealistic "over-optimization" (e.g., miraculous recoveries), while greedy approaches were too myopic.
- Chatting: 2-Step Lookahead was optimal. Casual conversations are reactive and step-wise; longer horizons introduced unnecessary complexity and instability.
Stability vs. Drift:
- Base Models: Showed monotonic degradation (drift) as conversation length increased.
- GreedyPG: Showed high-frequency oscillation ("ripples"), constantly trying to correct errors but failing to plan ahead, leading to instability.
- K-Step-PG: Maintained smooth, stable consistency across long trajectories (50+ steps).
Statistical Efficiency (Data Scaling):
- In low-data regimes (e.g., 50 trajectories), GreedyPG outperformed complex policies because it learned faster with lower variance.
- In high-data regimes (e.g., 5,000 trajectories), Full Planning (PG) eventually surpassed simpler policies, confirming the trade-off: complex policies need more data to converge.

5. Significance and Impact

Solving Persona Drift: The paper provides a practical solution to the "persona drift" problem in LLMs, showing that bounded lookahead is often superior to full planning for maintaining consistency in social interactions.
Design Principle for RLHF: It establishes a critical design principle: Calibrate policy complexity to data availability. Practitioners should use greedy or short-horizon policies when data is scarce and gradually increase the lookahead horizon ($K$) as more training data becomes available.
Generalizability: The framework is model-agnostic (works across Llama, Qwen, Gemma) and can be extended beyond persona alignment to other RL settings, such as regularized policies or GRPO.
Interpretability: The results offer a clear explanation for why certain RL algorithms fail in specific domains (e.g., over-planning in therapy vs. under-planning in education), moving beyond black-box tuning to principled algorithm selection.

In summary, this work bridges the gap between theoretical RL efficiency and practical LLM alignment by introducing Partial Policy Gradients, demonstrating that optimizing for a subset of future rewards is a powerful mechanism to stabilize long-horizon LLM behaviors.