Original authors: Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang

Published 2026-05-25. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you have a personal assistant robot. In the past, we taught these robots to be "correct." If you asked, "Plan a trip to Tokyo," the robot would learn the single, mathematically perfect itinerary that works for the average person. It would be efficient, logical, and factually accurate.

But in the real world, "correct" isn't enough. If User A is a quiet museum lover who hates walking, and User B is an energetic anime fan who loves nightlife, the "perfect" Tokyo trip for them is completely different. The same question requires two different answers.

This paper proposes a new way to train AI agents so they stop trying to be a "one-size-fits-all" expert and start becoming a true personal companion. Here is how they did it, explained simply:

1. The Problem: The "Average" Trap

Current AI training is like teaching a chef to cook a single "average" meal that everyone likes. If you ask for a spicy dish, the chef might give you something mild because they are trying to please the majority.

The Issue: Real users have unique tastes, habits, and constraints. A generic reward system (like a score for "did you finish the task?") can't tell the difference between a trip plan that is factually correct but boring to the user, versus one that is perfectly tailored to them.
The Noise: Sometimes users act in ways that don't match their true desires (maybe they bought something just because their friends did). The AI needs to figure out what the user truly wants, not just what they did.

2. The Solution: A Three-Part Toolkit

The authors built a three-part framework whose core training algorithm is PARPO (Personalized Anchor Reward-Decoupled Policy Optimization). Think of the framework as a three-step upgrade for the AI's brain:

Part A: The "Dual-Track" Coach (PARPO)

Imagine a sports coach training two athletes at the same time.

Track 1 (The Basics): The coach ensures both athletes run a perfect, safe lap. This is the General Quality reward. Did they finish the race? Did they follow the rules?
Track 2 (The Personal Style): The coach then gives specific feedback based on the athlete's style. For the sprinter, it's "go faster." For the marathon runner, it's "conserve energy." This is the Personalized Preference reward.
The Anchor: To keep things stable, the coach uses a "personal anchor" for each athlete. Instead of comparing the sprinter to the marathon runner (which is unfair), the coach compares the sprinter to their own past performance. This stops the AI from getting confused by the different "scales" of different users.

Part B: The "True Interest" Detector (Reward Model)

How does the AI know what a user actually likes versus what they just did because of peer pressure?

The paper introduces a Two-Stage Detector.
- Stage 1: It builds a profile of the user from many angles (like reading their bio, their history, and their social circle).
- Stage 2: It acts like a detective separating "True Interest" from "Conformity." It asks: "Did this user do this because they love it, or just because everyone else was doing it?" It filters out the noise to find the signal.

Part C: The "Living Library" (PSGM)

Old AI memory is like a flat pile of papers. You ask a question, and it searches the whole pile.

This paper builds a Skill Evolution Graph. Imagine a dynamic, 3D spiderweb where every node is connected.
- One node is "User A."
- It connects to "Skill: Museum Planning."
- That connects to "Scenario: Rainy Day."
- And "Tool: Ticket Booking."
When a user asks a question, the AI doesn't just search; it travels through this web to find the exact skills and tools that match that specific user's history and preferences. It's like a librarian who knows exactly which book you liked last year and suggests a similar one, rather than just handing you the best-selling book.

3. The Results: Better Than the Rest

The team tested this on three different challenges:

ETAPP: A standard test for personal assistants (planning daily tasks).
ETAPP-Hard: A tougher version with complex, multi-step problems.
SJAgent: A real-world industrial test using data from a massive Chinese e-commerce platform (helping merchants make decisions).

The Outcome:
Their new framework consistently beat the best existing methods.

It didn't just get the facts right; it got the vibe right.
It learned to be proactive (anticipating needs) and followed complex procedures better.
Crucially, it maintained high quality while adapting to individual users, proving that you don't have to sacrifice "correctness" to be "personal."

Summary Analogy

Think of the old AI as a tour guide who memorized one perfect script for Tokyo and recited it to everyone.
The new AI is a local friend who knows you personally. They know you hate walking, love anime, and are on a budget. They don't just give you a map; they design a day that feels like it was made just for you, using their memory of what you've liked before, while still making sure you actually see the sights you wanted to see.

The paper claims this is achieved by separating "doing the job right" from "doing the job the way you like," and using a smart memory system to remember exactly who you are.

Technical Summary: From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

1. Problem Definition

While Agentic Reinforcement Learning (Agentic RL) has achieved significant success in verifiable tasks with clear ground-truth answers (e.g., code generation, web navigation), it faces fundamental challenges in real-world applications where optimal behavior is user-dependent. In domains such as e-commerce assistance, travel planning, and daily scheduling, a single query (e.g., "plan a one-day trip in Tokyo") admits multiple plausible trajectories, with the preferred path determined by individual user preferences, habits, and constraints.

Existing methods typically optimize for generic objectives (overall quality, helpfulness) or perform personalization only at inference time via prompting or memory retrieval. They lack a native training-time framework to optimize policies for user-contingent trajectories. This setting introduces three core challenges:

Personalized Reward Ambiguity: Generic rewards capture task correctness but fail to express how specific users evaluate trajectories or handle heterogeneous reward scales across users.
Preference Disentanglement: Observed user behaviors are often entangled with intrinsic interests and external conformity or contextual effects, making preference signals noisy.
User-Aware Memory: Existing agent memories are often flat and query-centric, failing to model structured relations among users, intents, skills, tools, and scenarios required for personalized retrieval.

2. Methodology

The authors propose a unified Personalized Agentic RL framework that embeds personalization into the training-time optimization loop. The framework operates as a closed loop of preference identification, policy optimization, and structured skill accumulation, comprising three core components:

2.1 PARPO: Personalized Anchor Reward-Decoupled Policy Optimization

PARPO is the core policy optimization algorithm designed to handle heterogeneous user preferences.

Reward Decoupling: It separates the optimization into two tracks: a Base Track for generic task quality (correctness, logical coherence) and a Personalized Track for user-contingent preference improvement.
User-Specific Anchors: To stabilize learning under heterogeneous reward scales, PARPO maintains a persistent, user-specific anchor (running mean and variance) for personalized rewards.
Advantage Estimation:
- The Base Advantage ($A_{base}$) uses standard within-group relative normalization.
- The Personalized Advantage ($A_{pers}$) uses a user-aware baseline: $b_{u,g} = \max(\bar{R}_{pers}^{(g)}, m_u - \gamma_p \sqrt{v_u})$, where $\bar{R}_{pers}^{(g)}$ denotes the group-level mean personalized reward and $m_u$ and $v_u$ are the user's historical reward mean and variance. As written, the $\max$ acts as a floor: the baseline cannot fall more than $\gamma_p$ standard deviations below the user's historical center.
- The total advantage is a weighted sum: $A_{total} = w_{base}A_{base} + w_{pers}A_{pers}$.
Theoretical Justification: The authors prove that under heterogeneous preferences, user-aware optimization is never worse than user-agnostic optimization. They demonstrate that standard GRPO incurs structural bias due to pooled baselines and normalization, whereas PARPO reduces this bias through reward decomposition and anchor calibration.
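
The advantage computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `UserAnchor`, the exponential-moving-average update, the `momentum` value, the choice to scale the personalized advantage by the user's own standard deviation, and the default weights are all hypothetical. Only the baseline formula and the weighted sum come from the summary.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class UserAnchor:
    """Persistent per-user statistics of personalized rewards (m_u, v_u)."""
    mean: float = 0.0
    var: float = 1.0
    momentum: float = 0.9  # assumed EMA factor, not taken from the paper

    def update(self, rewards):
        # Assumed update rule: exponential moving average of mean and variance.
        batch_mean = mean(rewards)
        batch_var = pstdev(rewards) ** 2
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var


def base_advantages(r_base, eps=1e-6):
    """Within-group relative normalization of the generic quality reward."""
    mu, sigma = mean(r_base), pstdev(r_base)
    return [(r - mu) / (sigma + eps) for r in r_base]


def personalized_advantages(r_pers, anchor, gamma_p=1.0, eps=1e-6):
    """Advantage against b = max(group mean, m_u - gamma_p * sqrt(v_u))."""
    user_std = math.sqrt(anchor.var)
    baseline = max(mean(r_pers), anchor.mean - gamma_p * user_std)
    # Assumption: rescale by the user's own spread so users are comparable.
    return [(r - baseline) / (user_std + eps) for r in r_pers]


def total_advantages(r_base, r_pers, anchor, w_base=0.5, w_pers=0.5):
    """Weighted sum A_total = w_base * A_base + w_pers * A_pers per rollout."""
    a_base = base_advantages(r_base)
    a_pers = personalized_advantages(r_pers, anchor)
    return [w_base * b + w_pers * p for b, p in zip(a_base, a_pers)]


# One rollout group of four trajectories for a user whose history is strong.
anchor = UserAnchor(mean=0.8, var=0.04)
adv = total_advantages([1.0, 0.0, 1.0, 0.0], [0.2, 0.4, 0.3, 0.5], anchor)
```

In this example the group mean of the personalized rewards (0.35) lies below the floor $m_u - \gamma_p\sqrt{v_u} = 0.8 - 0.2 = 0.6$, so the baseline is 0.6 and every rollout receives a negative personalized advantage: a weak group is not rewarded merely for beating its own low average.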

2.2 Two-Stage Preference-Disentangled Reward Model

To provide clean personalized supervision, the framework employs a reward model that separates intrinsic interests from conformity.

Stage 1 (Multi-view Profile Representation): Constructs user embeddings by fusing multiple semantic views of the user profile using attention mechanisms and reconstruction losses to preserve view-specific information.
Stage 2 (Collaborative Disentanglement): Utilizes a LightGCN-based graph to propagate collaborative signals. It learns two distinct branches:
- Interest Encoder: Upweights less popular items to capture intrinsic preferences.
- Conformity Encoder: Upweights popular items to capture conformity effects.
- Orthogonality Regularization: Ensures the two branches remain distinct.
The final personalized score is a fused representation of these branches, calibrated and integrated with LLM-based evaluation.
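
To make the two branches concrete, here is a small Python sketch of popularity-dependent weighting and an orthogonality penalty. It is a simplification under loud assumptions: a one-hop weighted average of item embeddings stands in for LightGCN propagation, the power-law weights in `branch_weights` and the squared-cosine penalty are hypothetical forms, and the item names and embeddings are invented for illustration.

```python
import math


def branch_weights(item_counts, alpha=0.5):
    """Per-item weights for each branch (hypothetical power-law form).

    Interest weights shrink with popularity; conformity weights grow with it.
    """
    interest = {i: c ** -alpha for i, c in item_counts.items()}
    conformity = {i: c ** alpha for i, c in item_counts.items()}
    return interest, conformity


def aggregate(history, item_emb, weights):
    """Weighted average of the embeddings of items the user interacted with."""
    total = sum(weights[i] for i in history)
    dim = len(item_emb[history[0]])
    return [sum(weights[i] * item_emb[i][d] for i in history) / total
            for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def orthogonality_penalty(z_interest, z_conformity):
    """Assumed regularizer: squared cosine similarity, zero when orthogonal."""
    return cosine(z_interest, z_conformity) ** 2


# Toy data: one very popular item and one niche item, both in the history.
item_counts = {"blockbuster": 100, "niche_doc": 4}
item_emb = {"blockbuster": [1.0, 0.0], "niche_doc": [0.0, 1.0]}
history = ["blockbuster", "niche_doc"]

w_int, w_conf = branch_weights(item_counts)
z_int = aggregate(history, item_emb, w_int)    # leans toward the niche item
z_conf = aggregate(history, item_emb, w_conf)  # leans toward the popular item
penalty = orthogonality_penalty(z_int, z_conf)
```

With the same interaction history, the interest representation is dominated by the niche item and the conformity representation by the popular one. The penalty would be added to the training loss to push the two representations apart.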

2.3 Preference-Aligned Skill Evolution Graph Memory (PSGM)

To support personalized rollout contexts, PSGM replaces flat retrieval with a heterogeneous graph memory.

Structure: The graph nodes represent users, skills, tools, scenarios, and trajectories. Edges encode ownership, applicability, complementarity, conflict, and execution history.
Community Detection: Hierarchical community detection (Leiden/Louvain) organizes users and skills into communities to capture multi-granularity structure.
Retrieval Mechanism:
1. Semantic Initialization: Retrieves top-$K$ skills based on query similarity.
2. 2-Hop Expansion: Expands candidates from the skill to the owner user, and then to that user's sibling skills, injecting personalized local structure.
3. Graph-Aware Scoring: Ranks candidates based on query-skill similarity, user-skill similarity, community relevance, complementarity, and conflict penalties.
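
The three retrieval steps above can be sketched as follows. This is an illustrative toy under assumptions, not the PSGM implementation: the `Skill` record, the in-memory dictionaries, the scoring weights `w`, and the skill and user names are all hypothetical, and complementarity and conflict are counted only against the seed skills.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    vec: tuple       # embedding of the skill description
    owner: str       # user who owns the skill
    community: int   # community id from a prior community-detection pass


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, user_vec, user_community, skills, complements, conflicts,
             k=1, top_n=3, w=(1.0, 0.5, 0.3, 0.2, 0.5)):
    # Step 1: semantic initialization, top-K skills by query similarity.
    ranked = sorted(skills, key=lambda s: cosine(query, skills[s].vec),
                    reverse=True)
    seeds = ranked[:k]

    # Step 2: 2-hop expansion, skill -> owner user -> that user's other skills.
    owners = {skills[s].owner for s in seeds}
    candidates = set(seeds) | {s for s, sk in skills.items()
                               if sk.owner in owners}

    # Step 3: graph-aware scoring with assumed mixing weights.
    def score(s):
        sk = skills[s]
        total = w[0] * cosine(query, sk.vec) + w[1] * cosine(user_vec, sk.vec)
        total += w[2] * (sk.community == user_community)
        total += w[3] * sum(frozenset((s, t)) in complements
                            for t in seeds if t != s)
        total -= w[4] * sum(frozenset((s, t)) in conflicts
                            for t in seeds if t != s)
        return total

    return sorted(candidates, key=score, reverse=True)[:top_n]


skills = {
    "museum_planning": Skill((1.0, 0.0), "user_a", 0),
    "ticket_booking": Skill((0.6, 0.8), "user_a", 0),
    "nightlife_guide": Skill((0.0, 1.0), "user_b", 1),
}
complements = {frozenset(("museum_planning", "ticket_booking"))}
result = retrieve(query=(1.0, 0.1), user_vec=(0.9, 0.2), user_community=0,
                  skills=skills, complements=complements, conflicts=set())
```

With `k=1` the only seed is `museum_planning`. The expansion step pulls in `ticket_booking` because it belongs to the same owner, even though it would not have made the semantic top-1, while `nightlife_guide` is never considered.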

3. Key Contributions

Problem Formulation: The paper formulates personalized Agentic RL as a user-conditioned Markov Decision Process (MDP) where optimal behavior depends on individual preferences rather than a single ground truth.
PARPO Algorithm: Proposes an anchor-stabilized, reward-decoupled policy optimization method that effectively learns personalized policies under heterogeneous user reward scales.
Disentangled Supervision & Memory: Introduces a two-stage preference-disentangled reward model to isolate true interests from conformity, and a structured Skill Evolution Graph Memory (PSGM) for preference-aligned skill retrieval.
Empirical Validation: Demonstrates consistent gains across multiple benchmarks, showing that the framework improves personalization and procedural quality while maintaining factual and logical integrity.

4. Experimental Results

The framework was evaluated on ETAPP, ETAPP-Hard (a more challenging split requiring multi-tool coordination and implicit reasoning), and SJAgent (a real-world industrial scenario from a Chinese e-commerce platform).

Performance: The proposed method (PARPO + PSGM) significantly outperformed strong baselines, including prompting methods (ReAct), memory-based agents (Mem0), and various RL algorithms (GRPO, DAPO, GSPO, GiGPO, SkillRL).
- On ETAPP-Hard, it achieved the highest "Judge" scores and "Personal" scores, indicating robustness in complex personalized scenarios.
- On SJAgent, it led in key dimensions such as Data Authenticity, Business Logic, and Task Completion, demonstrating cross-domain generalization.
Ablation Studies:
- Removing skill memory caused the largest drop in performance, confirming its centrality to personalized decision-making.
- Replacing PARPO with standard GRPO or removing user-anchor calibration resulted in significant performance degradation, validating the necessity of the decoupled, anchor-stabilized approach.
- Removing disentanglement from the reward model (dropping the separate interest/conformity branches) also reduced performance, highlighting the importance of separating true preferences from noise.
Human & LLM Evaluation: In a blinded study on 20 ETAPP tasks, PARPO achieved the highest average scores from both human experts and LLM judges, particularly in "User Relevance," suggesting that the improvements stem from genuine personalization rather than just fluency.
Training Dynamics: PARPO showed superior training stability, higher success rates, and better tool-call success compared to other RL strategies, with stable KL divergence indicating efficient policy improvement without excessive deviation.

5. Significance and Limitations

Significance:
The paper argues that personalization fundamentally changes the optimization target of Agentic RL. By moving beyond "one-size-fits-all" policies to user-contingent trajectory optimization, the proposed framework bridges the gap between generic task competence and user-specific alignment. It demonstrates that training-time optimization, supported by disentangled reward modeling and structured memory, is essential for agents operating in real-world, preference-driven environments.

Limitations:
The authors acknowledge that the scale of human evaluation is limited due to annotation costs, with judgments provided by only 15 experts on 20 sampled examples. While these results align with LLM evaluations, the authors note that future work should expand human studies to larger, more diverse pools to better assess robustness and real-world validity. Additionally, the current implementation relies on specific graph structures and anchor mechanisms that may require adaptation for different application domains.
