Reinforcement Learning for Self-Improving Agent with Skill Library

Imagine you hire a very smart, but inexperienced, personal assistant to help you manage your digital life. You ask them to book a flight, buy groceries, and split a dinner bill with friends.

The Problem:
At first, your assistant is great at following instructions. But if you ask them to do a slightly different version of the same task (like booking a flight to a different city), they often forget what they learned the first time. They have to start from scratch, re-reading the manual and clicking every single button again. They get the job done in the moment, but they don't get smarter over time. They don't build a "toolbox" of shortcuts.

The Old Solution (The "Prompt" Method):
Researchers tried to fix this by giving the assistant a giant notebook of "how-to" instructions (prompts). They'd say, "Hey, remember that time you booked a flight? Do it exactly like that!"

The Flaw: This relies on the assistant just guessing the right instructions from the notebook. Sometimes they guess wrong, or they get confused by the sheer size of the notebook. It's like trying to remember a recipe by reading a 500-page cookbook every time you want to make toast.

The New Solution: SAGE (The "Mentor & Apprentice" System)
This paper introduces a new way to train AI agents called SAGE (Skill Augmented GRPO for self-Evolution). Think of it as turning your assistant into a master craftsman who builds their own toolbox while they work.

Here is how it works, using a simple analogy:

1. The "Chain Reaction" Training (Sequential Rollout)

Instead of asking the assistant to do one task and then stopping, SAGE makes them do a chain of similar tasks back-to-back. (The paper's best-performing setup chains two tasks; three are shown here to make the pattern clear.)

Task 1: Book a flight to Paris.
Task 2: Book a flight to London.
Task 3: Book a flight to Tokyo.

As the assistant works on Task 1, they figure out a clever shortcut (a "skill") to book flights quickly. In the old way, they would just do it and forget. In SAGE, they save that shortcut into a digital toolbox.

2. The "Toolbox" (Skill Library)

When they move to Task 2 (London), they don't start from zero. They open their toolbox, find the "Flight Booking Shortcut" they just made, and use it.

The Magic: If the shortcut works perfectly, the system gives them a double bonus. They get points for finishing the task plus extra points for creating a useful tool that helped them later.
If the shortcut turns out not to help on the next task, they miss out on the bonus. This teaches them to build good tools, not just any tools.

3. The "Mentor" (Supervised Fine-Tuning)

Before the assistant starts learning on their own, the researchers gave them a crash course. They showed them examples of a "Super Expert" (a very advanced AI) doing these tasks perfectly. This is like a master chef showing an apprentice the proper knife skills before letting them cook. This ensures the assistant doesn't learn bad habits right away.

4. The Result: Smarter and Faster

Because the assistant is constantly building and reusing their own toolbox:

They get faster: Instead of clicking 20 buttons to book a flight, they might just run one "Book Flight" command.
They get smarter: They learn that the "Flight Booking" tool works for any city, not just the first one they tried.
They save money: In the computer world, "tokens" are roughly the unit of cost. By using shortcuts, the assistant generates 59% fewer tokens to get the job done.

The Big Picture

The paper tested this on a complex world called AppWorld, where agents have to juggle apps like Amazon, Spotify, and Gmail.

Without SAGE: The AI was like a student who memorized the answers to one specific test but failed the next one because the numbers were slightly different.
With SAGE: The AI became like a seasoned pro. It learned the principles of the job, built a set of reusable tools, and could handle new, tricky situations with ease.

In short: SAGE teaches AI not just to do the work, but to learn how to learn, creating a personal library of shortcuts that makes them faster, cheaper, and more reliable every single time they work.

Here is a detailed technical summary of the paper "Reinforcement Learning for Self-Improving Agent with Skill Library" (SAGE).

1. Problem Statement

Large Language Model (LLM) agents have shown promise in complex reasoning and multi-turn interactions but face significant limitations when deployed in new environments:

Lack of Continual Learning: RL-trained agents often struggle to adapt to new scenarios or leverage ongoing experiences for future tasks.
Inefficiency of Prompt-Based Skill Libraries: Existing approaches to skill libraries (where agents store reusable skills) rely heavily on manual prompting. This limits the quality and adaptability of skills because they are constrained by the base model's instruction-following capabilities.
Inconsistency in Skill Generation: Previous frameworks often separate task execution from skill definition (defining skills only after task completion), leading to context length issues and learning inconsistencies.

The core challenge is to develop a framework that enables agents to autonomously generate, validate, and utilize executable skills within a library to improve both task success rates and efficiency through Reinforcement Learning (RL).

2. Methodology: SAGE Framework

The authors propose SAGE (Skill Augmented GRPO for self-Evolution), a novel RL framework built upon Group Relative Policy Optimization (GRPO). The framework consists of three core components:

A. Unified Skill Library Agent

Unlike previous methods that define skills post-hoc, SAGE adopts a unified format for both task solving and skill generation (inspired by DynaSaur).

Mechanism: When interacting with an API environment, the agent first generates a programmatic function (a skill) and then calls it to execute the task, rather than chaining raw API calls directly.
Actions: The agent can perform four actions:
1. Skill Usage: Execute a retrieved skill.
2. Skill Generation: Define a new function to solve the current task.
3. Skill Update: Modify a failed skill and retry.
4. Skill Save: Store successful skills in the library.
Initialization: To overcome the difficulty of open-source models following complex skill-library prompts, the authors first apply Supervised Fine-Tuning (SFT) using high-quality trajectories generated by an expert model (Claude 3.5 Sonnet V2).
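
The four actions above can be pictured with a small in-memory library. This is a minimal sketch, not the authors' implementation: the `SkillLibrary` class, its method names, the word-overlap retrieval, and the toy `api` dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class SkillLibrary:
    """Hypothetical in-memory store mapping skill names to Python source."""
    skills: Dict[str, str] = field(default_factory=dict)

    def save(self, name: str, source: str) -> None:
        # Skill Save (and Skill Update, by overwriting an existing entry).
        self.skills[name] = source

    def retrieve(self, query: str) -> Optional[str]:
        # Naive retrieval (assumption): pick the skill whose name shares
        # the most words with the task description.
        words = set(query.lower().split())
        best, best_overlap = None, 0
        for name in self.skills:
            overlap = len(words & set(name.lower().split("_")))
            if overlap > best_overlap:
                best, best_overlap = name, overlap
        return best

    def load(self, name: str, env: dict) -> Callable:
        # Skill Usage: turn stored source back into a callable.
        namespace = dict(env)
        exec(self.skills[name], namespace)
        return namespace[name]


# Toy stand-in for an app API (assumption; not the AppWorld API).
api = {
    "search_flights": lambda city: [f"{city}-101", f"{city}-202"],
    "book": lambda flight_id: f"booked {flight_id}",
}
env = {"api": api}

# Skill Generation: the agent writes a function instead of chaining raw calls.
BOOK_FLIGHT_SRC = '''
def book_flight(city):
    flights = api["search_flights"](city)
    return api["book"](flights[0])
'''

library = SkillLibrary()
namespace = dict(env)
exec(BOOK_FLIGHT_SRC, namespace)
first_result = namespace["book_flight"]("Paris")  # execute on the first task
library.save("book_flight", BOOK_FLIGHT_SRC)      # save after it worked

# Skill Usage on a later, similar task.
name = library.retrieve("book a flight to London")
second_result = library.load(name, env)("London")
print(first_result, "|", second_result)
```

Storing skills as source text rather than live function objects is a design choice of this sketch; it keeps the library easy to inspect and to rewrite when a skill has to be updated.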

B. Sequential Rollout

To enable end-to-end RL where skill generation in one task influences performance in subsequent tasks, SAGE introduces Sequential Rollout.

Task Chains: Instead of training on single isolated tasks, the agent is trained on chains of similar tasks (e.g., two tasks within the same scenario).
Flow: The agent solves Task 1, generating skills that are immediately added to the library. These skills are then available for use in Task 2.
Benefit: Because both tasks are rolled out together, the reward for successfully using a skill in Task 2 can be credited back to the generation of that skill in Task 1, creating a feedback loop for skill quality.
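
The flow above can be sketched as a loop that shares one library across the chain and then checks which earlier skills a later task reused. Everything here is illustrative: the `Trajectory` fields, the `Policy` interface, and the rule that only skills from successful tasks are saved are assumptions, and `toy_policy` stands in for an LLM agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Trajectory:
    task: str
    success: bool
    generated: Set[str] = field(default_factory=set)  # skills created in this task
    used: Set[str] = field(default_factory=set)       # library skills reused here


# Hypothetical interface: (task, available skill names) -> trajectory.
Policy = Callable[[str, Set[str]], Trajectory]


def sequential_rollout(policy: Policy, chain: List[str]) -> List[Trajectory]:
    """Run the tasks in order, sharing one skill library across the chain."""
    library: Set[str] = set()
    trajectories = []
    for task in chain:
        traj = policy(task, set(library))  # the policy sees skills saved so far
        if traj.success:
            library |= traj.generated      # assumption: save only after success
        trajectories.append(traj)
    return trajectories


def credited_skills(trajectories: List[Trajectory]) -> Dict[str, Set[str]]:
    """For each task, the skills it generated that a later successful task reused."""
    credit = {}
    for i, earlier in enumerate(trajectories):
        reused: Set[str] = set()
        for later in trajectories[i + 1:]:
            if later.success:
                reused |= earlier.generated & later.used
        credit[earlier.task] = reused
    return credit


def toy_policy(task: str, available: Set[str]) -> Trajectory:
    # Reuse "book_flight" if it is already in the library, otherwise create it.
    if "book_flight" in available:
        return Trajectory(task, success=True, used={"book_flight"})
    return Trajectory(task, success=True, generated={"book_flight"})


chain = ["book flight to Paris", "book flight to London"]
trajs = sequential_rollout(toy_policy, chain)
credit = credited_skills(trajs)
print(credit)
```

In this toy run the first task is credited with `book_flight` because the second task reused it, which is the signal a skill generation reward would be built on.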

C. Skill-Integrated Reward

Standard RL relies on outcome-based rewards (task success). SAGE introduces a composite reward function to explicitly encourage skill behaviors:

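Formula: $R = R_{\text{outcome}} + R_{\text{skill}}$, where $R_{\text{skill}}$ collects the skill generation and skill usage bonuses listed below.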
Components:
1. Outcome-based Reward: Binary success/failure of the task.
2. Skill Generation Reward: An extra reward (+1) if a skill generated in Task 1 is successfully used to complete Task 2.
3. Skill Usage Reward: An extra reward (+1) if Task 2 successfully utilizes a skill from the library.
Penalty: A -1.0 penalty is applied if the agent terminates without generating code.
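
As a rough illustration of how these terms might combine for a two-task chain, here is a minimal sketch. It is not the paper's code. The `TaskOutcome` type and function names are hypothetical, and two points are assumptions: that the generation bonus is assigned to Task 1 and the usage bonus to Task 2, and that the -1.0 penalty replaces the other terms rather than adding to them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TaskOutcome:
    success: bool        # did the task pass its checks?
    produced_code: bool  # did the agent generate any code before terminating?


def task_reward(outcome: TaskOutcome, skill_bonus: bool) -> float:
    """R = R_outcome + R_skill for one task."""
    if not outcome.produced_code:
        return -1.0  # assumption: the penalty overrides the other terms
    r_outcome = 1.0 if outcome.success else 0.0
    r_skill = 1.0 if skill_bonus else 0.0
    return r_outcome + r_skill


def chain_rewards(
    task1: TaskOutcome,
    task2: TaskOutcome,
    task2_used_task1_skill: bool,
    task2_used_library_skill: bool,
) -> Tuple[float, float]:
    # Skill Generation Reward: a Task 1 skill helped complete Task 2.
    gen_bonus = task2.success and task2_used_task1_skill
    # Skill Usage Reward: Task 2 succeeded while using a library skill.
    use_bonus = task2.success and task2_used_library_skill
    return task_reward(task1, gen_bonus), task_reward(task2, use_bonus)


ok = TaskOutcome(success=True, produced_code=True)
failed = TaskOutcome(success=False, produced_code=True)
no_code = TaskOutcome(success=False, produced_code=False)

both_succeed = chain_rewards(ok, ok, True, True)
second_fails = chain_rewards(ok, failed, True, True)
second_quits = chain_rewards(ok, no_code, False, False)
print(both_succeed, second_fails, second_quits)
```

Under these assumptions a skill written in Task 1 only earns its bonus when Task 2 actually succeeds with it, so writing a skill that is never useful pays no more than solving the task directly.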

3. Key Contributions

SAGE Framework: A novel RL framework that integrates skill libraries directly into the training loop via Sequential Rollout and Skill-Integrated Rewards.
Unified Execution Format: Moving away from post-task skill definition to a unified "generate-then-execute" function format, reducing context overhead and improving learning consistency.
Efficiency and Accuracy: Demonstrating that RL with skill libraries can significantly reduce token usage and interaction steps while improving success rates, surpassing both prompting-based and standard RL baselines.
Empirical Validation: Extensive experiments on the AppWorld dataset showing that SAGE enables an open-source model (Qwen2.5-32B-Instruct) to outperform expert-level prompting and previous RL methods.

4. Experimental Results

Experiments were conducted on the AppWorld dataset (250 scenarios, 750 tasks) using Qwen2.5-32B-Instruct as the base model.

Performance Metrics:
- Scenario Goal Completion (SGC): SAGE achieved 60.7%, an absolute improvement of 8.9 percentage points over the baseline GRPO (51.8%) and significantly higher than prompting-based methods.
- Task Goal Completion (TGC): SAGE achieved 72.0% vs. 69.2% for baseline GRPO.
Efficiency Metrics:
- Interaction Steps: Reduced by 26% (12.1 steps vs. 16.4 for baseline).
- Token Generation: Reduced by 59% (1,475 tokens vs. 3,613 for baseline).
Skill Utilization Analysis:
- Agents trained with SAGE showed more than twice the success rate of baseline models when utilizing learned skills.
- SAGE significantly improved the "Success Skill Usage Rate," indicating the model learned not just to generate skills, but to generate useful ones.
Ablation Studies:
- SFT Initialization: Crucial for performance; training directly from the base model yielded poor results.
- Reward Design: The Skill-Integrated Reward outperformed pure outcome-based or chain-based rewards, which suggests that explicitly rewarding skill generation and usage matters.
- Task Chain Length: A 2-task chain was found to be optimal; longer chains (3 tasks) introduced reward imbalance and gradient variance issues.
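
The headline percentages above follow directly from the raw figures quoted in this section. A quick check, using only those numbers (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100


steps_cut = relative_reduction(16.4, 12.1)   # interaction steps
tokens_cut = relative_reduction(3613, 1475)  # generated tokens
sgc_gain = 60.7 - 51.8                       # percentage points, SAGE vs. GRPO
tgc_gain = 72.0 - 69.2                       # percentage points, SAGE vs. GRPO

print(f"steps:  -{steps_cut:.1f}%")
print(f"tokens: -{tokens_cut:.1f}%")
print(f"SGC:    +{sgc_gain:.1f} points")
print(f"TGC:    +{tgc_gain:.1f} points")
```

Note that the step and token figures are relative reductions, while the SGC and TGC gains are absolute differences in percentage points.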

5. Significance

This paper represents a significant step forward in autonomous agent self-improvement.

Beyond Prompting: It moves the paradigm from static, prompt-engineered skill libraries to dynamic, RL-optimized skill acquisition.
Scalability: By showing that an open-source model can surpass expert prompting when equipped with a skill library and proper RL training, it lowers the barrier for deploying high-performance agents.
Efficiency: The drastic reduction in tokens and interaction steps suggests that skill libraries are a viable path toward making LLM agents more cost-effective and practical for real-world, long-horizon tasks.
Generalizability: While tested on AppWorld, the framework of Sequential Rollout and Skill-Integrated Rewards offers a blueprint for improving agents in other tool-using domains (e.g., coding, web browsing, robotics).

The code for SAGE is open-sourced at https://github.com/amazon-science/SAGE.