Reinforcement Learning for Self-Improving Agent with Skill Library

This paper introduces SAGE, a novel Reinforcement Learning framework that enhances LLM-based agents' self-improvement capabilities by utilizing a skill library with sequential rollouts and skill-integrated rewards, achieving significantly higher goal completion rates and greater efficiency than existing methods on the AppWorld benchmark.

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong

Published Wed, 11 Ma

Imagine you hire a very smart, but inexperienced, personal assistant to help you manage your digital life. You ask them to book a flight, buy groceries, and split a dinner bill with friends.

The Problem:
At first, your assistant is great at following instructions. But if you ask them to do a slightly different version of the same task (like booking a flight to a different city), they often forget what they learned the first time. They have to start from scratch, re-reading the manual and clicking every single button again. They get the job done in the moment, but they don't get smarter over time. They don't build a "toolbox" of shortcuts.

The Old Solution (The "Prompt" Method):
Researchers tried to fix this by giving the assistant a giant notebook of "how-to" instructions (prompts). They'd say, "Hey, remember that time you booked a flight? Do it exactly like that!"

  • The Flaw: This relies on the assistant just guessing the right instructions from the notebook. Sometimes they guess wrong, or they get confused by the sheer size of the notebook. It's like trying to remember a recipe by reading a 500-page cookbook every time you want to make toast.

The New Solution: SAGE (The "Mentor & Apprentice" System)
This paper introduces a new way to train AI agents called SAGE (Skill Augmented GRPO for self-Evolution). Think of it as turning your assistant into a master craftsman who builds their own toolbox while they work.

Here is how it works, using a simple analogy:

1. The "Chain Reaction" Training (Sequential Rollout)

Instead of asking the assistant to do one task and then stopping, SAGE makes them do a chain of three similar tasks back-to-back.

  • Task 1: Book a flight to Paris.
  • Task 2: Book a flight to London.
  • Task 3: Book a flight to Tokyo.

As the assistant works on Task 1, they figure out a clever shortcut (a "skill") to book flights quickly. Under the old approach, they would use it once and forget it. In SAGE, they save that shortcut into a digital toolbox that stays available for the rest of the chain.
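The chain idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the names `SkillLibrary`, `solve_task`, and `sequential_rollout` are hypothetical, and the "skill" is a trivial function standing in for whatever code the agent would actually write.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Skill = Callable[[str], str]


@dataclass
class SkillLibrary:
    """Hypothetical toolbox that persists across the tasks in one chain."""
    skills: Dict[str, Skill] = field(default_factory=dict)

    def save(self, name: str, skill: Skill) -> None:
        self.skills[name] = skill

    def find(self, name: str) -> Optional[Skill]:
        return self.skills.get(name)


def solve_task(city: str, library: SkillLibrary) -> Tuple[str, str]:
    """Solve one task, reusing a saved skill if one exists."""
    skill = library.find("book_flight")
    if skill is not None:
        return skill(city), "reused"

    # First encounter: work it out, then save the shortcut for later tasks.
    def book_flight(destination: str) -> str:
        return f"booked flight to {destination}"

    library.save("book_flight", book_flight)
    return book_flight(city), "created"


def sequential_rollout(cities: List[str]) -> List[Tuple[str, str]]:
    """Run a chain of similar tasks against one shared library."""
    library = SkillLibrary()
    return [solve_task(city, library) for city in cities]


if __name__ == "__main__":
    for result, how in sequential_rollout(["Paris", "London", "Tokyo"]):
        print(f"{result} (skill {how})")
```

The point of the sketch is the shared `library` object: because it outlives each individual task, whatever is saved during the Paris task is available for London and Tokyo.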

2. The "Toolbox" (Skill Library)

When they move to Task 2 (London), they don't start from zero. They open their toolbox, find the "Flight Booking Shortcut" they just made, and use it.

  • The Magic: If the shortcut works perfectly, the system gives them a double bonus. They get points for finishing the task plus extra points for creating a useful tool that helped them later.
  • If they mess up the shortcut, they get a penalty. This teaches them to build good tools, not just any tools.
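The bonus-and-penalty idea above might look something like the following. This is a minimal sketch under stated assumptions: the function name, the three `skill_outcome` labels, and the numeric values are all made up for illustration, and the paper's actual reward formula may differ.

```python
# Hypothetical reward values, chosen only to make the example concrete.
TASK_REWARD = 1.0     # points for finishing the task
SKILL_BONUS = 0.5     # extra points when a saved skill helped
SKILL_PENALTY = -0.5  # deduction when a skill was broken or misused


def skill_integrated_reward(task_completed: bool, skill_outcome: str) -> float:
    """Combine the task result with how the skill performed.

    skill_outcome is one of:
      "helped" - a skill was used and the task succeeded
      "failed" - a skill was used but it did not work
      "none"   - no skill was involved
    """
    if skill_outcome not in ("helped", "failed", "none"):
        raise ValueError(f"unknown skill_outcome: {skill_outcome!r}")

    reward = TASK_REWARD if task_completed else 0.0
    if skill_outcome == "helped" and task_completed:
        reward += SKILL_BONUS
    elif skill_outcome == "failed":
        reward += SKILL_PENALTY
    return reward


if __name__ == "__main__":
    print(skill_integrated_reward(True, "helped"))   # task done with a useful skill
    print(skill_integrated_reward(True, "none"))     # task done without any skill
    print(skill_integrated_reward(False, "failed"))  # task failed, skill was faulty
```

With these illustrative values, a rollout that finishes the task using a working skill scores higher than one that finishes without a skill, which in turn scores higher than one whose skill failed.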

3. The "Mentor" (Supervised Fine-Tuning)

Before the assistant starts learning on their own, the researchers give them a crash course: examples of a "Super Expert" (a very advanced AI) doing these tasks perfectly. This is like a master chef showing an apprentice proper knife skills before letting them cook. It helps keep the assistant from picking up bad habits right away.

4. The Result: Smarter and Faster

Because the assistant is constantly building and reusing their own toolbox:

  • They get faster: Instead of clicking 20 buttons to book a flight, they might just run one "Book Flight" command.
  • They get smarter: They learn that the "Flight Booking" tool works for any city, not just the first one they tried.
  • They save money: In the computer world, "tokens" (the chunks of text an AI reads and writes) are like money. By using shortcuts, the assistant uses 59% fewer tokens to get the job done.
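To make the 59% figure concrete, here is the arithmetic behind a percentage reduction. The token counts in the example (1,000 and 410) are hypothetical numbers picked to reproduce that percentage; they are not measurements from the paper.

```python
def token_reduction_percent(baseline_tokens: int, skill_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - skill_tokens) / baseline_tokens


if __name__ == "__main__":
    # Hypothetical: 1,000 tokens step by step vs. 410 tokens with a saved skill.
    saved = token_reduction_percent(1000, 410)
    print(f"{saved:.0f}% fewer tokens")
```

In other words, a 59% reduction means a task that used to cost 1,000 tokens would cost about 410.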

The Big Picture

The paper tested this on AppWorld, a benchmark environment where agents have to juggle apps like Amazon, Spotify, and Gmail.

  • Without SAGE: The AI was like a student who memorized the answers to one specific test but failed the next one because the numbers were slightly different.
  • With SAGE: The AI became like a seasoned pro. It learned the principles of the job, built a set of reusable tools, and could handle new, tricky situations with ease.

In short: SAGE teaches an AI agent not just to do the work, but to learn how to learn, building a personal library of shortcuts that makes it faster, cheaper, and more reliable the more it works.