XSkill: Continual Learning from Experience and Skills in Multimodal Agents

XSkill is a dual-stream framework that lets multimodal agents continually improve their reasoning and tool orchestration in open-ended settings, without parameter updates, by extracting and retrieving two complementary kinds of visually grounded knowledge: action-level experiences and task-level skills.

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. (May) Fung

Published 2026-03-13

Imagine you have a very smart robot assistant (a "Multimodal Agent") that can see pictures, read text, and use tools like a magnifying glass, a search engine, or a code calculator to solve problems.

Right now, these robots are like brilliant but forgetful interns. They are great at solving a problem once, but if you ask them a similar problem the next day, they often forget what they learned yesterday. They might try the same wrong approach over and over, or they might get confused by a tricky image (like a picture that is upside down) because they don't have a "mental note" to check for that.

XSKILL is a new system designed to turn these forgetful interns into seasoned veterans without needing to retrain their brains (which is expensive and slow). It does this by giving them two specific types of "cheat sheets" that they build up over time: Skills and Experiences.

Here is how it works, using simple analogies:

1. The Two Types of Cheat Sheets

Think of the robot's memory as a library with two different sections:

  • Skills (The "Recipe Book"):

    • What it is: These are structured, step-by-step guides for complex tasks.
    • The Analogy: Imagine a master chef's recipe book. It doesn't just say "make soup"; it says, "First, chop the onions. Then, sauté them. If the pot is too hot, lower the flame."
    • How XSKILL uses it: If the robot learns how to solve a complex math problem involving a chart, it writes down a "Recipe" (a Skill) called Chart Analysis. Next time it sees a chart, it pulls out this recipe to know exactly which tools to use and in what order. This fixes the problem of inefficient tool use (wasting time figuring out the basics).
  • Experiences (The "War Stories"):

    • What it is: These are short, punchy tips about specific situations, especially mistakes.
    • The Analogy: Imagine a veteran carpenter telling a junior: "Hey, if you see a piece of wood that looks warped, don't trust your eyes immediately; measure it with a ruler first." Or, "If the paint is dark, turn on the bright light before you try to match the color."
    • How XSKILL uses it: If the robot tries to search an image and fails because the image was upside down, it writes down a "War Story" (an Experience): "When an image looks upside down, rotate it before searching." This fixes the problem of inflexible orchestration (not knowing how to adapt when things go wrong).

2. The Magic Loop: How XSKILL Learns

Most robots just "do" a task and move on. XSKILL adds a special "Review Session" after every task.

  • Phase 1: The "Debrief" (Accumulation)
    Imagine the robot tries to solve a puzzle. It might try three different ways (like rolling dice three times).

    • Visual Grounding: The robot doesn't just read the text of what it did; it looks at the pictures it saw. It realizes, "Oh, I failed because I didn't notice the image was dark."
    • The Critique: A smarter version of the robot (the "Manager") looks at the successful attempts and the failed ones. It asks: "What worked? What failed? Why?"
    • The Update: It then updates the Recipe Book (Skills) with a better workflow and adds a new War Story (Experience) to the tip list. It cleans up old, redundant tips to keep the library organized.
  • Phase 2: The "Mission" (Inference)
    Now, a new task comes in.

    • The Search: The robot breaks the new task into small parts. For each part, it asks: "Do I have a Recipe for this? Do I have a War Story about this specific problem?"
    • The Adaptation: This is the cool part. The robot doesn't just copy-paste the old advice. It adapts it. If the old tip said "Rotate the image," and the new image is upside down, the robot updates the tip to "Rotate this specific image 180 degrees."
    • The Result: The robot solves the new problem much faster and more accurately because it's standing on the shoulders of its past self.
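The two phases above can be sketched as a single loop: a debrief step that compares successes and failures and updates the memory, and a mission step that retrieves and adapts what was stored. All helper names here (`accumulate`, `infer`, the dictionary fields) are hypothetical placeholders for illustration, not the paper's API.

```python
def accumulate(memory: dict, task: str, attempts: list[dict]) -> None:
    """Phase 1 'debrief': compare attempts, then update both memory stores."""
    successes = [a for a in attempts if a["success"]]
    failures = [a for a in attempts if not a["success"]]
    if successes:
        # Distill the winning tool order into a reusable recipe (Skill).
        memory["skills"][task] = successes[0]["tool_sequence"]
    for fail in failures:
        # Turn each visually grounded failure into a condition -> advice tip,
        # skipping duplicates to keep the library tidy.
        tip = (fail["visual_condition"], fail["fix"])
        if tip not in memory["experiences"]:
            memory["experiences"].append(tip)

def infer(memory: dict, task: str, observation: str) -> dict:
    """Phase 2 'mission': retrieve the recipe and adapt any matching tips."""
    plan = {"recipe": memory["skills"].get(task, []), "tips": []}
    for condition, advice in memory["experiences"]:
        if condition in observation:
            # Specialize the general tip to this concrete observation.
            plan["tips"].append(f"{advice} (because: {condition})")
    return plan

memory = {"skills": {}, "experiences": []}
attempts = [
    {"success": False, "tool_sequence": ["search"],
     "visual_condition": "image upside down",
     "fix": "rotate 180 degrees before searching"},
    {"success": True, "tool_sequence": ["rotate", "search", "calculator"]},
]
accumulate(memory, "chart question", attempts)
plan = infer(memory, "chart question", "the image upside down and blurry")
print(plan["recipe"])  # the stored workflow
print(plan["tips"])    # the adapted war stories
```

The key design point this sketch tries to capture is that nothing is retrained: both phases only read and write the external memory, which is why the agent can keep improving without parameter updates.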

3. Why This is a Big Deal

  • No "Brain Surgery": Usually, to make an AI smarter, you have to retrain it on massive amounts of data (like going back to school for a whole year). XSKILL lets the robot learn on the fly just by keeping a journal. It's like learning from a mentor instead of going to university.
  • It Sees What You See: Many previous systems only read text logs. XSKILL looks at the images the robot saw. It understands that "The image was dark" is a visual fact, not just a text error.
  • It Generalizes: Because it separates the "Recipe" (general rules) from the "War Story" (specific context), the robot can apply what it learned about one type of problem to a completely different type of problem.

The Bottom Line

XSKILL is like giving a robot a personal coach that watches it work, writes down the best strategies and common mistakes, and then hands those notes to the robot before every new challenge.

Instead of a robot that forgets everything after a task is done, XSKILL creates a robot that gets smarter, faster, and more reliable every single day, simply by remembering what it learned yesterday.