XSkill: Continual Learning from Experience and Skills in Multimodal Agents

XSkill is a dual-stream framework that lets multimodal agents continually improve their reasoning and tool orchestration in open-ended settings, without parameter updates, by extracting and retrieving two complementary kinds of visually grounded knowledge: action-level experiences and task-level skills.

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. (May) Fung

Published 2026-03-13

Imagine you have a very smart robot assistant (a "Multimodal Agent") that can see pictures, read text, and use tools like a magnifying glass, a search engine, or a code calculator to solve problems.

Right now, these robots are like brilliant but forgetful interns. They are great at solving a problem once, but if you ask them a similar problem the next day, they often forget what they learned yesterday. They might try the same wrong approach over and over, or they might get confused by a tricky image (like a picture that is upside down) because they don't have a "mental note" to check for that.

XSKILL is a new system designed to turn these forgetful interns into seasoned veterans without needing to retrain their brains (which is expensive and slow). It does this by giving them two specific types of "cheat sheets" that they build up over time: Skills and Experiences.

Here is how it works, using simple analogies:

1. The Two Types of Cheat Sheets

Think of the robot's memory as a library with two different sections:

  • Skills (The "Recipe Book"):

    • What it is: These are structured, step-by-step guides for complex tasks.
    • The Analogy: Imagine a master chef's recipe book. It doesn't just say "make soup"; it says, "First, chop the onions. Then, sauté them. If the pot is too hot, lower the flame."
    • How XSKILL uses it: If the robot learns how to solve a complex math problem involving a chart, it writes down a "Recipe" (a Skill) called Chart Analysis. Next time it sees a chart, it pulls out this recipe to know exactly which tools to use and in what order. This fixes the problem of inefficient tool use (wasting time figuring out the basics).
  • Experiences (The "War Stories"):

    • What it is: These are short, punchy tips about specific situations, especially mistakes.
    • The Analogy: Imagine a veteran carpenter telling a junior: "Hey, if you see a piece of wood that looks warped, don't trust your eyes immediately; measure it with a ruler first." Or, "If the paint is dark, turn on the bright light before you try to match the color."
    • How XSKILL uses it: If the robot tries to search an image and fails because the image was upside down, it writes down a "War Story" (an Experience): "When an image looks upside down, rotate it before searching." This fixes the problem of inflexible orchestration (not knowing how to adapt when things go wrong).

2. The Magic Loop: How XSKILL Learns

Most robots just "do" a task and move on. XSKILL adds a special "Review Session" after every task.

  • Phase 1: The "Debrief" (Accumulation)
    Imagine the robot tries to solve a puzzle. It might try three different ways (like rolling dice three times).

    • Visual Grounding: The robot doesn't just read the text of what it did; it looks at the pictures it saw. It realizes, "Oh, I failed because I didn't notice the image was dark."
    • The Critique: A smarter version of the robot (the "Manager") looks at the successful attempts and the failed ones. It asks: "What worked? What failed? Why?"
    • The Update: It then updates the Recipe Book (Skills) with a better workflow and adds a new War Story (Experience) to the tip list. It cleans up old, redundant tips to keep the library organized.
  • Phase 2: The "Mission" (Inference)
    Now, a new task comes in.

    • The Search: The robot breaks the new task into small parts. For each part, it asks: "Do I have a Recipe for this? Do I have a War Story about this specific problem?"
    • The Adaptation: This is the cool part. The robot doesn't just copy-paste the old advice. It adapts it. If the old tip said "Rotate the image," and the new image is upside down, the robot updates the tip to "Rotate this specific image 180 degrees."
    • The Result: The robot solves the new problem much faster and more accurately because it's standing on the shoulders of its past self.
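The two phases above can be sketched as a single loop: a debrief step that compares successes and failures and updates the memory, and a mission step that retrieves and adapts what was stored. All helper names here (`accumulate`, `infer`, the dictionary fields) are hypothetical placeholders for illustration, not the paper's API.

```python
def accumulate(memory: dict, task: str, attempts: list[dict]) -> None:
    """Phase 1 'debrief': compare attempts, then update both memory stores."""
    successes = [a for a in attempts if a["success"]]
    failures = [a for a in attempts if not a["success"]]
    if successes:
        # Distill the winning tool order into a reusable recipe (Skill).
        memory["skills"][task] = successes[0]["tool_sequence"]
    for fail in failures:
        # Turn each visually grounded failure into a condition -> advice tip,
        # skipping duplicates to keep the library tidy.
        tip = (fail["visual_condition"], fail["fix"])
        if tip not in memory["experiences"]:
            memory["experiences"].append(tip)

def infer(memory: dict, task: str, observation: str) -> dict:
    """Phase 2 'mission': retrieve the recipe and adapt any matching tips."""
    plan = {"recipe": memory["skills"].get(task, []), "tips": []}
    for condition, advice in memory["experiences"]:
        if condition in observation:
            # Specialize the general tip to this concrete observation.
            plan["tips"].append(f"{advice} (because: {condition})")
    return plan

memory = {"skills": {}, "experiences": []}
attempts = [
    {"success": False, "tool_sequence": ["search"],
     "visual_condition": "image upside down",
     "fix": "rotate 180 degrees before searching"},
    {"success": True, "tool_sequence": ["rotate", "search", "calculator"]},
]
accumulate(memory, "chart question", attempts)
plan = infer(memory, "chart question", "the image upside down and blurry")
print(plan["recipe"])  # the stored workflow
print(plan["tips"])    # the adapted war stories
```

The key design point this sketch tries to capture is that nothing is retrained: both phases only read and write the external memory, which is why the agent can keep improving without parameter updates.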

3. Why This is a Big Deal

  • No "Brain Surgery": Usually, to make an AI smarter, you have to retrain it on massive amounts of data (like going back to school for a whole year). XSKILL lets the robot learn on the fly just by keeping a journal. It's like learning from a mentor instead of going to university.
  • It Sees What You See: Many previous systems only read text logs. XSKILL looks at the images the robot saw. It understands that "The image was dark" is a visual fact, not just a text error.
  • It Generalizes: Because it separates the "Recipe" (general rules) from the "War Story" (specific context), the robot can apply what it learned about one type of problem to a completely different type of problem.

The Bottom Line

XSKILL is like giving a robot a personal coach that watches it work, writes down the best strategies and common mistakes, and then hands those notes to the robot before every new challenge.

Instead of a robot that forgets everything after a task is done, XSKILL creates a robot that gets smarter, faster, and more reliable every single day, simply by remembering what it learned yesterday.