Learning Transferable Skills in Action RPGs via Directed Skill Graphs and Selective Adaptation

Imagine you are trying to teach a robot how to play a notoriously difficult video game like Dark Souls. In this game, you have to dodge attacks, aim your camera, move around, and decide when to attack or heal, all in real time. If you try to teach the robot everything at once (like telling a human to "just play the game"), it usually fails. The robot gets overwhelmed, learns slowly, and if the game changes slightly (like a boss getting a new move), it has to start from scratch.

This paper proposes a smarter way to train the robot, using a concept called a "Directed Skill Graph."

Here is the breakdown of their idea using simple analogies:

1. The Problem: The "Swiss Army Knife" vs. The "Specialized Team"

Most AI tries to be a Swiss Army Knife: one single brain trying to do everything at once.

The Flaw: If the game changes, the whole knife has to be re-forged. It's inefficient and fragile.
The Paper's Solution: Instead of one brain, they built a specialized team of five experts. Each expert has one tiny, specific job.

2. The Five Experts (The Skills)

The robot's brain is split into five distinct "skills," each with its own little brain:

The Camera Operator: Just looks at the enemy.
The Lock-On Specialist: Keeps the enemy centered in the crosshairs.
The Footwork Coach: Decides where to walk (strafing, circling).
The Dodge Master: Times the rolls to avoid getting hit.
The Tactician: Decides when to attack and when to drink a healing potion.

The Analogy: Imagine a Formula 1 racing team. You don't have one person who drives, refuels, changes tires, and talks to the radio all at once. You have a driver, a pit crew, and a strategist. They work together, but they are experts in their own lanes.

3. The Training Method: The "Construction Site"

The researchers didn't train all five experts at the same time. They used a hierarchical curriculum, which is like building a house:

Step 1: You build the foundation first (Camera and Lock-on). You don't worry about the roof yet.
Step 2: Once the foundation is solid, you build the walls (Movement).
Step 3: Then you add the roof (Dodging).
Step 4: Finally, you furnish the house (Attack/Heal decisions).

Why this works: By training them in order, the "Dodge Master" learns while the "Camera Operator" is already trained and held fixed. The Dodge Master doesn't have to cope with the camera moving wildly; it can focus entirely on timing its rolls. This makes learning much faster (in technical terms, more sample-efficient).

4. The "Phase Shift" Test: When the Boss Gets Angry

In Dark Souls, bosses often have two phases. Phase 1 is standard; Phase 2 is faster, hits harder, and has new moves.

The Old Way: If the boss changes, the whole AI has to relearn everything from zero.
The New Way (Selective Adaptation): The researchers realized that the "Camera Operator," the "Lock-On Specialist," and the "Footwork Coach" don't need to change. A camera still looks at an enemy whether the boss is slow or fast.
- So, they froze the first three experts (Camera, Lock-on, Movement).
- They only retrained the last two experts (Dodging and Tactician) to handle the new, harder boss moves.

The Result: The robot adapted to the new, harder boss in a fraction of the time it would have taken to retrain the whole system. It's like a musician who knows how to play a song. If the song gets a slightly faster tempo, they don't need to relearn how to hold the guitar or read the notes; they just adjust their finger speed (the "downstream" skills).

5. The Big Takeaway

The paper argues, with experimental support, that breaking a complex problem into small, specialized parts makes AI:

Faster to learn: It learns the basics first, then builds on them.
More flexible: When the world changes, you only have to update the parts that actually changed, not the whole system.
More robust: It doesn't "forget" how to look at the enemy just because the enemy got stronger.

In a nutshell: Instead of trying to teach a robot to be a "perfect player," they taught it to be a team of "perfect specialists" who know exactly how to work together. When the game gets harder, they just tweak the specialists who need it, leaving the experts who are already doing a great job alone.

1. Problem Statement

The paper addresses the challenge of lifelong learning in complex, real-time control environments, specifically using Dark Souls III as a testbed. The core difficulties in this domain include:

Non-stationarity: Tasks change dynamically (e.g., boss phases), requiring agents to adapt without catastrophic forgetting.
Sample Inefficiency: Monolithic end-to-end Reinforcement Learning (RL) policies often require massive amounts of data and fail to generalize when the environment shifts.
Plasticity-Stability Trade-off: Agents must rapidly adapt to new conditions (plasticity) while retaining previously learned, useful behaviors (stability).
Complexity: Real-time combat involves coupled subproblems (camera control, targeting, movement, defense, resource management) that are difficult to learn simultaneously in a single policy.

2. Methodology

The authors propose a Modular Skill Graph Architecture trained via a Hierarchical Curriculum.

A. Directed Skill Graph

Instead of a single monolithic policy, the agent decomposes control into five reusable, specialized skills, each with its own observation space and policy ($\pi_k$):

Camera Control ($C$): Aligns the view with the target.
Target Lock-on ($L$): Maintains a valid lock state on the enemy.
Movement ($M$): Handles positioning and engagement distance.
Dodging ($D$): Executes defensive maneuvers to avoid damage.
Heal-Attack Decision ($H$): Decides when to attack or use healing resources.

Execution: At runtime, these policies run concurrently (multi-threaded). Their outputs are composed into a single control signal via a fixed operator $C(\cdot)$ (a composition operator, distinct from the camera skill $C$ despite the shared letter), mimicking human gameplay where view, movement, and combat decisions happen simultaneously.
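The composition step can be illustrated with a minimal sketch. It assumes each skill emits one discrete action and owns one control channel; the `Control` field names, the `compose` and `step` helpers, and the shared 25-dimensional state are hypothetical simplifications (in the paper each skill has its own observation space and runs in its own thread, whereas this sketch queries the skills sequentially).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

State = Sequence[float]          # compact game-state vector
Policy = Callable[[State], int]  # each skill maps a state to a discrete action

SKILL_ORDER = ("C", "L", "M", "D", "H")


@dataclass(frozen=True)
class Control:
    """One combined control signal per tick (field names are hypothetical)."""
    camera: int
    lock_on: int
    move: int
    dodge: int
    combat: int


def compose(actions: Dict[str, int]) -> Control:
    """Fixed, non-learned composition: each skill drives exactly one channel."""
    return Control(
        camera=actions["C"],
        lock_on=actions["L"],
        move=actions["M"],
        dodge=actions["D"],
        combat=actions["H"],
    )


def step(policies: Dict[str, Policy], state: State) -> Control:
    # Query every skill on the current state, then merge the outputs.
    actions = {name: policies[name](state) for name in SKILL_ORDER}
    return compose(actions)


# Dummy policies that always return a fixed action index, for illustration.
dummy = {name: (lambda s, i=i: i) for i, name in enumerate(SKILL_ORDER)}
control = step(dummy, [0.0] * 25)
```

Because the operator is fixed rather than learned, swapping or retraining one skill changes only its own channel of the control signal.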

B. Hierarchical Curriculum Training

The skills are trained sequentially based on a dependency chain:
$C \rightarrow L \rightarrow M \rightarrow D \rightarrow H$

Staged Training: When training a downstream skill (e.g., $D$), all upstream skills ($C, L, M$) are frozen.
Benefit: This constrains the reachable state distribution to task-relevant configurations, reducing the exploration burden for downstream skills. For example, a competent camera and lock-on policy stabilize the input for the movement and dodging policies.
Cooperation: The curriculum encourages cooperative specialization; downstream skills learn to optimize their objectives without disrupting the established competencies of upstream skills.
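The staged schedule described above can be sketched as a simple loop over the dependency chain. This is an illustration under stated assumptions, not the authors' training code: the `Skill` class, its `update` counter (standing in for a real learning step), and `steps_per_stage` are hypothetical.

```python
from typing import Dict, List, Tuple

ORDER = ["C", "L", "M", "D", "H"]  # dependency chain from the text


class Skill:
    """Stand-in for one skill's learner; `updates` counts learning steps."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.frozen = False
        self.updates = 0

    def update(self) -> None:
        if self.frozen:
            raise RuntimeError(f"skill {self.name} is frozen")
        self.updates += 1  # placeholder for a real learning step


def train_curriculum(
    skills: Dict[str, Skill], steps_per_stage: int
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Train skills one at a time; return (skill, frozen upstream) per stage."""
    schedule = []
    for i, name in enumerate(ORDER):
        upstream = tuple(ORDER[:i])
        # Every upstream skill must already be trained and frozen.
        assert all(skills[u].frozen for u in upstream)
        for _ in range(steps_per_stage):
            skills[name].update()
        skills[name].frozen = True  # freeze before moving downstream
        schedule.append((name, upstream))
    return schedule


skills = {n: Skill(n) for n in ORDER}
schedule = train_curriculum(skills, steps_per_stage=3)
```

In this sketch, the stage for `D` runs with `("C", "L", "M")` frozen, matching the example in the text.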

C. Selective Adaptation (Post-Training)

To handle domain shifts (e.g., moving from Boss Phase 1 to Phase 2), the framework employs Selective Fine-tuning:

Transferable Upstream Skills: Skills capturing phase-invariant mechanics (Camera, Lock-on, Movement) are kept frozen.
Adaptable Downstream Skills: Only the phase-sensitive skills (Dodging and Heal-Attack) are fine-tuned on the new domain data.
Goal: This allows the agent to adapt rapidly with a limited interaction budget without retraining the entire system or overwriting foundational skills.
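As a rough sketch of selective fine-tuning, the function below updates only the skills named as trainable and leaves every other skill's parameters untouched. The parameter lists, the `nudge` update rule, and the `budget` value are placeholders assumed for illustration; a real system would apply DQN updates to network weights.

```python
import copy
from typing import Callable, Dict, Iterable, List

Params = List[float]


def selective_finetune(
    params: Dict[str, Params],
    trainable: Iterable[str],
    budget: int,
    update: Callable[[str, Params], Params],
) -> Dict[str, Params]:
    """Return new parameters in which only `trainable` skills were updated."""
    tuned = copy.deepcopy(params)  # leave the Phase 1 agent untouched
    trainable = list(trainable)
    for _ in range(budget):        # limited interaction budget
        for name in trainable:
            tuned[name] = update(name, tuned[name])
    return tuned


# Hypothetical Phase 1 parameters: two weights per skill, all zero.
phase1 = {name: [0.0, 0.0] for name in ("C", "L", "M", "D", "H")}


def nudge(name: str, p: Params) -> Params:
    """Placeholder for one learning step on Phase 2 data."""
    return [w + 0.5 for w in p]


phase2 = selective_finetune(phase1, trainable=("D", "H"), budget=4, update=nudge)
```

Here `phase2["C"]`, `phase2["L"]`, and `phase2["M"]` are identical to their Phase 1 values, while `D` and `H` have moved.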

D. Implementation Details

Environment: Dark Souls III (Boss: Iudex Gundyr).
Interface: Process-memory readout (Cheat Engine) providing a compact 25-dimensional state vector (positions, health, stamina, animation states). No pixel-based vision is used.
Algorithm: Deep Q-Networks (DQN) are used for all skills to demonstrate that the architecture, not algorithmic sophistication, drives the results.
Rewards: Each skill has a specific reward function tailored to its narrow responsibility (e.g., minimizing camera-target angle for $C$, survival time for $D$, trade-off between damage dealt/taken for $H$).
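Since every skill is trained with DQN, each skill's update can be written with the standard DQN temporal-difference target. The notation below is assumed for illustration rather than taken from the paper: the subscript $k$ indexes the skill, $o_k$ is that skill's observation, $r_k$ its skill-specific reward, $\gamma$ the discount factor, and $\theta_k^-$ the target-network parameters.

$$
y_k = r_k + \gamma \max_{a'} Q_k(o_k', a'; \theta_k^-), \qquad
\mathcal{L}_k(\theta_k) = \mathbb{E}\left[\left(y_k - Q_k(o_k, a; \theta_k)\right)^2\right]
$$

The skill's greedy policy is then $\pi_k(o_k) = \arg\max_a Q_k(o_k, a; \theta_k)$. Only $r_k$ and $o_k$ differ between skills; the learning rule itself is the same throughout.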

3. Key Contributions

Formulation of Combat as a Directed Skill Graph: The authors successfully modeled complex real-time combat as a set of five modular, reusable skills linked by explicit dependencies.
Hierarchical Training Protocol: They demonstrated that training skills sequentially (freezing upstream components) significantly improves sample efficiency compared to training downstream skills in isolation or end-to-end.
Selective Post-Training Mechanism: The paper provides empirical evidence that under a domain shift (Phase 1 $\to$ Phase 2), freezing upstream skills and fine-tuning only a subset of downstream skills allows for rapid recovery of performance.
Ablation Studies: Controlled experiments replacing specific policies with random actions confirmed the critical role of downstream skills (Dodging and Heal-Attack) while showing upstream skills remain robust and transferable.

4. Experimental Results

The experiments were conducted on the Dark Souls III boss encounter, split into Phase 1 and Phase 2.

Sample Efficiency:
- The modular skill graph achieved a 44% win rate in Phase 1 with an interaction budget of ~230k steps.
- In contrast, a monolithic end-to-end DQN baseline failed to learn a reliable combat behavior even after extensive training, plateauing with a 0% win rate. The end-to-end agent collapsed into a poor survival heuristic (repeatedly dodging backward) without learning effective attack strategies.
Transfer and Adaptation (Phase 1 $\to$ Phase 2):
- Zero-Shot Transfer: Without any retraining, the Phase 1 agent achieved a 33.3% win rate in Phase 2 (mid-range start), demonstrating that upstream skills (Camera, Lock-on, Movement) are highly transferable.
- Selective Fine-tuning: By fine-tuning only the Dodging ($D$) and Heal-Attack ($H$) policies on Phase 2 data, the win rate increased from 33.3% to 52.0%, a gain of 18.7 percentage points. This was achieved with a limited interaction budget, indicating that adaptation can be localized to specific components.
Ablation Analysis:
- Randomizing downstream skills ($D$ or $H$) caused performance to drop drastically (to 0% or 4%), confirming their necessity for success.
- Randomizing upstream skills while keeping downstream skills trained resulted in poor performance, highlighting the dependency of downstream skills on stable upstream inputs.
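The ablation protocol (replacing one skill's policy with random actions while keeping the rest) can be sketched as follows. The `ablate` helper, the action count of 4, and the constant dummy policies are hypothetical; the sketch shows only the substitution mechanism and does not reproduce any reported numbers.

```python
import random
from typing import Callable, Dict, Sequence

Policy = Callable[[Sequence[float]], int]


def ablate(
    policies: Dict[str, Policy], skill: str, n_actions: int, seed: int = 0
) -> Dict[str, Policy]:
    """Return a copy of the agent with `skill` replaced by uniform random actions."""
    rng = random.Random(seed)
    ablated = dict(policies)  # shallow copy; the other skills are reused as-is
    ablated[skill] = lambda state: rng.randrange(n_actions)
    return ablated


# Dummy "trained" agent whose skills always pick action 0, for illustration.
trained = {name: (lambda state: 0) for name in ("C", "L", "M", "D", "H")}
no_dodge = ablate(trained, "D", n_actions=4)
samples = [no_dodge["D"]([0.0] * 25) for _ in range(100)]
```

Evaluating `no_dodge` against the boss and comparing its win rate with the intact agent isolates the contribution of the dodging skill.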

5. Significance

This work offers a practical pathway toward evolving, continually learning agents in complex, real-time environments.

Scalability: By decomposing control into narrow responsibilities, the approach mitigates the "curse of dimensionality" and sample inefficiency common in end-to-end RL.
Robustness to Shifts: The selective adaptation mechanism addresses the plasticity-stability dilemma by allowing agents to update only the parts of the system sensitive to environmental changes, preserving previously learned, transferable knowledge.
Generalizability: While tested on Dark Souls III, the framework of directed skill graphs and hierarchical curricula is applicable to other domains requiring modular control and lifelong learning, such as robotics and complex simulation environments.

In conclusion, the paper argues that structuring agents around skill dependencies is a superior strategy for lifelong learning compared to monolithic policies, enabling efficient learning, robust transfer, and rapid adaptation with minimal data.