Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

The paper introduces CrossHA, a unified agentic model that autonomously selects and switches between heterogeneous action spaces. It is trained with a pipeline that combines supervised fine-tuning with Multi-Turn Group Relative Policy Optimization, and the paper reports state-of-the-art performance and adaptability on dynamic, long-horizon tasks in the Minecraft environment.

Original authors: Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang

Published 2026-06-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are teaching a robot to play the video game Minecraft. In the past, researchers had to build separate "brains" for different parts of the game. One brain was good at walking around (using simple, step-by-step commands), another was good at clicking menus (using precise mouse movements), and a third was good at using high-level shortcuts (like "go to the nearest tree").

The problem was that these robots were rigid. If a task required walking and then clicking a menu, the robot would get confused or stuck because it was trained to only use one type of "language" at a time.

The Big Idea: The "Swiss Army Knife" Agent
This paper introduces CrossHA, a new kind of AI agent that carries a full set of tools, like a chef with a well-stocked kitchen. Instead of being forced to use only a spoon or only a knife, CrossHA learns to look at the ingredient (the task) and decide: "Do I need a rough chop (a simple movement) or a fine slice (a precise click) right now?"

It is described as the first model that can switch between different "action languages" on the fly, without a human telling it which one to use at every step.
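
To make the idea of several "action languages" concrete, here is a minimal sketch of how one output stream could carry all three levels. Everything in it is an assumption made for illustration: the class names, the `parse_action` helper, and the `raw` / `gui` / `skill` text format are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RawAction:
    """Low-level step, such as a single movement command (hypothetical)."""
    command: str


@dataclass(frozen=True)
class GuiAction:
    """Precise pointer action at screen coordinates (hypothetical)."""
    x: int
    y: int


@dataclass(frozen=True)
class SkillAction:
    """High-level shortcut, such as 'goto tree' (hypothetical)."""
    name: str
    target: str


Action = Union[RawAction, GuiAction, SkillAction]


def parse_action(text: str) -> Action:
    """Turn one line of (assumed) model output into a typed action."""
    parts = text.split()
    if not parts:
        raise ValueError("empty action")
    kind, args = parts[0], parts[1:]
    if kind == "raw" and len(args) == 1:
        return RawAction(args[0])
    if kind == "gui" and len(args) == 2:
        return GuiAction(int(args[0]), int(args[1]))
    if kind == "skill" and len(args) == 2:
        return SkillAction(args[0], args[1])
    raise ValueError(f"unrecognized action: {text!r}")


# The same output stream can mix all three levels from one step to the next.
plan = [parse_action(line) for line in ("skill goto tree", "raw attack", "gui 120 45")]
```

The point of the sketch is only that the agent, not the surrounding program, chooses which kind of action to emit at each step; the environment just has to accept any of them.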

How They Taught It (The Training Recipe)

The researchers didn't just dump data on the model; they used a three-step training pipeline, similar to how a human learns a complex skill:

  1. The "Tasting" Phase (Supervised Fine-Tuning):
    First, they showed the model thousands of examples of people playing the game using different methods (some walked, some clicked, some used shortcuts). The goal here was just to teach the model the "vocabulary" of all these different methods so it could understand them all. It was like teaching a student to read English, Spanish, and French, but not yet asking them to write a story.

  2. The "Sprint" Phase (Single-Turn RL):
    Next, they gave the model short, one-step challenges. If the model tried to solve a problem using the wrong "language" (e.g., trying to walk through a locked door instead of clicking the handle), it got a low score. If it picked the right tool for the job, it got a reward. This taught the model to make quick, smart choices about which tool to pick for a single moment.

  3. The "Marathon" Phase (Multi-Turn RL):
    Finally, they let the model play full, long games. The goal wasn't just to get one step right, but to finish the whole quest efficiently. The model learned that sometimes using a "fast but rough" method is better for walking across a field, but switching to a "slow but precise" method is necessary for crafting a delicate item. It learned to balance speed and accuracy over a long journey.
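
The "group relative" part of Group Relative Policy Optimization can be sketched in a few lines: the model attempts the same task several times, and each attempt is scored against the average of its own group. The sketch below shows only that normalization step. The `episode_return` scoring rule (a success bonus minus a small per-step cost) and the use of the population standard deviation are assumptions for illustration, not the paper's actual reward design or its multi-turn variant.

```python
from statistics import mean, pstdev


def episode_return(succeeded: bool, num_steps: int, step_cost: float = 0.01) -> float:
    """Hypothetical episode score: success bonus minus a small per-step cost."""
    return (1.0 if succeeded else 0.0) - step_cost * num_steps


def group_relative_advantages(returns, eps=1e-8):
    """Normalize each return against its own group: (r - mean) / (std + eps)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Four attempts at the same task: (succeeded, steps taken).
group = [(True, 20), (True, 60), (False, 80), (False, 80)]
returns = [episode_return(ok, steps) for ok, steps in group]
advantages = group_relative_advantages(returns)
```

Under these assumptions, the quick successful attempt receives the largest positive advantage, the slower success a smaller one, and the failed attempts negative ones, which matches the idea of rewarding runs that finish the whole quest efficiently.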

The Results: Why It Matters

The researchers tested this model on over 800 different tasks in Minecraft, ranging from chopping down trees to crafting complex items and fighting monsters.

  • The Old Way: Robots trained to only use one method (like only walking) were great at walking but terrible at crafting. They were like a person who only knows how to run but doesn't know how to open a door.
  • The CrossHA Way: The new model was the best at everything. It didn't just average out the scores; it excelled because it knew exactly when to switch gears.

The Analogy of the "Smart Driver":
Imagine driving a car.

  • A fixed-action robot is like a driver who only knows how to drive on a highway. If they hit a dirt road, they crash. If they hit a parking lot, they get lost.
  • CrossHA is like a skilled driver who knows when to shift into high gear for the highway (efficiency) and when to shift into low gear and steer carefully for a tight parking spot (precision). It doesn't need a co-pilot to tell it when to shift; it just feels the road and adapts.

The Bottom Line

The paper claims that by training one model to master all these different "action spaces" and letting it learn through reinforcement learning (trial and error with rewards), they created an agent that is significantly smarter, more flexible, and more efficient than previous models that were stuck using only one type of action.

They support this by showing that the model could handle over 800 different challenges in a complex, open-world game, outperforming the existing methods it was compared against. The code and models are now open for others to use.
