Original authors: Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang

Published 2026-06-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are teaching a robot to play the video game Minecraft. In the past, researchers had to build separate "brains" for different parts of the game. One brain was good at walking around (using simple, step-by-step commands), another was good at clicking menus (using precise mouse movements), and a third was good at using high-level shortcuts (like "go to the nearest tree").

The problem was that these robots were rigid. If a task required walking and then clicking a menu, the robot would get confused or stuck because it was trained to only use one type of "language" at a time.

The Big Idea: The "Swiss Army Knife" Agent
This paper introduces CrossHA, a new kind of AI agent that acts like a master chef with a full kitchen. Instead of being forced to use only a spoon or only a knife, CrossHA learns to look at the ingredient (the task) and instantly decide: "Do I need a rough chop (a simple movement) or a fine slice (a precise click) right now?"

It is the first model that can seamlessly switch between different "action languages" on the fly, without a human telling it which one to use at every step.

How They Taught It (The Training Recipe)

The researchers didn't just dump data on the model; they used a three-step training pipeline, similar to how a human learns a complex skill:

The "Tasting" Phase (Supervised Fine-Tuning):
First, they showed the model thousands of examples of people playing the game using different methods (some walked, some clicked, some used shortcuts). The goal here was just to teach the model the "vocabulary" of all these different methods so it could understand them all. It was like teaching a student to read English, Spanish, and French, but not yet asking them to write a story.
The "Sprint" Phase (Single-Turn RL):
Next, they gave the model short, one-step challenges. If the model tried to solve a problem using the wrong "language" (e.g., trying to walk through a locked door instead of clicking the handle), it got a low score. If it picked the right tool for the job, it got a reward. This taught the model to make quick, smart choices about which tool to pick for a single moment.
The "Marathon" Phase (Multi-Turn RL):
Finally, they let the model play full, long games. The goal wasn't just to get one step right, but to finish the whole quest efficiently. The model learned that sometimes using a "fast but rough" method is better for walking across a field, but switching to a "slow but precise" method is necessary for crafting a delicate item. It learned to balance speed and accuracy over a long journey.

The Results: Why It Matters

The researchers tested this model on over 800 different tasks in Minecraft, ranging from chopping down trees to crafting complex items and fighting monsters.

The Old Way: Robots trained to only use one method (like only walking) were great at walking but terrible at crafting. They were like a person who only knows how to run but doesn't know how to open a door.
The CrossHA Way: The new model scored highest in every task category. It didn't just average out the scores; it excelled because it knew exactly when to switch gears.

The Analogy of the "Smart Driver":
Imagine driving a car.

A fixed-action robot is like a driver who only knows how to drive on a highway. If they hit a dirt road, they crash. If they hit a parking lot, they get lost.
CrossHA is like a skilled driver who knows when to shift into high gear for the highway (efficiency) and when to shift into low gear and steer carefully for a tight parking spot (precision). It doesn't need a co-pilot to tell it when to shift; it just feels the road and adapts.

The Bottom Line

The paper claims that by training one model to master all these different "action spaces" and letting it learn through reinforcement learning (trial and error with rewards), they created an agent that is significantly smarter, more flexible, and more efficient than previous models that were stuck using only one type of action.

They support this by showing that the model handled over 800 different challenges in a complex, open-world game and outperformed the fixed-action baselines it was compared against. The code and models are now open for others to use.

Technical Summary: CrossHA – Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

1. Problem Statement

The paradigm of agentic AI is shifting from engineered workflows around pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) toward native agentic models developed through post-training. However, existing agents are typically confined to static, predefined action spaces (e.g., exclusively using APIs, GUI events, or robotic commands). This rigidity creates two fundamental limitations:

Brittleness: Specific action policies or translation layers often fail in dynamic environments (e.g., API functions blocked by CAPTCHAs, or robotic policies lacking precision).
Inflexibility: Manual assignment of action spaces to tasks restricts adaptability. Crucially, the optimal action space often varies not only across tasks but within a single task at the step level. For instance, a "Deep Research" agent might efficiently use search APIs for information gathering but require precise GUI-level manipulation to bypass a CAPTCHA.

Current approaches often rely on complex, brittle pipelines to orchestrate transitions between disjoint action spaces or merge heterogeneous trajectories without explicitly optimizing the selection mechanism. There is a lack of unified models that can autonomously select the most effective interface (ranging from high-level APIs to low-level primitives) for each step of a trajectory based on context.

2. Methodology: The CrossHA Framework

To address these limitations, the authors propose CrossHA, a unified agentic model trained to master multiple heterogeneous action spaces and autonomously select the appropriate interface for each step. The model is trained via a comprehensive three-stage pipeline integrating cold-start supervised fine-tuning (SFT) with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm.

2.1. Problem Formulation

The task is modeled as a Markov Decision Process (MDP) with a composite action space $\mathcal{A} = \bigcup_{x=1}^{N} \mathcal{A}_x$, where each subspace $\mathcal{A}_x$ corresponds to a distinct class of actions (e.g., low-level motor controls vs. high-level motion primitives). The objective is to maximize expected reward while penalizing computational or operational costs associated with different action granularities:
$J = \mathbb{E}\left[\sum_{t} (r_t - \lambda_x \text{cost}(a_t))\right]$
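As a rough illustration of the objective above, the sketch below evaluates the cost-penalized return for a single sampled trajectory. It assumes that $\lambda_x$ is a per-subspace weight applied to the subspace that $a_t$ belongs to; the `Step` type, the subspace names, the cost units, and the weight values are all hypothetical and are not taken from the paper. The expectation in $J$ would be approximated by averaging this quantity over many trajectories.

```python
from dataclasses import dataclass


@dataclass
class Step:
    reward: float  # r_t
    space: str     # name of the subspace A_x that a_t was drawn from (hypothetical labels)
    cost: float    # cost(a_t), in an assumed unit such as tokens or latency


def penalized_return(steps, lam):
    """Sum over t of r_t - lambda_x * cost(a_t), with x the subspace of a_t."""
    return sum(s.reward - lam[s.space] * s.cost for s in steps)


# Hypothetical per-subspace cost weights and a three-step trajectory.
lam = {"motion": 0.01, "grounding": 0.02}
traj = [
    Step(reward=0.0, space="motion", cost=10.0),
    Step(reward=0.0, space="grounding", cost=5.0),
    Step(reward=1.0, space="grounding", cost=5.0),
]
total = penalized_return(traj, lam)
```

With these made-up numbers each step pays a penalty of 0.1, so the final task reward of 1.0 is reduced to roughly 0.7.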

2.2. Training Pipeline

The training process consists of three progressive stages:

Stage 1: Mixed-Space Supervised Fine-Tuning (SFT)
The model is initialized on a balanced dataset comprising trajectories from multiple action subspaces. The goal is to construct a robust base model ($M_{mix}$) capable of decoding and generating valid actions across diverse modalities without modal interference. At this stage, the model learns syntax and semantics but does not yet autonomously select the optimal action space.
Stage 2: Single-Turn Reinforcement Learning (STRL)
This stage empowers the model to autonomously select the appropriate action space for immediate task contexts.
- Warm-up: A diversity-enhanced SFT phase generates candidate actions across all spaces, filtering for successful executions to create ground-truth annotations.
- Optimization: The model is fine-tuned using Group Relative Policy Optimization (GRPO). Unlike PPO, GRPO estimates the advantage based on the relative performance of a group of sampled outputs rather than a learned value function. The reward is action-space agnostic: credit is granted if the parsed raw action matches the ground truth, regardless of the surface form used. This encourages the model to select the most reliable action space for a given input.
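The two ideas in the optimization step, an action-space-agnostic reward and a group-relative advantage, can be sketched as follows. This is a minimal sketch under stated assumptions: the surface-form strings and the `to_raw` parser are invented for illustration, and the advantage uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation), which may differ in detail from the paper's variant.

```python
import statistics


def to_raw(action: str):
    """Hypothetical parser: map any surface form to a canonical raw action."""
    kind, _, arg = action.partition(":")
    # Assumed toy rule: a high-level "click_slot:3" and a low-level "mouse:3"
    # both resolve to the same raw click on slot 3.
    if kind in ("click_slot", "mouse"):
        return ("click", int(arg))
    return (kind, arg)


def agnostic_reward(sampled: str, ground_truth_raw) -> float:
    """Credit depends only on the parsed raw action, not on the surface form."""
    return 1.0 if to_raw(sampled) == ground_truth_raw else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampled group (no value network)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of sampled outputs for the same input (hypothetical strings).
group = ["click_slot:3", "mouse:3", "mouse:4", "move:forward"]
rewards = [agnostic_reward(a, ("click", 3)) for a in group]
advantages = group_relative_advantages(rewards)
```

Here the first two samples use different surface forms but parse to the same raw action, so both receive a positive advantage, while the two mismatching samples receive a negative one.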
Stage 3: Multi-Turn Reinforcement Learning (MTRL)
To optimize for long-horizon reasoning and trajectory efficiency, the model undergoes MTRL.
- Initialization: A self-training step relabels the dataset based on the STRL model's preferences, creating a strong prior for appropriate action-space selection ($M_{cs2}$).
- Trajectory Optimization: The model is fine-tuned on a curated task set using a binary episodic reward (success/failure) combined with a penalty for token length. This encourages the agent to balance task success with execution efficiency, preferring concise high-level actions when possible and switching to fine-grained atomic actions when necessary.
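The trajectory-level reward described above can be pictured as a binary success signal minus a length penalty. The sketch below is one plausible shape for it, not the paper's formula: the linear penalty, the coefficient `beta`, and the `max_tokens` cap are all assumptions chosen for illustration.

```python
def episode_reward(success: bool, num_tokens: int,
                   beta: float = 1e-4, max_tokens: int = 4096) -> float:
    """Binary episodic reward minus a penalty proportional to generated tokens.

    beta and max_tokens are hypothetical; the penalty is capped so that it
    never grows beyond beta * max_tokens.
    """
    penalty = beta * min(num_tokens, max_tokens)
    return (1.0 if success else 0.0) - penalty


# Two successful episodes: one mostly uses concise high-level actions,
# the other spells everything out with fine-grained actions.
short_success = episode_reward(True, 500)
long_success = episode_reward(True, 2000)
short_failure = episode_reward(False, 500)
```

Under these assumed settings a short successful episode scores higher than a long successful one, and any success scores higher than a failure, which matches the stated preference for concise actions when they suffice.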

3. Key Contributions

CrossHA Model: A unified agentic model capable of mastering heterogeneous action spaces and autonomously selecting the context-appropriate interface without relying on human-defined heuristics.
Comprehensive RL Pipeline: The introduction of a training framework utilizing Multi-Turn GRPO, enabling the agent to learn adaptive action switching within a single trajectory to maximize both task success and execution efficiency.
State-of-the-Art Performance: Demonstration of superior generalization and robustness in long-horizon reasoning tasks compared to static baselines.

4. Experimental Results

The authors evaluated CrossHA in the open-world Minecraft environment (version 1.16.5) using the OpenHA benchmark suite, which contains over 800 manually designed tasks in three categories:

Mine Blocks: Navigation and physical interaction.
Craft Items: Complex GUI interactions.
Kill Entities: Survival and combat.

Key Findings:

Performance: CrossHA achieved state-of-the-art performance across all categories. Notably, it reached a 94.7% Finished Tasks (FT) rate in "Mine Blocks" and 83.3% in "Craft Items," significantly outperforming fixed-action baselines.
Generalization: Despite being trained on only 30 tasks (10 from each category), CrossHA successfully generalized to over 800 evaluation tasks.
Ablation Studies:
- Mixed Action Spaces: CrossHA converged faster and reached higher asymptotic performance than single-space baselines (e.g., GroundingHA, MotionHA) during MTRL, demonstrating improved data efficiency.
- STRL Necessity: Including the STRL stage significantly enhanced training efficiency and final performance, particularly in Out-of-Distribution (OOD) tasks.
- Robustness: Single-space agents performed well on In-Distribution (ID) tasks but dropped sharply on OOD tasks, whereas CrossHA maintained a smaller generalization gap, suggesting that dynamic action-space selection mitigates overfitting.

5. Significance and Claims

The paper claims that CrossHA represents a significant step toward truly generalist agents. By treating action-space selection as a learnable component optimized via reinforcement learning, the model overcomes the rigidity of static designs. The results demonstrate that adaptive action-space selection yields superior generalization and robustness compared to static baselines.

The authors emphasize that the model learns to balance trade-offs dynamically: prioritizing high-level actions for efficiency when applicable, while employing fine-grained atomic actions for precise control when necessary. This capability allows the agent to handle complex scenarios requiring multi-modal interactions without manual intervention.

Future work, as noted by the authors, includes improving the efficiency of multi-turn RL and extending the framework to real-world robotics settings, where challenges such as safety and latency arise.
