NaviMaster: Learning a Unified Policy for GUI and… — Plain-Language Explanation

Imagine you have two very different friends who need help navigating the world.

Friend A (The GUI Agent) lives inside a smartphone. Their job is to tap buttons, scroll through menus, and type text on a screen to get things done (like booking a flight or searching for a recipe).
Friend B (The Embodied Agent) is a robot walking through a physical (or simulated) house. Their job is to walk down hallways, turn corners, and find objects (like "find the red chair" or "go to the kitchen").

Until now, these two friends were trained separately. They had different teachers, different rulebooks, and different ways of thinking. If you wanted a robot that could both navigate a house and use a computer, you had to build two separate brains and glue them together. It was expensive, inefficient, and the robot often got confused when switching between the two worlds.

Enter NaviMaster: The "Universal Navigator"

The researchers behind this paper created NaviMaster, a single, super-smart agent that can do both jobs with one brain. Think of NaviMaster not as two separate friends, but as a multitasking Swiss Army Knife that can switch between being a robot and a screen-tapper instantly.

Here is how they did it, explained with some simple analogies:

1. The "Universal Language" (Visual-Target Trajectory)

Imagine you are teaching a child to walk and to play a video game.

The Old Way: You teach them walking with one set of rules ("Step forward, turn left") and gaming with another ("Click the red button, scroll up").
The NaviMaster Way: The researchers realized that both tasks are actually the same thing deep down: "Look at where I am, decide where I want to go, and move there."

They created a new way to describe instructions called "Visual-Target Trajectories." Instead of saying "Click the blue button" (GUI) or "Walk forward" (Robot), they translate everything into a single language: "Pick a specific spot in the image in front of you (the target) and move your cursor or your feet to get there."

It's like translating both French and Spanish into a universal "Hand Gesture" language. Suddenly, the agent doesn't care if it's moving a mouse or a robot wheel; it just cares about moving from Point A to Point B.

2. The "Gym Teacher" (Unified Reinforcement Learning)

Once the agent speaks the universal language, it needs to learn how to be good at it.

The Old Way: The agent practiced only on phone screens, then only on robot maps. It got really good at one but terrible at the other.
The NaviMaster Way: They threw all the data into a giant blender. The agent practiced on phone screens and robot maps simultaneously.

This is like a gym teacher who makes an athlete run on a treadmill (screens) and climb a rock wall (robotics) in the same workout. The athlete learns that "balance" and "planning" are useful in both situations. This makes the agent much smarter and more adaptable when it encounters a new, strange environment it has never seen before.

3. The "GPS with a Compass" (Distance-Aware Reward)

When training an AI, you usually give it a "reward" (like a gold star) when it succeeds.

The Old Way: The reward was binary. You either got the gold star (perfect click) or nothing (wrong click). If you were almost right, you got nothing. This is like a GPS that only says "You are lost" or "You are there," with no "You are close" option. It's frustrating and slow to learn.
The NaviMaster Way: They introduced a Distance-Aware Reward. If the agent is close to the target, it gets a small reward. If it's very close, it gets a big reward.

Think of it like the classic game "Hot and Cold." Instead of just saying "Wrong," the teacher says, "You're getting warmer!" This gives the agent constant feedback, helping it learn much faster and more efficiently.

Why Does This Matter?

The results are impressive. Because NaviMaster learned from both worlds at the same time, it became a super-generalist.

When asked to navigate a new, weird app it has never seen, it does better than agents trained only on apps.
When asked to find an object in a new room, it does better than agents trained only on robots.

In a nutshell:
NaviMaster is, according to its authors, the first agent built on the idea that tapping a screen and walking through a room are two versions of the same thing: navigating to a goal. By training it through this unified lens, the researchers obtained a more robust and efficient navigation system that generalizes better to environments it has not seen before.

1. Problem Statement

Graphical User Interface (GUI) navigation and Embodied Navigation (in 3D physical/simulated environments) have historically evolved as separate domains with distinct datasets, action spaces, and training paradigms. This separation leads to four critical challenges:

Redundancy: Maintaining separate models for each task increases deployment costs and prevents synergistic learning.
Poor Generalization: Models trained on specific domains (e.g., only mobile apps or only 3D rooms) fail to generalize to out-of-domain (OOD) scenarios.
Training Inefficiency: Existing Reinforcement Fine-Tuning (RFT) approaches often rely on sparse binary rewards (success/failure), leading to inefficient learning and high variance.
Reasoning-Action Gap: Current models often generate correct reasoning thoughts but fail to execute the correct visual actions because their "understanding" is text-distilled rather than visually grounded.

The authors propose that both tasks can be unified under the Markov Decision Process (MDP) framework, where the agent must transform egocentric visual observations into an allocentric mental map to make decisions, regardless of whether the environment is a 2D screen or a 3D space.

2. Methodology: NaviMaster

NaviMaster is the first unified agent capable of handling both GUI and Embodied navigation within a single framework. It consists of three core components:

A. Visual-Target Trajectory Collection Pipeline

To unify the disparate action spaces of GUI (e.g., CLICK(x,y)) and Embodied (e.g., MOVEFORWARD) tasks, the authors propose a Visual-Target formulation:

Unified Action Space: They categorize actions into three types: Specific (e.g., BACK, STOP), View-shifting (e.g., SCROLL, TURN), and Localization.
Reformulating Localization: In GUI, localization is CLICK(x,y). In Embodied, it is traditionally MOVEFORWARD. NaviMaster reformulates Embodied localization to MOVETO(x,y), where $(x,y)$ represents a visual target point projected onto the current camera view. This allows both tasks to share the same action space definition.
Data Generation: They construct a pipeline to generate trajectories $\tau = \{I, (o_0, t_0, a_0), \dots, (o_n, t_n, a_n)\}$, where $I$ is the instruction and, at step $i$, $o_i$ is the observation, $t_i$ is a reasoning thought, and $a_i$ is the action.
- For Embodied tasks, they use A* search to find shortest paths in 3D environments (e.g., Matterport3D), project these 3D points onto 2D camera views, and generate corresponding MOVETO actions.
- They use GPT-4o to generate first-person reasoning thoughts ( $t_i$ ) explaining the rationale behind each action, enhancing the model's ability to learn from history.
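
The reformulation above can be illustrated with a minimal sketch, assuming a standard pinhole camera model. The `Action` class, the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the fallback to `TURN` when a waypoint is out of view are all illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Action:
    """Hypothetical unified action: a type plus an optional pixel target."""
    kind: str                                    # e.g. "CLICK", "MOVETO", "SCROLL", "TURN", "STOP"
    point: Optional[Tuple[float, float]] = None  # set only for localization actions


def project_to_view(p_cam, fx, fy, cx, cy, width, height):
    """Project a 3D point in camera coordinates (x right, y down, z forward)
    to pixel coordinates with a pinhole model.

    Returns None if the point is behind the camera or outside the image.
    """
    x, y, z = p_cam
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    if not (0 <= u < width and 0 <= v < height):
        return None
    return (u, v)


def waypoint_to_action(p_cam, fx, fy, cx, cy, width, height):
    """Turn the next path waypoint into a MOVETO(x, y) action.

    Assumption: if the waypoint is not visible, a view-shifting action is
    emitted instead so that a later frame can contain the target.
    """
    target = project_to_view(p_cam, fx, fy, cx, cy, width, height)
    if target is None:
        return Action("TURN")
    return Action("MOVETO", target)


# A waypoint 2 m ahead, 1 m to the right, 0.5 m below the optical axis.
action = waypoint_to_action((1.0, 0.5, 2.0), 100, 100, 320, 240, 640, 480)
```

With this representation, a GUI `CLICK(x, y)` and an embodied `MOVETO(x, y)` have the same shape: an action type plus a point in the current image.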

B. Unified Reinforcement Learning Framework

Instead of Supervised Fine-Tuning (SFT), NaviMaster employs Group Relative Policy Optimization (GRPO) for training.

Input: The model takes the user instruction, current observation, and historical context (previous thoughts and actions) to predict the next action.
Policy Optimization: The model is trained on a mix of GUI and Embodied data simultaneously. This forces the policy to learn generalizable structural representations (e.g., object permanence, spatial reasoning) rather than overfitting to task-specific correlations.
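
GRPO scores each sampled response relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, and the clipped policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled actions for one observation, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the normalization is computed per group, GUI and embodied samples can share a batch without their reward scales interfering with each other.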

C. Distance-Aware Dense Reward

To address the inefficiency of sparse rewards, NaviMaster introduces a dense reward function composed of three parts:

Format Reward ( $R_F$ ): Ensures the output follows the required JSON structure with reasoning tags.
Type Reward ( $R_T$ ): A binary reward checking if the predicted action type (e.g., CLICK vs. SCROLL) matches the ground truth.
Grounding Dense Reward ( $R_G$ ): Unlike binary rewards, this rewards the model based on the distance between the predicted point and the ground-truth target.
- Formula: $R_G = 1 - \frac{d_j}{\theta_d}$ if the distance $d_j$ is within a threshold $\theta_d$.
- This provides a graded learning signal even for "almost correct" actions, which the authors report improves training stability and convergence speed.
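
The three reward terms can be sketched as follows. Several details are assumptions made for illustration and are not stated in the summary: the distance is taken to be Euclidean pixel distance, $R_G$ is set to 0 beyond the threshold, the grounding term is only scored when the action type is correct, and the three terms are summed with equal weight.

```python
import math


def grounding_reward(pred, target, theta_d):
    """R_G = 1 - d / theta_d within the threshold, else 0 (assumed)."""
    d = math.dist(pred, target)
    if d > theta_d:
        return 0.0
    return 1.0 - d / theta_d


def total_reward(format_ok, pred_type, gt_type, pred_point, gt_point, theta_d):
    """Sum of format, type, and grounding rewards (equal weights assumed)."""
    r_f = 1.0 if format_ok else 0.0
    r_t = 1.0 if pred_type == gt_type else 0.0
    r_g = 0.0
    # Only localization actions (CLICK, MOVETO) carry a point to score.
    if r_t and pred_point is not None and gt_point is not None:
        r_g = grounding_reward(pred_point, gt_point, theta_d)
    return r_f + r_t + r_g


# A click 5 px from the target with a 50 px threshold: R_G = 1 - 5/50.
near_miss = grounding_reward((103, 104), (100, 100), 50)
```

Under a binary reward the near miss above would score the same as a click on the wrong side of the screen; here it keeps most of the grounding credit.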

3. Key Contributions

Unified Agent: NaviMaster is the first agent to jointly learn GUI and Embodied navigation in a single framework, eliminating the need for separate models.
Visual-Target Trajectory: A novel data collection pipeline that unifies action spaces by introducing explicit visual targets for embodied navigation, enabling joint training on mixed data.
Dense Reward Design: The introduction of a distance-aware dense reward mechanism that significantly enhances learning efficiency and spatial grounding capabilities compared to traditional sparse rewards.
Superior Generalization: The unified training strategy enables the model to achieve state-of-the-art performance on Out-of-Domain (OOD) benchmarks for both task types.

4. Experimental Results

The authors evaluated NaviMaster on extensive benchmarks, including:

GUI Benchmarks: AC-High/Low, AITW, OmniAct, etc.
Embodied Benchmarks: ObjectNav (Habitat), Spatial Affordance Prediction (RoboRefIt, Where2Place, etc.).

Key Findings:

OOD Generalization: NaviMaster significantly outperforms state-of-the-art baselines (including GPT-4o, OS-Atlas, and specialized RFT models) on OOD datasets. For example, on the AC-High benchmark, it achieved a 69.46% success rate (SR), surpassing the previous best by a notable margin.
Spatial Affordance: In spatial referring tasks (identifying objects or free space based on language), NaviMaster achieved the highest success rates across all tested datasets (e.g., 77.34% on RoboRefIt).
Embodied Navigation: On the ObjectNav benchmark, NaviMaster achieved an SR of 33.20% and an SPL (Success weighted by Path Length) of 12.60%, outperforming specialized embodied agents like RoboPoint.
Ablation Studies:
- Data Mixing: A 50:50 mix of GUI and Embodied data yielded the best performance, suggesting that cross-domain data enhances generalization.
- Reward Design: Models trained with the proposed dense reward consistently outperformed those trained with sparse rewards, showing faster convergence and higher final accuracy.
- Base Models: The improvements held true across different base models (Qwen2.5VL-3B, 7B), confirming the efficacy of the unified training strategy rather than just the base model strength.
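
For readers unfamiliar with the two ObjectNav metrics quoted above, the sketch below computes SR and SPL using the standard definition of SPL (each successful episode is weighted by the ratio of the shortest path length to the path the agent actually took). The function names and the episode tuple layout are illustrative assumptions, and the example episodes are made up rather than taken from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    episodes = list(episodes)
    if not episodes:
        return 0.0
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_len, agent_path_len).
    A successful episode contributes shortest / max(agent, shortest);
    a failed episode contributes 0.
    """
    episodes = list(episodes)
    if not episodes:
        return 0.0
    total = 0.0
    for success, shortest, taken in episodes:
        denom = max(taken, shortest)
        if success and denom > 0:
            total += shortest / denom
    return total / len(episodes)


# Toy episodes: one detour, one optimal path, one failure.
toy = [(True, 5.0, 10.0), (True, 4.0, 4.0), (False, 3.0, 9.0)]
toy_sr, toy_spl = success_rate(toy), spl(toy)
```

SPL can never exceed SR, which is why an agent that reaches the goal by a long route can post a much lower SPL than SR.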

5. Significance

NaviMaster makes the case that GUI and Embodied navigation can be treated as the same problem at the level of perception and decision-making, and supports it with a single policy trained on both.

Efficiency: It reduces the cost of developing separate agents for digital and physical worlds.
Generalization: Its results indicate that training on mixed, diverse data creates agents that are more robust to distribution shifts and better at spatial reasoning.
Scalability: The visual-target formulation and dense reward design provide a scalable blueprint for training general-purpose navigation agents that can interact with any environment (2D or 3D) using a single policy.

The code, data, and checkpoints are publicly available, fostering further research into unified embodied and digital agents.

NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks