Anticipatory Planning for Multimodal AI Agents

Imagine you are teaching a robot to navigate a busy city to get a cup of coffee.

The Old Way (Reactive Agents):
Most current AI agents are like a tourist who only looks at the street directly in front of their feet. They see a red light, they stop. They see a turn, they turn. They don't think about what happens after the turn. If they turn left, they might walk into a dead end three blocks later, but they won't realize it until they get there. They are "reactive": they only respond to the immediate moment. This works for simple tasks, but if you ask them to "get coffee, then pick up dry cleaning, and finally go to the bank," they often get lost, confused, or stuck in loops because they can't see the big picture.

The New Way (TraceR1):
The paper introduces TraceR1, a new way to train AI agents so that they act more like a strategic chess player or a seasoned tour guide. Instead of just looking at the next step, a TraceR1 agent is trained to "look ahead" and imagine the next few moves before making a single one.

Here is how it works, broken down into a simple story:

1. The "Mental Rehearsal" (Stage 1: Anticipatory Planning)

Imagine you are about to play a complex board game. Before you touch a piece, you close your eyes and run through a few scenarios in your head: "If I move here, my opponent might move there, and then I'll be stuck."

TraceR1 does exactly this. It doesn't just decide "Click the button." It predicts a whole movie of what will happen next:

Step 1: Click the button.
Step 2: A menu will pop up.
Step 3: I will click the "Settings" option.
Step 4: The font size will change.

It practices this "mental movie" over and over. If the movie ends in a dead end, it learns to change the plan before it actually does anything. This teaches the AI to understand cause and effect over time, not just in the current moment.

2. The "Reality Check" (Stage 2: Grounded Execution)

Sometimes, our mental movies are too optimistic. We might think, "I'll just jump over that puddle," but in reality, we might slip.

In the second stage, TraceR1 takes its plan and tries to execute the very first step in the real world (or a simulated world) using a "tool agent" (a helper robot that actually clicks the mouse).

The Planner (TraceR1): "I think clicking here will open the menu."
The Executor (Tool Agent): Clicks the mouse. "Oops, that actually opened a different window."

The system then says, "Okay, my prediction was wrong. I need to adjust my mental movie." It uses this real-world feedback to fine-tune its predictions. It's like a pilot who practices a landing in a simulator (Stage 1) and then checks the actual controls on the plane (Stage 2) to make sure the simulation matches reality.

Why is this a big deal?

Most AI today is like a hamster on a wheel—it runs fast and reacts to the wheel spinning, but it doesn't know where it's going.

TraceR1 is like a hiker with a map and a compass.

It looks at the map (the future trajectory) to see where the cliffs are.
It takes a step (execution).
It checks the terrain (feedback).
It adjusts the route.

The Results

The researchers tested this on seven different "cities" (benchmarks), including:

Computer tasks: Like changing settings on a phone or computer.
Tool tasks: Like analyzing a PDF or writing code.

The outcome? TraceR1 didn't just get better at following orders; it got better at not making mistakes. It stopped getting stuck in loops, stopped clicking the wrong buttons, and could handle complex, multi-step instructions (like "cancel my meeting, then email my boss") much better than previous models. It even performed as well as some expensive, proprietary systems owned by big tech companies, but it's open-source and free for others to use.

In a Nutshell

TraceR1 teaches AI to think before it acts. By combining "mental rehearsal" (planning the future) with "reality checks" (testing the first step), it creates an agent that is less likely to get lost and much better at solving complex, real-world problems. It's the difference between a robot that trips over its own feet and a robot that knows exactly where it's going.

1. Problem Statement

Current multimodal AI agents (capable of interacting with GUIs and using tools) are predominantly reactive. They determine the next action based solely on the current observation without reasoning about future states or long-term goals.

Limitations: This reactive approach leads to a lack of planning coherence, causing agents to fail in multi-step tasks where actions have delayed or compounding effects. They often diverge from the intended task because they cannot anticipate downstream consequences.
Existing Challenges:
- Model-Free RL: Often relies on sparse rewards or subgoals, failing to capture global consistency.
- Model-Based Planning: Requires constructing world models for visually rich, interactive environments, which is notoriously difficult and computationally expensive.
Goal: To develop a framework that enables multimodal agents to perform anticipatory reasoning—forecasting short-horizon trajectories before execution—while maintaining grounded execution accuracy.

2. Methodology: TraceR1

TraceR1 is a two-stage Reinforcement Learning (RL) framework designed to combine long-horizon trajectory reasoning with grounded execution refinement.

Stage 1: Anticipatory Trajectory Optimization

This stage focuses on learning global consistency and foresight.

Mechanism: The model performs trajectory-level RL on large-scale agent trajectories. Instead of optimizing single steps, it predicts a sequence of future actions (a trajectory) $\hat{\tau}_{t:T}$.
Reward Function: A discounted trajectory-level reward $R(\hat{\tau}, \tau^*)$ is used to align the predicted trajectory $\hat{\tau}$ with a ground-truth reference trajectory $\tau^*$.
- Alignment Reward ($r_{align}$): Measures similarity between predicted and reference action types (e.g., GUI clicks, tool calls).
- Repetition Penalty ($r_{rep}$): Penalizes cyclic or redundant actions to prevent "reward hacking" (e.g., clicking the same button repeatedly).
- Temporal Discount ($\gamma$): Prioritizes near-future correctness while maintaining long-term coherence.
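The summary above names the three reward ingredients but not how they are combined. The following is a minimal sketch, assuming a per-step reward of alignment minus a weighted repetition penalty, summed with weights $\gamma^k$. The `Action` encoding, the binary alignment score, and the `gamma` and `rep_weight` values are all illustrative assumptions, not the paper's formula.

```python
from typing import Sequence, Tuple

# Hypothetical action encoding for this sketch: (action_type, target).
Action = Tuple[str, str]


def trajectory_reward(
    predicted: Sequence[Action],
    reference: Sequence[Action],
    gamma: float = 0.9,       # assumed discount factor
    rep_weight: float = 0.5,  # assumed weight on the repetition penalty
) -> float:
    """Discounted sum over steps of (alignment - rep_weight * repetition)."""
    total = 0.0
    seen = set()
    for k, action in enumerate(predicted):
        # Alignment: 1 if the predicted action type matches the reference at step k.
        in_range = k < len(reference)
        r_align = 1.0 if in_range and action[0] == reference[k][0] else 0.0
        # Repetition: 1 if this exact action already appeared earlier in the plan.
        r_rep = 1.0 if action in seen else 0.0
        seen.add(action)
        # Earlier steps get larger weights, so near-future correctness dominates.
        total += (gamma ** k) * (r_align - rep_weight * r_rep)
    return total


reference = [("click", "menu"), ("click", "settings"), ("type", "font_size")]
looping = [("click", "menu")] * 3  # a plan that keeps clicking the same button

print(trajectory_reward(reference, reference))
print(trajectory_reward(looping, reference))
```

Under these assumptions the looping plan scores lower than the reference plan even though its first two action types match, which is the behavior the repetition penalty is meant to produce.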
Optimization: Uses Group Relative Policy Optimization (GRPO) to update the policy, encouraging the model to reason several steps ahead before acting.
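GRPO scores each sampled output relative to the other samples drawn for the same prompt, rather than using a learned value function. A minimal sketch of that group-relative normalization follows. The function name and the use of the population standard deviation are choices made for this example, and the policy-gradient update itself (clipping, KL term) is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical trajectory-level rewards for four plans sampled from one state.
group_rewards = [2.71, 1.045, 1.9, 0.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Plans that score above the group mean receive positive advantages and are reinforced; plans below the mean are discouraged.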

Stage 2: Grounded Reinforcement Fine-tuning

This stage focuses on step-level accuracy and executability.

Mechanism: The model is fine-tuned using feedback from frozen tool agents (e.g., GUI executors or tool-callers).
Process:
1. The model predicts a trajectory.
2. Only the first predicted step is executed by the tool agent.
3. The tool agent returns execution feedback (e.g., coordinate accuracy, answer correctness).
Reward Function: A grounded reward ($r_t^G$) is computed based on the actual execution outcome:
- GUI Grounding: Rewards coordinate matching (did the click hit the target?).
- Tool Calling: Rewards answer matching or successful code execution.
Goal: To refine the model's precision, ensuring that the anticipated plans are actually feasible within the environment.
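The two grounded reward types can be illustrated with a short sketch. It assumes binary rewards, a bounding-box hit test for GUI clicks, and a case-insensitive exact match for tool answers. The `BoundingBox` type and both function names are hypothetical, and the scoring used in the paper may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BoundingBox:
    """Hypothetical target region of a UI element, in screen pixels."""
    left: float
    top: float
    right: float
    bottom: float


def gui_grounding_reward(click_xy: Tuple[float, float], target: BoundingBox) -> float:
    """Return 1.0 if the executed click lands inside the target element, else 0.0."""
    x, y = click_xy
    hit = target.left <= x <= target.right and target.top <= y <= target.bottom
    return 1.0 if hit else 0.0


def tool_answer_reward(predicted: str, expected: str) -> float:
    """Return 1.0 if the tool's answer matches the expected answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


settings_button = BoundingBox(left=40, top=10, right=60, bottom=30)
print(gui_grounding_reward((50, 20), settings_button))
print(gui_grounding_reward((5, 5), settings_button))
print(tool_answer_reward(" Paris ", "paris"))
```

Only the first step of each predicted trajectory would be scored this way, since that is the only step the frozen tool agent executes.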

Inference Strategy

During inference, TraceR1 operates in a Plan–Act Loop:

1. Given the current state, it predicts a multi-step future trajectory.
2. It executes only the first action via the tool agent.
3. It receives updated environmental feedback.
4. It replans for the next step.
This iterative mechanism allows the agent to maintain foresight while correcting for execution errors in real time.
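The loop above can be sketched in a few lines. The `Planner` and `Executor` signatures, the string-based observations, and the toy planner and executor are all hypothetical stand-ins for the trained model and the tool agent. An empty plan is assumed to signal that the task is finished.

```python
from typing import Callable, List

# Hypothetical interfaces for this sketch.
Planner = Callable[[str, str], List[str]]  # (goal, observation) -> predicted actions
Executor = Callable[[str], str]            # action -> new observation


def plan_act_loop(
    goal: str,
    observation: str,
    planner: Planner,
    executor: Executor,
    max_steps: int = 10,
) -> List[str]:
    """Plan several steps ahead, execute only the first, observe, then replan."""
    executed: List[str] = []
    for _ in range(max_steps):
        trajectory = planner(goal, observation)  # multi-step prediction
        if not trajectory:                       # assumed "task done" signal
            break
        first_action = trajectory[0]
        observation = executor(first_action)     # feedback from the environment
        executed.append(first_action)
    return executed


STEPS = ["open_menu", "click_settings", "set_font_size"]


def toy_planner(goal: str, observation: str) -> List[str]:
    # The toy observation is the number of completed steps; plan the rest.
    return STEPS[int(observation):]


def make_toy_executor() -> Executor:
    state = {"done": 0}

    def execute(action: str) -> str:
        state["done"] += 1
        return str(state["done"])

    return execute


result = plan_act_loop("change font size", "0", toy_planner, make_toy_executor())
print(result)
```

Discarding the rest of each predicted trajectory after one step keeps the plan as context for choosing the next action without committing the agent to predictions that the environment may have invalidated.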

3. Key Contributions

TraceR1 Framework: A unified framework that explicitly trains multimodal agents to forecast trajectories of future actions, enabling long-horizon reasoning beyond reactive decision-making.
Two-Stage RL Paradigm: A novel training pipeline that first learns globally coherent plans via trajectory-level optimization and then refines them using grounded, executable feedback. This bridges the gap between high-level reasoning and low-level precision.
Comprehensive Evaluation: Extensive testing across 7 benchmarks (covering online/offline GUI tasks and multimodal tool-use reasoning), demonstrating that anticipatory trajectory reasoning is a key principle for robust agent performance.

4. Experimental Results

TraceR1 was evaluated on benchmarks including AndroidWorld, OSWorld-Verified, GUI-Odyssey, GAIA, and GTA.

GUI Benchmarks (Online & Offline):
- TraceR1 significantly outperforms open-source baselines and approaches the performance of proprietary systems (e.g., GPT-4.1, Claude 4.5).
- On OSWorld-Verified, it improved the success rate of the base UI-TARS-1.5-7B model from 27.4% to 30.9% and Qwen3-VL-32B from 35.6% to 41.2%.
- On AndroidWorld, it achieved 64.8% success rate (vs. 61.4% for the base model), setting a new state-of-the-art for open-source models.
- On AndroidControl-High, it achieved a 75.3% step success rate, outperforming R1-style models by over 40%.
Tool-Use Benchmarks:
- GAIA: TraceR1 achieved 40.2% answer accuracy, outperforming GPT-4o (33.4%) and all other open-source models. It showed a +8.7% improvement over the base Qwen3-VL-8B model.
- GTA: Demonstrated exceptional tool-execution behavior with high ToolAcc (65.7%) and CodeExec (87.4%), confirming the effectiveness of training with tool-usage trajectories.
Ablation Studies:
- Stage 2 Importance: Removing the grounded fine-tuning stage resulted in a ~6% performance drop, indicating that execution feedback is critical for stabilizing long-horizon plans.
- Horizon Length: A moderate predictive horizon ($T \approx 5$ to $10$) yields the best results; excessively long horizons ($T > 10$) lead to unstable rewards due to accumulated uncertainty.
- Reward Components: Both the repetition penalty and temporal discount factor are crucial; removing them leads to reward-hacking behaviors and planning instability.

5. Significance

Paradigm Shift: The paper argues that anticipatory trajectory reasoning is a fundamental requirement for building agents that can operate effectively in complex, real-world environments. It moves the field from reactive "step-by-step" execution to proactive "look-ahead" planning.
Scalability: TraceR1 provides a scalable recipe for training open-source models to rival proprietary systems in planning and reasoning capabilities without relying on expensive world models.
Generalization: The framework successfully bridges diverse interaction modalities (GUIs and general tool-use), suggesting a unified approach for future agentic systems that require both high-level foresight and low-level precision.
Future Directions: The authors suggest extending this paradigm to hierarchical planning and embodied agents, where memory and internal state updates are coupled with trajectory prediction to handle even longer time scales.