PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

Imagine you have a digital assistant living in your phone or computer. Right now, most of these assistants are like very obedient but slightly slow waiters. If you want a coffee, you have to walk up, look them in the eye, and say, "I would like a coffee, please." If you don't speak, they just stand there doing nothing, even if they can see you looking at your watch and yawning, clearly needing caffeine. They are reactive: they wait for your command.

This paper, PIRA-Bench, is about teaching these assistants to be proactive, like a mind-reading butler.

Here is the breakdown of the paper in simple terms:

1. The Problem: The "Obedient Waiter" vs. The "Mind-Reading Butler"

Currently, AI agents are great at following instructions. If you say, "Book a table at that Italian place," they can do it. But real life is messy.

The Mess: You might be chatting with a friend about dinner, then switch to checking your bank account, then scroll through a news app for no reason, then go back to the chat.
The Failure: A standard AI gets confused by this "noise." It might think you want to buy a house just because you looked at a real estate app for three seconds, or it might get so eager to help that it starts booking tables when you were just looking at a menu for fun. It lacks restraint.

The goal of this paper is to build an agent that watches your screen, ignores the boring stuff (like scrolling), figures out what you actually want to do next, and suggests it before you even ask.

2. The Solution: PIRA-Bench (The Test)

To teach AI to be a "mind-reading butler," you need a tough test. The authors created PIRA-Bench, a carefully curated dataset of 100 real-life scenarios.

The Setup: Imagine recording someone's screen, one screenshot at a time, as they go about using their phone or computer.
The Twist: These recordings are full of distractions. Sometimes the user is just playing with their phone, sometimes they are multitasking (chatting about dinner while studying for a test).
The Profiles: The test also includes different "user personalities." If the user is a billionaire, the AI should suggest buying a luxury apartment. If the user is a student, it should suggest renting a cheap room.
The Challenge: The AI has to look at this messy stream of images, ignore the "noise," figure out the user's hidden goals, and make a suggestion that fits their personality.

3. The New Tool: PIRF (The Brain Upgrade)

The authors didn't just make a test; they built a new way for AI to think, called PIRF. Think of this as giving the AI a notebook and a memory.

The Notebook (Memory): Instead of looking at one screen and forgetting the rest, the AI keeps a running list of "threads."
- Thread A: "User is planning a dinner."
- Thread B: "User is studying."
- Thread C: "User is just scrolling aimlessly."
The Filter (Reflection): This is the most important part. The AI constantly asks itself: "Is this screen actually part of a plan, or is the user just bored?"
- If the user is just scrolling randomly, the AI says, "IDLE." It does nothing. This prevents it from annoying you with bad suggestions.
- If the user switches back to the dinner chat, the AI says, "RESUME," and picks up the thread where it left off.

4. The Results: "Trigger-Happy" vs. "Wise"

The authors tested the smartest AI models on this new test.

The Old Way (Naive): The AI was like a trigger-happy guard dog. It barked at everything. It guessed the right answer often (high "Recall"), but it also barked at the mailman and the wind (high "Hallucinations"). It was too eager to help, which made it annoying.
The New Way (PIRF): With the new "notebook and filter," the AI became wise. It still guessed the right answers, but it learned to stay quiet when there was no real intent. It stopped making up fake tasks.
The Gap: Even with the upgrade, the AI is still far from a human. A human can look at a screen and know, "Oh, they are just bored, I won't say anything." The AI is still learning that skill.

The Big Takeaway

This paper says: Being smart isn't just about knowing what to do; it's about knowing when not to do anything.

To build a truly helpful AI assistant, we need to stop teaching them to just follow orders and start teaching them to watch, wait, and understand the messy, noisy reality of human life. PIRA-Bench is the new gym where these assistants go to learn how to be good butlers instead of just obedient robots.

1. Problem Statement

Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm. They function as passive executors that require explicit, detailed natural language instructions from users to perform tasks. This approach imposes a significant cognitive burden on users, who must formulate precise commands (e.g., specific times, locations, or names) and often interrupt their workflow to do so.

The paper argues that a true "intelligent AI assistant" should be proactive, capable of:

Anticipating user intentions directly from continuous visual inputs (screenshots).
Offering timely recommendations without explicit prompting.
Handling real-world complexities such as multithreaded task-switching, noisy browsing (idle scrolling, random app switching), and user profile dependencies.

Existing benchmarks focus on instruction-following accuracy and lack the infrastructure to evaluate an agent's ability to infer latent, future goals from passive, continuous visual streams.

2. Methodology

The authors propose a comprehensive solution consisting of a new benchmark, a formal task definition, and a baseline framework.

A. The PIR Task (Proactive Intent Recommendation)

The paper defines the Proactive Intent Recommendation (PIR) task as learning a mapping function $f_\theta$ that predicts a set of future, actionable intents ( $I^*$ ) based on:

Trajectory ( $T$ ): A sequence of $N$ sequential GUI screenshots captured passively.
User Profile ( $P$ ): Encapsulating socio-economic status, preferences, and constraints.
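Goal: Maximize the conditional probability $P_\theta(I \mid T, P)$, i.e. $I^* = \arg\max_I P_\theta(I \mid T, P)$, where $P_\theta$ is the model's probability and $P$ is the user profile. The predicted intents $I^*$ are those the user is likely to execute next, even before they type a command.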
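The inputs and outputs of the PIR task can be pictured as a small interface. The following is a minimal sketch under stated assumptions, not the authors' code: the class names, the profile fields, and the `always_silent` placeholder are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class UserProfile:
    # Hypothetical free-text fields; the real profile schema is not given here.
    status: str
    preferences: str
    constraints: str


@dataclass(frozen=True)
class Trajectory:
    # Paths to the N passively captured screenshots, in temporal order.
    screenshots: Sequence[str]


# A recommender maps (T, P) to a list of predicted intents.
# An empty list means "no intent", which is the right answer on pure noise.
Recommender = Callable[[Trajectory, UserProfile], List[str]]


def always_silent(trajectory: Trajectory, profile: UserProfile) -> List[str]:
    """Placeholder recommender that never proposes an intent."""
    return []
```

Any real model would replace `always_silent` with a function of the same shape that inspects the screenshots and the profile.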

B. PIRA-Bench (The Benchmark)

To evaluate PIR capabilities, the authors constructed PIRA-Bench, the first dataset specifically designed for this paradigm.

Composition: 100 meticulously curated GUI trajectories (mobile and desktop), averaging 32 screenshots each.
User Profiles: Each trajectory is paired with 3 distinct user profiles to test personalization.
Key Challenges:
1. Interleaved Intents: Trajectories contain multiple concurrent tasks (e.g., switching between studying and planning a meal) that agents must disentangle.
2. Profile-Dependent Prediction: Agents must use the user profile to resolve ambiguity (e.g., recommending a luxury apartment for a wealthy user vs. a budget rental for a student).
3. Noise Rejection: Trajectories include "pure noise" segments (idle scrolling, random browsing) where the correct action is to propose no intent (negative samples).
Evaluation Metrics:
- $F1_{avg}$ : Average F1 score for trajectories with valid intents.
- $FPS_{norm}$ (Normalized False Positive Score): Measures robustness against hallucinations on noise-only trajectories.
- $S_{final}$ : The product of $F1_{avg}$ and $FPS_{norm}$ , serving as the unified reliability score.
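A rough sketch of how these metrics combine is given below. It rests on assumptions that are not stated in this summary: intents are matched by exact string equality (the benchmark may use a softer semantic match), both $F1_{avg}$ and $FPS_{norm}$ are expressed as percentages, and the product is divided by 100 so that $S_{final}$ stays on a 0 to 100 scale. $FPS_{norm}$ is taken as a given input because its exact formula is not described here.

```python
from typing import Iterable, Set, Tuple


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """F1 between predicted and reference intents (exact match is an assumption)."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def f1_avg(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Mean F1, as a percentage, over trajectories that contain valid intents."""
    scores = [f1_score(pred, ref) for pred, ref in pairs]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


def s_final(f1_avg_pct: float, fps_norm_pct: float) -> float:
    """Product of the two percentages, rescaled to 0-100 (assumed scaling)."""
    return f1_avg_pct * fps_norm_pct / 100.0
```

Because the score is a product, a model that recalls every intent but also fires on every noise trajectory is pulled toward zero, which is the behavior the benchmark is designed to penalize.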

C. PIRF (Proactive Intent Recommendation Framework)

To establish a baseline, the authors propose PIRF, a memory-aware, state-tracking architecture designed to wrap general Multimodal Large Language Models (MLLMs).

Dynamic Memory Module: Maintains a list of intent "threads" (the currently active one plus any suspended ones) and anchors the static user profile. It uses a sliding window over recent frames ( $K=10$ ) to manage context length.
State Transition Action Space: The model outputs structured actions at each step:
- CREATE: Initiate a new task thread.
- RESUME: Switch back to a previously suspended task.
- UPDATE: Refine the current active intent.
- IDLE: Explicitly reject the current frame as noise (critical for preventing hallucinations).
Reflection & Auto-Deletion: A mechanism where the model continuously evaluates memory. If visual evidence suggests an intent is abandoned or completed, the framework automatically deletes the corresponding thread to prevent memory bloat and confusion.
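The memory and action space described above can be sketched as a small state tracker. This is an illustrative sketch only, not the PIRF implementation: the `PirfMemory` class, its field names, and the integer thread ids are hypothetical, and the MLLM call that would choose each action and trigger deletion is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional

WINDOW_SIZE = 10  # K = 10 recent frames, as stated above


@dataclass
class PirfMemory:
    profile: str  # static user profile, kept for the whole trajectory
    threads: Dict[int, str] = field(default_factory=dict)  # thread id -> intent
    active: Optional[int] = None  # id of the thread currently in focus
    recent_frames: Deque[str] = field(
        default_factory=lambda: deque(maxlen=WINDOW_SIZE)
    )
    next_id: int = 0

    def observe(self, frame: str) -> None:
        """Add a frame; the deque drops the oldest one beyond K frames."""
        self.recent_frames.append(frame)

    def apply(self, action: str, intent: str = "",
              thread_id: Optional[int] = None) -> None:
        """Apply one structured action emitted by the model."""
        if action == "CREATE":
            self.threads[self.next_id] = intent
            self.active = self.next_id
            self.next_id += 1
        elif action == "RESUME":
            if thread_id not in self.threads:
                raise KeyError(f"unknown thread: {thread_id}")
            self.active = thread_id
        elif action == "UPDATE":
            if self.active is None:
                raise ValueError("no active thread to update")
            self.threads[self.active] = intent
        elif action == "IDLE":
            pass  # frame is treated as noise; memory stays unchanged
        else:
            raise ValueError(f"unknown action: {action}")

    def delete(self, thread_id: int) -> None:
        """Drop a thread judged abandoned or completed during reflection."""
        self.threads.pop(thread_id, None)
        if self.active == thread_id:
            self.active = None
```

In a full loop, each new screenshot would be passed to `observe`, the wrapped model would pick one action given the memory contents, and a reflection step would call `delete` for threads it considers finished.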

3. Key Contributions

Task Definition: Introduced the Proactive Intent Recommendation (PIR) task, shifting the focus from reactive instruction-following to forward-looking intent anticipation.
PIRA-Bench: Created a novel benchmark with 100 real-world trajectories featuring interleaved multitasking, user profile context, and intentional noise to rigorously test disentanglement and filtering capabilities.
PIRF Framework: Proposed a baseline architecture that equips general MLLMs with iterative processing, dynamic memory, and reflection-based auto-deletion, significantly reducing hallucinations in noisy environments.

4. Experimental Results

The authors evaluated four state-of-the-art MLLMs (Gemini-3.1-Pro, GPT-5.2, Qwen3.5-Plus, Seed-1.8) using both a Naive Baseline (sliding context only) and the PIRF framework.

Naive Baseline Performance: Models suffered from "over-proactivity." For instance, GPT-5.2 achieved high recall (83.37%) but critically low precision (31.95%) and poor noise robustness, leading to a low final score ( $S_{final} = 12.76$ ). It tended to hallucinate intents during idle periods.
PIRF Performance: The framework significantly improved all models.
- GPT-5.2: Precision improved by ~18 points, and $S_{final}$ nearly doubled (12.76 $\to$ 24.00).
- Seed-1.8: Achieved the highest overall score ( $S_{final} = 28.05$ ) due to superior noise rejection ( $FPS_{norm} = 50.36$ ), demonstrating that "operational restraint" is as vital as reasoning capability.
Human vs. AI: Human evaluators achieved an $S_{final}$ of 90.35, far surpassing the best model (28.05). The gap is primarily driven by precision and noise robustness; humans rarely hallucinate intents during noise, whereas even top models struggle with false positives.
Ablation Study: Removing noise from trajectories ("Clean" vs. "Noised") revealed that current models have high precision in idealized settings (e.g., GPT-5.2 Precision $\approx$ 92%) but degrade sharply in noisy environments (Precision $\approx$ 50%). This indicates that noisy, intent-free segments, rather than a lack of reasoning ability, are the main driver of hallucinated intents.
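The headline comparisons can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the 18-point precision gain is the approximate value reported, so the resulting precision is an estimate.

```python
# Reported S_final values for GPT-5.2 (naive vs. PIRF), humans, and the best model.
naive_s_final = 12.76
pirf_s_final = 24.00
human_s_final = 90.35
best_model_s_final = 28.05

# Reported naive precision for GPT-5.2 and the approximate gain under PIRF.
naive_precision = 31.95
approx_precision_gain = 18.0

# "Nearly doubled": ratio of PIRF to naive S_final.
improvement_ratio = pirf_s_final / naive_s_final

# Size of the human-to-model gap as a ratio.
human_gap_ratio = human_s_final / best_model_s_final

# Estimated GPT-5.2 precision under PIRF, close to the ~50% quoted above.
estimated_pirf_precision = naive_precision + approx_precision_gain

print(f"PIRF / naive S_final: {improvement_ratio:.2f}x")
print(f"Human / best model S_final: {human_gap_ratio:.2f}x")
print(f"Estimated PIRF precision: {estimated_pirf_precision:.2f}%")
```

The ratios make the two claims concrete: the framework brings GPT-5.2 close to twice its naive score, while humans still score more than three times higher than the best model.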

5. Significance

Paradigm Shift: The paper marks a critical transition in GUI automation from "instruction-following" to "intent-anticipation," a necessary step for creating truly autonomous personal assistants.
Robustness over Sensitivity: The results highlight that for proactive agents, operational restraint (knowing when not to act) is at least as critical as raw sensitivity. A helpful assistant must avoid flooding the user with hallucinated suggestions during idle moments.
Future Direction: The work establishes that structured state tracking and self-reflection mechanisms are viable pathways to mitigate hallucinations. It sets a rigorous standard for future research, emphasizing that the next generation of agents must be not only smarter but also more discerning about when to intervene.