Learning Next Action Predictors from Human-Computer Interaction

This paper introduces LongNAP, a user model that reasons over a user's full interaction history to predict their next action. Built on a large-scale dataset of 360K annotated multimodal interactions and a hybrid parametric-in-context learning approach, it significantly outperforms existing baselines.

Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang

Published Mon, 09 Ma

Imagine you have a digital assistant that doesn't just wait for you to ask for help, but actually knows what you're going to do before you even think about it. It's like having a friend who knows you so well that when you pick up your phone, they've already handed you the coffee you were about to reach for, or opened the document you were about to type in.

This paper introduces a system called LongNAP (Long-context Next Action Predictor) that tries to build this kind of "mind-reading" AI. Here's how it works, broken down into simple concepts:

1. The Problem: AI is Too "Short-Sighted"

Right now, most AI models are like people looking through a keyhole. They only see what you type into a chat box (your prompt). They don't know what you were doing five minutes ago, what you were looking at on your screen, or what your habits are. They are reactive, waiting for you to speak.

The authors want to build proactive AI. They want an AI that watches your entire digital life—your clicks, your screen changes, your scrolling—and says, "Ah, based on what I've seen you do for the last month, you're probably about to check your email and then message your co-author."

2. The Data Challenge: The "Passive Observer"

To teach an AI to do this, you need a massive amount of data. But asking people to write down every single thing they do on their computer for a month is impossible (and boring).

The Solution: NAPsack
The team built a tool called NAPsack. Think of NAPsack as a silent, invisible camera that runs in the background of your phone or computer.

  • It doesn't ask you to do anything.
  • It just takes screenshots and records your clicks.
  • Then, it uses a smart AI (a Vision-Language Model) to look at those screenshots and say, "Okay, the user just clicked 'Downloads,' then opened a PDF. That's an action."
  • They collected over 360,000 actions from 20 people over a month. That's 1,800 hours of screen time!
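The pipeline above can be sketched in a few lines. This is a minimal illustration, not NAPsack's actual code: the `Action` record fields and the `label_actions` helper are assumptions, and the VLM captioning step is replaced by captions that are simply given as input.

```python
from dataclasses import dataclass

@dataclass
class Action:
    timestamp: float   # seconds since the session started
    app: str           # foreground application at the time
    description: str   # VLM-generated label, e.g. "opened a PDF from Downloads"

def label_actions(events):
    """Turn raw (timestamp, app, caption) tuples into time-ordered Action records.

    In the real pipeline, a Vision-Language Model produces the caption from
    screenshots; here the captions are assumed to be provided already.
    """
    return [Action(t, app, caption) for t, app, caption in sorted(events)]

log = label_actions([
    (14.5, "Preview", "opened a PDF"),
    (12.0, "Finder", "clicked the 'Downloads' folder"),
])
```

Over a month of passive logging, records like these accumulate into the 360K-action dataset without the user ever writing anything down.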

3. The Model: The "Librarian" vs. The "Amnesiac"

Here is the tricky part: You can't feed 1,800 hours of data into a standard AI at once. It's like trying to read an entire library's worth of books in one second. The AI would get confused and forget things.

The Solution: LongNAP (The Smart Librarian)
Instead of trying to memorize everything in its brain (which is slow and rigid), LongNAP acts like a super-smart librarian.

  • Phase 1: Reasoning to Retrieve. When you are doing something now, LongNAP thinks, "Hmm, this looks like a time when the user was stressed about a deadline." It then runs to its memory library and pulls out a specific note from three weeks ago: "Last time the user was stressed, they messaged their co-author to divide the work."
  • Phase 2: Reasoning to Predict. It takes that old note, combines it with what you are doing right now, and predicts: "You are going to message your co-author."

It's not just guessing; it's retrieving past patterns to make a smart guess about the future.
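The two phases can be sketched with a toy retriever. In LongNAP the retrieval reasoning happens inside the language model itself; the word-overlap scoring and the `context`/`next_action` fields below are illustrative stand-ins, not the paper's method.

```python
def retrieve(memory, current, k=1):
    """Phase 1 (toy version): rank past episodes by word overlap with the
    current context, standing in for the model's reasoning-to-retrieve step."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    return sorted(memory, key=lambda ep: overlap(ep["context"], current),
                  reverse=True)[:k]

def predict(memory, current):
    """Phase 2: predict the next action suggested by the best-matching episode."""
    best = retrieve(memory, current, k=1)
    return best[0]["next_action"] if best else "unknown"

memory = [
    {"context": "stressed about paper deadline", "next_action": "message co-author"},
    {"context": "browsing recipes on Sunday",    "next_action": "open grocery list"},
]

print(predict(memory, "late night, stressed about the deadline"))
# → message co-author
```

The key design choice is that the month of history lives outside the model as retrievable memory, rather than being crammed into a single context window.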

4. How They Taught It: The "Wait and See" Method

How do you know if the AI is right? You can't ask it to guess the future and then grade it immediately.

  • The Trick: They let the AI make a prediction, then they wait.
  • They watch what the user actually does next.
  • If the user did exactly what the AI predicted, the AI gets a "gold star" (a reward). If not, it gets a "try again."
  • They used another AI (an "LLM Judge") to grade how similar the prediction was to reality, acting like a teacher checking homework.
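The grading loop can be sketched as follows. The real system uses an LLM judge to score semantic similarity; the `difflib` string-similarity function and the 0.8 threshold here are crude stand-ins chosen for illustration.

```python
from difflib import SequenceMatcher

def judge(prediction, actual):
    """Toy stand-in for the paper's LLM judge: score how closely the
    predicted action matches what the user actually did (0.0 to 1.0)."""
    return SequenceMatcher(None, prediction.lower(), actual.lower()).ratio()

def reward(prediction, actual, threshold=0.8):
    """'Gold star' (1.0) if the judge deems the prediction close enough,
    otherwise 'try again' (0.0)."""
    return 1.0 if judge(prediction, actual) >= threshold else 0.0
```

Because the ground truth is just whatever the user does next, this supervision signal comes for free from continued observation.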

5. The Results: It Actually Works!

The results were impressive:

  • Single User: When trained on just one person, LongNAP was 79% better than standard AI models at guessing what that person would do next.
  • New Users: Even when trained on many people and tested on a new person it had never seen, it still performed better than the competition.
  • Accuracy: About 17% of the time, it predicted the user's next move perfectly. If they only looked at the predictions they were most confident about, that accuracy jumped to 26%.
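The confidence trade-off in the last bullet can be computed like this. The function and its inputs are a hypothetical sketch (the paper does not publish this code); it only illustrates how keeping the most confident predictions trades coverage for accuracy.

```python
def accuracy_at_confidence(predictions, top_fraction=1.0):
    """predictions: list of (confidence, is_correct) pairs.
    Keep only the top `top_fraction` most confident predictions and
    report accuracy over that subset."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * top_fraction))]
    return sum(correct for _, correct in kept) / len(kept)

preds = [(0.9, 1), (0.8, 1), (0.5, 0), (0.3, 0)]
print(accuracy_at_confidence(preds, 1.0))  # all predictions
print(accuracy_at_confidence(preds, 0.5))  # most confident half
```

Filtering by confidence is why a proactive assistant can act only when it is fairly sure, and stay quiet otherwise.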

6. Why This Matters (And the Privacy Catch)

The Good News: This means we are moving toward AI that truly understands us. It could help you by automatically opening the files you need, reminding you of tasks you usually do at this time, or even finishing repetitive tasks for you.

The Privacy Catch: To do this, the AI needs to see everything you do. The paper admits this is a privacy nightmare.

  • The Fix: They suggest running these models locally on your device (like on your own phone) so your data never leaves your house.
  • They also suggest "decentralizing" the data so no single company has a giant database of everyone's secrets.

The Big Picture

Think of LongNAP not as a robot that controls you, but as a digital shadow that learns your rhythm. It's the difference between a GPS that just tells you where you are, and a GPS that knows you always get hungry at 5 PM, so it suggests a restaurant before you even realize you're hungry.

The paper proves that with the right tools to collect data and a smart "librarian" style AI, we can finally build systems that anticipate our needs rather than just waiting for our commands.