Learning Next Action Predictors from Human-Computer Interaction

This paper introduces LongNAP, a user model that reasons over a user's full interaction history to predict their next action. Built on a large-scale dataset of 360K annotated multimodal interactions and a hybrid parametric-in-context learning approach, it significantly outperforms existing baselines.

Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang

Published Mon, 09 Ma

Imagine you have a digital assistant that doesn't just wait for you to ask for help, but actually knows what you're going to do before you even think about it. It's like having a friend who knows you so well that when you pick up your phone, they've already handed you the coffee you were about to reach for, or opened the document you were about to type in.

This paper introduces a system called LongNAP (Long-context Next Action Predictor) that tries to build this kind of "mind-reading" AI. Here's how it works, broken down into simple concepts:

1. The Problem: AI is Too "Short-Sighted"

Right now, most AI models are like people looking through a keyhole. They only see what you type into a chat box (your prompt). They don't know what you were doing five minutes ago, what you were looking at on your screen, or what your habits are. They are reactive, waiting for you to speak.

The authors want to build proactive AI. They want an AI that watches your entire digital life—your clicks, your screen changes, your scrolling—and says, "Ah, based on what I've seen you do for the last month, you're probably about to check your email and then message your co-author."

2. The Data Challenge: The "Passive Observer"

To teach an AI to do this, you need a massive amount of data. But asking people to write down every single thing they do on their computer for a month is impossible (and boring).

The Solution: NAPsack
The team built a tool called NAPsack. Think of NAPsack as a silent, invisible camera that runs in the background of your phone or computer.

  • It doesn't ask you to do anything.
  • It just takes screenshots and records your clicks.
  • Then, it uses a smart AI (a Vision-Language Model) to look at those screenshots and say, "Okay, the user just clicked 'Downloads,' then opened a PDF. That's an action."
  • They collected over 360,000 actions from 20 people over a month. That's 1,800 hours of screen time!
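The pipeline above can be sketched in a few lines. This is a minimal illustration, not NAPsack's actual code: the `Action` record fields and the `label_actions` helper are assumptions, and the VLM captioning step is replaced by captions that are simply given as input.

```python
from dataclasses import dataclass

@dataclass
class Action:
    timestamp: float   # seconds since the session started
    app: str           # foreground application at the time
    description: str   # VLM-generated label, e.g. "opened a PDF from Downloads"

def label_actions(events):
    """Turn raw (timestamp, app, caption) tuples into time-ordered Action records.

    In the real pipeline, a Vision-Language Model produces the caption from
    screenshots; here the captions are assumed to be provided already.
    """
    return [Action(t, app, caption) for t, app, caption in sorted(events)]

log = label_actions([
    (14.5, "Preview", "opened a PDF"),
    (12.0, "Finder", "clicked the 'Downloads' folder"),
])
```

Over a month of passive logging, records like these accumulate into the 360K-action dataset without the user ever writing anything down.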

3. The Model: The "Librarian" vs. The "Amnesiac"

Here is the tricky part: You can't feed 1,800 hours of data into a standard AI at once. It's like trying to read an entire library's worth of books in one second. The AI would get confused and forget things.

The Solution: LongNAP (The Smart Librarian)
Instead of trying to memorize everything in its brain (which is slow and rigid), LongNAP acts like a super-smart librarian.

  • Phase 1: Reasoning to Retrieve. When you are doing something now, LongNAP thinks, "Hmm, this looks like a time when the user was stressed about a deadline." It then runs to its memory library and pulls out a specific note from three weeks ago: "Last time the user was stressed, they messaged their co-author to divide the work."
  • Phase 2: Reasoning to Predict. It takes that old note, combines it with what you are doing right now, and predicts: "You are going to message your co-author."

It's not just guessing; it's retrieving past patterns to make a smart guess about the future.
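The two phases can be sketched with a toy retriever. In LongNAP the retrieval reasoning happens inside the language model itself; the word-overlap scoring and the `context`/`next_action` fields below are illustrative stand-ins, not the paper's method.

```python
def retrieve(memory, current, k=1):
    """Phase 1 (toy version): rank past episodes by word overlap with the
    current context, standing in for the model's reasoning-to-retrieve step."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    return sorted(memory, key=lambda ep: overlap(ep["context"], current),
                  reverse=True)[:k]

def predict(memory, current):
    """Phase 2: predict the next action suggested by the best-matching episode."""
    best = retrieve(memory, current, k=1)
    return best[0]["next_action"] if best else "unknown"

memory = [
    {"context": "stressed about paper deadline", "next_action": "message co-author"},
    {"context": "browsing recipes on Sunday",    "next_action": "open grocery list"},
]

print(predict(memory, "late night, stressed about the deadline"))
# → message co-author
```

The key design choice is that the month of history lives outside the model as retrievable memory, rather than being crammed into a single context window.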

4. How They Taught It: The "Wait and See" Method

How do you know if the AI is right? You can't ask it to guess the future and then grade it immediately.

  • The Trick: They let the AI make a prediction, then they wait.
  • They watch what the user actually does next.
  • If the user did exactly what the AI predicted, the AI gets a "gold star" (a reward). If not, it gets a "try again."
  • They used another AI (an "LLM Judge") to grade how similar the prediction was to reality, acting like a teacher checking homework.
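The grading loop can be sketched as follows. The real system uses an LLM judge to score semantic similarity; the `difflib` string-similarity function and the 0.8 threshold here are crude stand-ins chosen for illustration.

```python
from difflib import SequenceMatcher

def judge(prediction, actual):
    """Toy stand-in for the paper's LLM judge: score how closely the
    predicted action matches what the user actually did (0.0 to 1.0)."""
    return SequenceMatcher(None, prediction.lower(), actual.lower()).ratio()

def reward(prediction, actual, threshold=0.8):
    """'Gold star' (1.0) if the judge deems the prediction close enough,
    otherwise 'try again' (0.0)."""
    return 1.0 if judge(prediction, actual) >= threshold else 0.0
```

Because the ground truth is just whatever the user does next, this supervision signal comes for free from continued observation.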

5. The Results: It Actually Works!

The results were impressive:

  • Single User: When trained on just one person, LongNAP was 79% better than standard AI models at guessing what that person would do next.
  • New Users: Even when trained on many people and tested on a new person it had never seen, it still performed better than the competition.
  • Accuracy: About 17% of the time, it predicted the user's next move perfectly. If they only looked at the predictions they were most confident about, that accuracy jumped to 26%.
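The confidence trade-off in the last bullet can be computed like this. The function and its inputs are a hypothetical sketch (the paper does not publish this code); it only illustrates how keeping the most confident predictions trades coverage for accuracy.

```python
def accuracy_at_confidence(predictions, top_fraction=1.0):
    """predictions: list of (confidence, is_correct) pairs.
    Keep only the top `top_fraction` most confident predictions and
    report accuracy over that subset."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * top_fraction))]
    return sum(correct for _, correct in kept) / len(kept)

preds = [(0.9, 1), (0.8, 1), (0.5, 0), (0.3, 0)]
print(accuracy_at_confidence(preds, 1.0))  # all predictions
print(accuracy_at_confidence(preds, 0.5))  # most confident half
```

Filtering by confidence is why a proactive assistant can act only when it is fairly sure, and stay quiet otherwise.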

6. Why This Matters (And the Privacy Catch)

The Good News: This means we are moving toward AI that truly understands us. It could help you by automatically opening the files you need, reminding you of tasks you usually do at this time, or even finishing repetitive tasks for you.

The Privacy Catch: To do this, the AI needs to see everything you do. The paper admits this is a privacy nightmare.

  • The Fix: They suggest running these models locally on your device (like on your own phone) so your data never leaves your house.
  • They also suggest "decentralizing" the data so no single company has a giant database of everyone's secrets.

The Big Picture

Think of LongNAP not as a robot that controls you, but as a digital shadow that learns your rhythm. It's the difference between a GPS that just tells you where you are, and a GPS that knows you always get hungry at 5 PM, so it suggests a restaurant before you even realize you're hungry.

The paper proves that with the right tools to collect data and a smart "librarian" style AI, we can finally build systems that anticipate our needs rather than just waiting for our commands.