Technical Summary: A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

Problem Statement

While predictive code completion has significantly accelerated developer productivity in software engineering, analogous features for spreadsheet authoring remain virtually non-existent. Despite the ubiquity of spreadsheets, current automation tools are constrained to specific scenarios (e.g., formula generation or column derivation via FlashFill) or require explicit user invocation of natural language agents. For routine, repetitive edits, the overhead of prompting and waiting for responses often exceeds the cost of direct manipulation, leading users to default to manual entry.

The primary barriers to developing generalized next-action predictors for spreadsheets are twofold:

Data Scarcity: Unlike code, which has detailed version histories, public spreadsheet corpora lack fine-grained edit histories. Existing datasets typically only capture static snapshots or high-level evolution.
Evaluation Complexity: The space of spreadsheet actions is complex, involving spatial, temporal, and composite operations. Furthermore, a static "given history $x$, predict next action $y$" evaluation (teacher-forced) fails to capture the dynamic nature of user interaction, where accepted predictions alter the future state and subsequent user needs.

Methodology

1. Benchmark Dataset Construction

To address the lack of edit histories, the authors curated a dataset of 52 high-quality trajectories totaling 11,907 operations. These trajectories reconstruct the creation of spreadsheets from static, public workbooks. The construction pipeline involves three stages:

Symbolic Cold-Start: A vision-language model (VLM) annotates static sheets with semantic metadata (regions, dependencies, pasted ranges). Symbolic heuristics then decompose the final state into cell-level operations, merging adjacent identical operations into range actions.
LLM Refinement: An LLM-based judge-editor loop identifies and corrects unnatural patterns in the symbolic sequences (e.g., consolidating scattered cell-by-cell formatting into range operations, removing stray formatting).
Human Annotation: Human annotators perform a final pass to correct remaining unnatural subsequences. This step is substantial; the mean normalized edit distance between pre-annotation and final trajectories is 0.69, with 19 of 52 trajectories effectively rewritten from scratch.

The dataset covers diverse operations including input, merging, formatting (font, fill, border, alignment), pasting, and autofill.
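The merging step of the symbolic cold-start can be illustrated with a minimal sketch. It is not the authors' heuristic: the tuple layouts `CellOp` and `RangeOp`, the function name `merge_row_runs`, and the restriction to horizontal runs within a single row are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layouts, not taken from the paper.
CellOp = Tuple[int, int, str, str]        # (row, col, operation, argument)
RangeOp = Tuple[int, int, int, str, str]  # (row, first_col, last_col, operation, argument)


def merge_row_runs(ops: List[CellOp]) -> List[RangeOp]:
    """Merge horizontally adjacent cells that carry the same operation and argument."""
    merged: List[RangeOp] = []
    # Group by row, operation, and argument, then walk the columns left to right.
    for row, col, name, arg in sorted(ops, key=lambda o: (o[0], o[2], o[3], o[1])):
        if merged:
            r, c0, c1, n, a = merged[-1]
            if (r, n, a) == (row, name, arg) and col == c1 + 1:
                merged[-1] = (r, c0, col, n, a)  # extend the current run by one cell
                continue
        merged.append((row, col, col, name, arg))  # start a new single-cell run
    return merged


cell_ops = [
    (1, 1, "fill", "yellow"), (1, 2, "fill", "yellow"), (1, 3, "fill", "yellow"),
    (1, 5, "fill", "yellow"),  # gap at column 4, so this starts a new run
    (1, 1, "bold", "true"),
    (2, 1, "fill", "yellow"),
]
range_ops = merge_row_runs(cell_ops)
```

Because a static sheet carries no temporal order, the sketch sorts the operations freely; the ordering of a natural trajectory is what the later LLM and human passes are described as repairing.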

2. Online Evaluation Framework

The paper proposes an online evaluation framework that simulates a real user workflow, moving beyond static step-wise scoring.

Process: The system observes a history of $n$ actions and predicts a sequence of zero or more actions.
Acceptance/Rejection: Based on an acceptance heuristic (e.g., precision thresholds, user action savings), the prediction is either accepted or rejected.
State Adaptation:
- If Accepted: The future ground-truth trajectory is dynamically updated. Successful predictions remove corresponding future operations. False positives trigger the insertion of inverse operations (e.g., clearing a wrong fill) to undo errors.
- If Rejected: The prediction is discarded, and the next ground-truth user action is added to the history.
Termination: The loop repeats until the target spreadsheet is reached or a step threshold is exceeded.
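The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: an operation is modeled as a hypothetical `((cell, property), value)` pair, the acceptance heuristic is a plain precision threshold checked against the remaining ground truth, and the inverse of a false positive restores whatever value the property held before.

```python
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]           # (cell address, property name); hypothetical encoding
Op = Tuple[Key, Optional[str]]  # (key, new value); None clears the property


def apply(state: Dict[Key, str], op: Op) -> None:
    key, value = op
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


def emulate(trajectory: List[Op], predict: Callable[[List[Op]], List[Op]],
            threshold: float = 0.5, max_steps: int = 1000) -> dict:
    state: Dict[Key, str] = {}
    history: List[Op] = []
    remaining = list(trajectory)  # future ground truth, rewritten as we go
    user_actions = accepted = steps = 0
    while remaining and steps < max_steps:
        steps += 1
        pred = predict(history)
        hits = sum(op in remaining for op in pred)
        if pred and hits / len(pred) >= threshold:
            accepted += 1
            for op in pred:
                if op in remaining:
                    remaining.remove(op)  # correct: the user no longer has to do it
                else:
                    # false positive: schedule an inverse op that restores the old value
                    remaining.insert(0, (op[0], state.get(op[0])))
                apply(state, op)
                history.append(op)
        else:
            op = remaining.pop(0)  # rejected or empty: the user performs the next action
            apply(state, op)
            history.append(op)
            user_actions += 1
    return {"state": state, "user_actions": user_actions,
            "accepted": accepted, "steps": steps}


truth = [(("A1", "value"), "1"), (("A2", "value"), "2"), (("A3", "value"), "3")]

def good(history):   # toy predictor: proposes the rest once it has seen one action
    return truth[1:] if history else []

def noisy(history):  # toy predictor: one correct property plus one stray fill
    return [truth[1], (("B1", "fill"), "red")] if len(history) == 1 else []

good_run = emulate(truth, good)
noisy_run = emulate(truth, noisy)
```

In the `noisy` run the accepted prediction saves one action but costs one undo, so the user still performs three actions, which is the kind of net effect a static step-wise score would not show.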

3. Metrics

The framework computes metrics at three granularities:

Property/Action Level: Classifies individual (cell, property) pairs as True Positives (TP), False Positives (FP), False Negatives (FN), or Mismatches (MM).
Prediction Level: Measures Precision (fraction of correct properties) and User Actions Saved (UAS), which quantifies the net reduction in user effort if the prediction were accepted.
Emulation Level: Tracks Acceptance Rate (AR), Average Precision, and Predictability Coverage (PCOV), the fraction of theoretically predictable actions (determined by an oracle) that the system actually produced.
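As a rough illustration of the property-level labels and prediction-level precision, the sketch below compares a predicted set of (cell, property) values with the ground truth. The label definitions are assumptions, since this summary does not spell them out: a Mismatch is taken to be a predicted pair whose key exists in the ground truth with a different value, and a False Positive a predicted pair whose key does not appear at all. UAS is left out because its exact formula is not given here.

```python
from collections import Counter
from typing import Dict, Tuple

Key = Tuple[str, str]  # (cell address, property name); hypothetical encoding


def classify(predicted: Dict[Key, str], target: Dict[Key, str]) -> Dict[Key, str]:
    """Label every (cell, property) pair seen in either mapping."""
    labels: Dict[Key, str] = {}
    for key, value in predicted.items():
        if key not in target:
            labels[key] = "FP"  # predicted, but never wanted
        elif target[key] == value:
            labels[key] = "TP"  # predicted with the right value
        else:
            labels[key] = "MM"  # right place, wrong value
    for key in target:
        if key not in predicted:
            labels[key] = "FN"  # wanted, but not predicted
    return labels


def precision(labels: Dict[Key, str]) -> float:
    """Fraction of emitted properties that are correct (assumed: TP / (TP + FP + MM))."""
    counts = Counter(labels.values())
    emitted = counts["TP"] + counts["FP"] + counts["MM"]
    return counts["TP"] / emitted if emitted else 0.0


predicted = {("A1", "fill"): "red", ("A2", "fill"): "blue", ("C9", "bold"): "true"}
target = {("A1", "fill"): "red", ("A2", "fill"): "red", ("A3", "fill"): "red"}
labels = classify(predicted, target)
```

Here one of the three emitted properties is correct, so the assumed precision is 1/3; a threshold-based acceptance heuristic would compare this value against its cutoff.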

4. Baseline Solvers

The framework evaluates three families of solvers:

Zero-shot LLMs: Models (GPT-5 variants) prompted with history and operation syntax.
Fine-tuned SLMs: SmolLM2 models (135M and 360M parameters) trained on synthetic operation sequences.
Classical ML: N-gram models (trained and online), LSTM, and XGBoost.
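To make the "online" n-gram idea concrete, here is a minimal sketch of a predictor that updates its counts as actions arrive and abstains when its most frequent continuation is not dominant enough. It is an illustration only: the class name `OnlineNGram`, the string tokens standing in for operations, and the `min_conf` cutoff are assumptions, not details of the paper's baseline.

```python
from collections import Counter, defaultdict
from typing import List, Optional


class OnlineNGram:
    """Counts n-grams over the session so far; returns None to abstain."""

    def __init__(self, n: int = 2, min_conf: float = 0.6):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        self.min_conf = min_conf  # hypothetical abstention threshold
        self.counts = defaultdict(Counter)
        self.history: List[str] = []

    def observe(self, token: str) -> None:
        context = tuple(self.history[-(self.n - 1):])
        if len(context) == self.n - 1:  # only count once a full context exists
            self.counts[context][token] += 1
        self.history.append(token)

    def predict(self) -> Optional[str]:
        context = tuple(self.history[-(self.n - 1):])
        seen = self.counts.get(context)
        if not seen:
            return None  # unseen context: abstain
        token, freq = seen.most_common(1)[0]
        if freq / sum(seen.values()) < self.min_conf:
            return None  # low confidence: abstain
        return token


steady = OnlineNGram()
for tok in ["input", "bold", "input", "bold", "input"]:
    steady.observe(tok)

mixed = OnlineNGram()
for tok in ["input", "bold", "input", "fill", "input"]:
    mixed.observe(tok)
```

After the alternating sequence, `steady.predict()` proposes `"bold"`, while `mixed.predict()` abstains because the two continuations seen after `"input"` are equally frequent.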

Key Results

Learnability: The task is learnable, and performance tracks model capability. GPT-5 with reasoning achieves 32.7% UAS in the single-action repredict setting, while GPT-5 mini achieves 18.0%. The fine-tuned SmolLM2-360M (26.8% UAS) approaches the performance of GPT-5 (27.4%) despite being significantly smaller.
The Importance of Abstention: Models that lack the ability to abstain perform poorly. The "ALWAYS" heuristic (accepting every prediction) yields -19.2% UAS (net negative savings) due to low precision (9.3%). This confirms that knowing when not to predict is as critical as prediction accuracy.
Trigger Frequency: Invoking the predictor after every user action ($s=1$) yields the highest UAS (27.4%) despite a lower acceptance rate (30.9%) compared to less frequent triggers. This suggests that cheap, frequent triggers are valuable, as users can reject incorrect suggestions without significant penalty.
Action Categories: Content-heavy operations (Input, Paste, Fill) are accepted at higher rates than presentational ones (Align, Border). Fine-tuning significantly improves performance on structural categories (Border, Fill, Autofill) where base models struggled.
Context Length: Increasing the context window from 32 to 128 operations improves UAS, but gains diminish rapidly beyond 128, suggesting most predictive signal resides in recent history.
Prediction Length: In multi-action settings, unlimited prediction scope performs best. Constraining the number of actions per prediction reduces UAS, indicating models self-regulate well when allowed to emit longer sequences for repetitive patterns.

Significance and Contributions

The paper makes three primary contributions:

Benchmark Dataset: The first curated dataset of 52 spreadsheet creation trajectories (11,907 operations) with human-validated ground truth, addressing the critical lack of edit history data.
Online Evaluation Framework: A novel evaluation methodology that models user acceptance behavior and dynamically adapts ground-truth trajectories. This captures real-world utility and error compounding, which static offline evaluations miss.
Design Insights: By applying this framework to various baselines, the authors demonstrate that:
- Action prediction is a viable task for both large and small models.
- Abstention mechanisms are crucial for utility; models must learn to suppress predictions when confidence is low.
- Cheap triggers (frequent prediction attempts) are more effective than waiting for high-confidence moments.
- Fine-tuning on domain-specific operation sequences allows small models to rival large zero-shot LLMs.

The authors conclude that this benchmark and framework provide a necessary foundation for developing proactive, modeless assistants for spreadsheets, bridging the gap between code completion and spreadsheet productivity. They explicitly encourage research into less energy-intensive methods (like the fine-tuned SLMs) to solve this problem.
