SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

This paper introduces SpreadsheetArena, a platform for evaluating large language models' end-to-end spreadsheet generation through blind pairwise comparisons. It reveals that while models can produce functional workbooks, they often fail to align with domain-specific best practices, and that user preferences vary significantly across use cases.

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling

Published Thu, 12 Ma

Imagine you've asked a team of AI chefs to create a complex, multi-course meal (a spreadsheet) based on a vague recipe you wrote on a napkin. Some chefs might make a dish that tastes great but looks messy. Others might make a beautiful presentation that tastes bland. How do you decide who is the best chef?

This paper, SpreadsheetArena, is essentially a giant, blind taste-test competition for AI models trying to build spreadsheets. Here is the breakdown in simple terms:

1. The Arena: A Blind Taste Test

The researchers built a platform called SpreadsheetArena. Think of it like a "Taste-Off" for spreadsheets.

  • The Setup: A user types a request (e.g., "Make a budget for a hotel" or "Create a game of checkers in a spreadsheet").
  • The Contest: The system asks 16 different AI models (like Claude, GPT-5, Gemini, etc.) to build that spreadsheet.
  • The Vote: You, the user, see two spreadsheets side-by-side. You don't know which AI made which one. You vote for the one you like better, or say they are both bad.
  • The Score: Based on thousands of these votes, the AIs get an "Elo rating" (like a chess ranking). The higher the score, the better the AI is at making spreadsheets people actually like.
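The Elo mechanics behind that scoring can be sketched in a few lines. This is the standard chess-style update; the platform's exact K-factor and handling of "both bad" votes are assumptions here:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a single pairwise vote.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    The K-factor of 32 and the 400-point scale are conventional
    defaults, not necessarily what the arena uses.
    """
    # Expected win probability for A given the current ratings.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    # Winner gains what the loser gives up, scaled by how surprising
    # the outcome was.
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1500; model A wins one vote.
a, b = elo_update(1500, 1500, 1.0)  # a -> 1516.0, b -> 1484.0
```

Run over thousands of votes, upsets against highly rated models move the ratings the most, which is why a stable leaderboard emerges even though each individual vote is noisy.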

The Result: Currently, the Claude family of models is winning the most votes, acting like the "Michelin-star chefs" of the spreadsheet world.

2. The Twist: It's Not Just About "Right or Wrong"

In coding, a program either works or it crashes. In spreadsheets, it's much more subjective.

  • The "Look" Matters: The researchers found that users love spreadsheets that look polished. If an AI adds bold text, colors, or borders, people are more likely to vote for it, even if the math is slightly off.
  • The "Length" Bias: Just like people prefer longer answers in a chat, people prefer spreadsheets that are "fuller"—more cells filled, more sheets, more text.
  • The Trap: Sometimes an AI wins because it made a spreadsheet that looks professional but is actually broken inside. It's like a cake that looks beautiful but is raw dough in the middle.

3. The "Feature Adjustment": Peeling Back the Paint

The researchers realized that some AIs were winning just because they were good at "painting the house" (formatting) rather than "building the foundation" (math).

  • They created a statistical tool to "strip away" the formatting and look at the raw performance.
  • The Surprise: When they adjusted for these superficial features, the rankings changed!
    • Gemini 3 Pro jumped up the leaderboard (it was actually building better foundations than we thought).
    • Claude Opus 4.5 dropped a few spots (it was relying a bit too much on looking pretty).
  • The Lesson: A spreadsheet that looks good isn't always the best spreadsheet.
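One way such an adjustment can work (a minimal sketch, not necessarily the paper's exact statistical tool) is a Bradley-Terry-style model with a covariate for style: each model gets a skill score, and a shared weight soaks up the vote lift attributable to superficial features like formatting or length, leaving the skill scores as the "adjusted" ranking.

```python
import math

# Toy votes: (model_a, model_b, style_diff, a_won).
# style_diff is a hypothetical style-feature difference (e.g. A's
# formatting score minus B's); a_won is 1 if A got the vote, else 0.
votes = [
    (0, 1,  1.0, 1), (0, 1, -1.0, 0),
    (1, 0,  1.0, 1), (1, 0, -1.0, 0),
]

def fit(votes, n_models, steps=2000, lr=0.1):
    """Fit P(A beats B) = sigmoid(theta[A] - theta[B] + w * style_diff)
    by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * n_models  # "adjusted" skill per model
    w = 0.0                   # weight absorbed by the style feature
    for _ in range(steps):
        for a, b, d, y in votes:
            p = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b] + w * d)))
            g = y - p  # gradient of the log-likelihood w.r.t. the logit
            theta[a] += lr * g
            theta[b] -= lr * g
            w += lr * g * d
    return theta, w

theta, w = fit(votes, n_models=2)
# In this toy data every win is explained by style alone, so w comes
# out strongly positive while the two skill scores stay nearly equal:
# neither model is "actually" better once formatting is stripped away.
```

Ranking models by `theta` instead of raw win rate is what can reorder a leaderboard: a model that wins mostly through formatting loses its edge once `w` absorbs that effect.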

4. The "Domain" Problem: One Size Does Not Fit All

The paper discovered that what makes a "good" spreadsheet depends entirely on what you are making.

  • Academic Research: Scientists prefer raw, unformatted data. They hate fancy colors. If an AI adds too much formatting here, it actually loses points.
  • Professional Finance: Bankers and investors have strict rules. They need specific colors (blue for inputs, black for formulas) and specific layouts.
  • The Conflict: The "crowd" voting in the arena often prefers pretty, colorful spreadsheets. But when financial experts looked at the same spreadsheets, they were often horrified. They found that even the "winning" AIs frequently broke the golden rules of financial modeling (like hard-coding numbers instead of linking them).

5. The Failure Modes: How AIs Mess Up

The researchers categorized how AIs fail, finding that different models have different "personalities" of failure:

  • The "Lazy" Model: Misses parts of the prompt entirely (e.g., forgets to make the "Profit" tab).
  • The "Confused" Model: The math is linked, but the logic is wrong (e.g., calculating tax on the wrong number).
  • The "Messy" Model: The math is right, but it's impossible to read because the formatting is a disaster.
  • The "Deceptive" Model: This is the most dangerous. The spreadsheet looks perfect and professional, but the internal logic is broken. It's a "wolf in sheep's clothing."

The Big Takeaway

This paper tells us that making a spreadsheet is much harder for AI than just writing code.

  • It requires balancing math (is it right?), structure (is it organized?), and style (does it look good?).
  • Current AI models are getting better at the "style" part, but they still struggle with the deep, professional rules required for high-stakes jobs like finance.
  • The Future: We need to train AI not just to make things that look good to a crowd, but to follow the strict, invisible rules that experts use to keep spreadsheets safe and accurate.

In short: AI is great at making spreadsheets that look like a spreadsheet, but we still need humans to check if the spreadsheet actually works like a spreadsheet.