SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

This paper introduces SpreadsheetArena, a platform for evaluating large language models' end-to-end spreadsheet generation through blind pairwise comparisons. It reveals that while models can produce functional workbooks, they often fail to align with domain-specific best practices, and that user preferences vary significantly across use cases.

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling

Published Thu, 12 Ma

Imagine you've asked a team of AI chefs to create a complex, multi-course meal (a spreadsheet) based on a vague recipe you wrote on a napkin. Some chefs might make a dish that tastes great but looks messy. Others might make a beautiful presentation that tastes bland. How do you decide who is the best chef?

This paper, SpreadsheetArena, is essentially a giant, blind taste-test competition for AI models trying to build spreadsheets. Here is the breakdown in simple terms:

1. The Arena: A Blind Taste Test

The researchers built a platform called SpreadsheetArena. Think of it like a "Taste-Off" for spreadsheets.

  • The Setup: A user types a request (e.g., "Make a budget for a hotel" or "Create a game of checkers in a spreadsheet").
  • The Contest: The system asks 16 different AI models (like Claude, GPT-5, Gemini, etc.) to build that spreadsheet.
  • The Vote: You, the user, see two spreadsheets side-by-side. You don't know which AI made which one. You vote for the one you like better, or say they are both bad.
  • The Score: Based on thousands of these votes, the AIs get an "Elo rating" (like a chess ranking). The higher the score, the better the AI is at making spreadsheets people actually like.
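The Elo mechanics behind that scoring can be sketched in a few lines. This is the standard chess-style update; the platform's exact K-factor and handling of "both bad" votes are assumptions here:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a single pairwise vote.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    The K-factor of 32 and the 400-point scale are conventional
    defaults, not necessarily what the arena uses.
    """
    # Expected win probability for A given the current ratings.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    # Winner gains what the loser gives up, scaled by how surprising
    # the outcome was.
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1500; model A wins one vote.
a, b = elo_update(1500, 1500, 1.0)  # a -> 1516.0, b -> 1484.0
```

Run over thousands of votes, upsets against highly rated models move the ratings the most, which is why a stable leaderboard emerges even though each individual vote is noisy.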

The Result: Currently, the Claude family of models is winning the most votes, acting like the "Michelin-star chefs" of the spreadsheet world.

2. The Twist: It's Not Just About "Right or Wrong"

In coding, a program either works or it crashes. In spreadsheets, it's much more subjective.

  • The "Look" Matters: The researchers found that users love spreadsheets that look polished. If an AI adds bold text, colors, or borders, people are more likely to vote for it, even if the math is slightly off.
  • The "Length" Bias: Just like people prefer longer answers in a chat, people prefer spreadsheets that are "fuller"—more cells filled, more sheets, more text.
  • The Trap: Sometimes an AI wins because it made a spreadsheet that looks professional but is actually broken inside. It's like a cake that looks beautiful but is raw dough in the middle.

3. The "Feature Adjustment": Peeling Back the Paint

The researchers realized that some AIs were winning just because they were good at "painting the house" (formatting) rather than "building the foundation" (math).

  • They created a statistical tool to "strip away" the formatting and look at the raw performance.
  • The Surprise: When they adjusted for these superficial features, the rankings changed!
    • Gemini 3 Pro jumped up the leaderboard (it was actually building better foundations than we thought).
    • Claude Opus 4.5 dropped a few spots (it was relying a bit too much on looking pretty).
  • The Lesson: A spreadsheet that looks good isn't always the best spreadsheet.
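One way such an adjustment can work (a minimal sketch, not necessarily the paper's exact statistical tool) is a Bradley-Terry-style model with a covariate for style: each model gets a skill score, and a shared weight soaks up the vote lift attributable to superficial features like formatting or length, leaving the skill scores as the "adjusted" ranking.

```python
import math

# Toy votes: (model_a, model_b, style_diff, a_won).
# style_diff is a hypothetical style-feature difference (e.g. A's
# formatting score minus B's); a_won is 1 if A got the vote, else 0.
votes = [
    (0, 1,  1.0, 1), (0, 1, -1.0, 0),
    (1, 0,  1.0, 1), (1, 0, -1.0, 0),
]

def fit(votes, n_models, steps=2000, lr=0.1):
    """Fit P(A beats B) = sigmoid(theta[A] - theta[B] + w * style_diff)
    by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * n_models  # "adjusted" skill per model
    w = 0.0                   # weight absorbed by the style feature
    for _ in range(steps):
        for a, b, d, y in votes:
            p = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b] + w * d)))
            g = y - p  # gradient of the log-likelihood w.r.t. the logit
            theta[a] += lr * g
            theta[b] -= lr * g
            w += lr * g * d
    return theta, w

theta, w = fit(votes, n_models=2)
# In this toy data every win is explained by style alone, so w comes
# out strongly positive while the two skill scores stay nearly equal:
# neither model is "actually" better once formatting is stripped away.
```

Ranking models by `theta` instead of raw win rate is what can reorder a leaderboard: a model that wins mostly through formatting loses its edge once `w` absorbs that effect.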

4. The "Domain" Problem: One Size Does Not Fit All

The paper discovered that what makes a "good" spreadsheet depends entirely on what you are making.

  • Academic Research: Scientists prefer raw, unformatted data. They hate fancy colors. If an AI adds too much formatting here, it actually loses points.
  • Professional Finance: Bankers and investors have strict rules. They need specific colors (blue for inputs, black for formulas) and specific layouts.
  • The Conflict: The "crowd" voting in the arena often prefers pretty, colorful spreadsheets. But when financial experts looked at the same spreadsheets, they were often horrified. They found that even the "winning" AIs frequently broke the golden rules of financial modeling (like hard-coding numbers instead of linking them).

5. The Failure Modes: How AIs Mess Up

The researchers categorized how AIs fail, finding that different models have different "personalities" of failure:

  • The "Lazy" Model: Misses parts of the prompt entirely (e.g., forgets to make the "Profit" tab).
  • The "Confused" Model: The math is linked, but the logic is wrong (e.g., calculating tax on the wrong number).
  • The "Messy" Model: The math is right, but it's impossible to read because the formatting is a disaster.
  • The "Deceptive" Model: This is the most dangerous. The spreadsheet looks perfect and professional, but the internal logic is broken. It's a "wolf in sheep's clothing."

The Big Takeaway

This paper tells us that making a spreadsheet is much harder for AI than just writing code.

  • It requires balancing math (is it right?), structure (is it organized?), and style (does it look good?).
  • Current AI models are getting better at the "style" part, but they still struggle with the deep, professional rules required for high-stakes jobs like finance.
  • The Future: We need to train AI not just to make things that look good to a crowd, but to follow the strict, invisible rules that experts use to keep spreadsheets safe and accurate.

In short: AI is great at making spreadsheets that look like a spreadsheet, but we still need humans to check if the spreadsheet actually works like a spreadsheet.