Imagine you are a detective trying to solve a mystery, but instead of reading a clear novel, you are handed a massive, messy, handwritten ledger filled with numbers, crossed-out lines, and sticky notes. Now, imagine you hire a super-smart AI assistant to read this ledger, find specific clues, and do the math for you.
This paper, FinSheet-Bench, is essentially a report card on how well these AI detectives are doing at that specific job.
Here is the breakdown in simple terms:
1. The Problem: The "Messy Ledger"
In the world of finance (like Private Equity), people don't use neat, clean databases. They use Excel spreadsheets. These are often chaotic:
- They have merged cells (where one big box covers four smaller ones).
- They have headers that span three rows.
- They use bold text or colors to mean "this is a total" or "this is a warning."
- They have multiple tabs (sheets) that reference each other.
For a human, this is manageable. For an AI, it's a nightmare. A language model can only read a long, one-dimensional string of text, so when you feed it an Excel file, that 2D grid has to be flattened into a 1D line of text first. In the flattening, the AI loses the "visual map": the spatial cues that tell a human, for instance, that a bold number sitting at the bottom of a column is a total.
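To see what that flattening looks like, here is a toy sketch. The grid, names, and serialization format are invented for illustration; the paper's actual serializer may differ.

```python
# Illustrative sketch: flattening a 2D grid into 1D text.
# The grid below has a header that spans two columns (a "merged cell")
# and a "Total" row that would be bold in Excel. Once serialized,
# both the merge and the bold styling are simply gone.

grid = [
    ["Fund I (2020 vintage)", ""],   # merged header cell spanning both columns
    ["Company", "Debt ($M)"],
    ["Acme Co", "120"],
    ["Beta Ltd", "80"],
    ["Total", "200"],                # bold in Excel; indistinguishable here
]

# Row-major serialization: each row becomes one tab-separated line.
flat = "\n".join("\t".join(cell for cell in row) for row in grid)
print(flat)
# To the model, "Total\t200" is just another line of text. Nothing
# marks it as a sum sitting at the bottom of a column.
```

This is why the "visual map" matters: every cue a human uses (position, merging, bold) is lost in the one-dimensional string.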
2. The Test: "FinSheet-Bench"
Since real financial data is top-secret (like a bank vault), the researchers couldn't use real documents. So, they built a giant, fake financial universe.
- They took the structure of real, messy financial spreadsheets.
- They filled them with fake company names and fake numbers.
- They created 24 different versions of these spreadsheets, ranging from "easy" (a small list) to "nightmare" (a massive document with 152 companies and 8 different funds).
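The generation idea can be sketched in a few lines. Everything here is invented for illustration (the field names, value ranges, and fund/company naming scheme are assumptions, not the paper's actual generator); only the headline numbers, 8 funds and 152 companies, come from the text above.

```python
import random

# Minimal sketch of synthetic financial data generation. All names and
# numeric ranges below are fabricated placeholders.
random.seed(0)  # reproducible fake data

def make_fund(fund_id: int, n_companies: int) -> dict:
    """Build one fake fund holding fake portfolio companies."""
    return {
        "fund": f"Fund {fund_id}",
        "vintage": 2015 + fund_id,
        "companies": [
            {
                "name": f"Company-{fund_id}-{i}",
                "debt": round(random.uniform(10, 500), 1),    # $M, fabricated
                "ebitda": round(random.uniform(5, 100), 1),   # $M, fabricated
            }
            for i in range(n_companies)
        ],
    }

# An "easy" universe (1 small fund) vs. a "nightmare" one
# (8 funds x 19 companies = 152 companies, matching the paper's largest case).
easy = [make_fund(1, 5)]
hard = [make_fund(i, 19) for i in range(1, 9)]
print(sum(len(f["companies"]) for f in hard))  # 152
```

Because the structure is real but the values are random, the benchmark can be published without leaking any actual financial data.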
Then, they asked 10 different top-tier AI models (from companies like OpenAI, Google, and Anthropic) to answer questions like:
- "How many funds are there?" (Easy)
- "List all companies in the newest fund." (Medium)
- "Calculate the average debt-to-EBITDA ratio for all funds." (Hard)
3. The Results: The "Smart but Clumsy" AI
The results were a mix of "Wow" and "Yikes."
- The Good News: The AIs are getting much smarter. Two years ago, they were barely passing. Now, the best AI (Gemini 3.1 Pro) gets about 82% accuracy. That sounds good, right?
- The Bad News: In the world of finance, 82% is a disaster.
- If you are writing a poem, 82% is great.
- If you are managing billions of dollars, getting roughly 1 out of every 6 answers wrong (an 18% error rate) is unacceptable. It's like a pilot landing a plane with a 1-in-6 chance of crashing.
- The paper notes that for a professional to trust an AI, they need about 97% accuracy. We aren't there yet.
The "Complexity Cliff":
The AIs did great on simple questions (like "How many funds?"). But as soon as the task required math or sorting (like "Find the company with the highest debt and calculate the average"), their scores plummeted.
- On simple lookups: ~90% accuracy.
- On complex math: ~20–30% accuracy.
It's as if the AI is a brilliant librarian who can find a book on a shelf instantly, but if you ask it to add up the prices of 50 books on that shelf, it starts counting on its fingers and gets the total wrong.
4. Why Do They Fail?
The paper suggests two main reasons:
- The Translation Problem: Converting a visual spreadsheet into text is like translating a map into a poem. You lose the spatial clues (like "this number is at the bottom, so it's a sum").
- The Math Problem: AIs are great at predicting the next word in a sentence. They are not great at being calculators. They "guess" the answer token by token, based on patterns seen in training, rather than actually executing the calculation.
5. The Solution: Don't Ask the AI to Do Everything
The paper proposes a clever fix. Instead of asking the AI to "Read this whole spreadsheet and do the math," we should split the job:
- The AI as the Scanner: Ask the AI to just find the numbers (e.g., "What is the debt for Company X?"). The AI is actually very good at this simple lookup.
- The Computer as the Calculator: Once the AI finds the numbers, pass them to a standard computer program (like a calculator or Python script) to do the actual math, sorting, and averaging.
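A minimal sketch of that split is below. The `llm_lookup` function is a hypothetical stand-in for a real model call (the paper does not prescribe a specific API); everything after it is plain, deterministic Python.

```python
# Sketch of the "AI as Scanner, Computer as Calculator" split.

def llm_lookup(question: str) -> float:
    """Hypothetical stand-in for a narrow LLM lookup call.

    In a real system this would query a model with one simple question,
    e.g. "What is the debt for Acme Co?". Here we fake the answers so the
    sketch is runnable; the company names and numbers are invented.
    """
    fake_answers = {
        "debt for Acme Co": 120.0,
        "ebitda for Acme Co": 40.0,
        "debt for Beta Ltd": 80.0,
        "ebitda for Beta Ltd": 20.0,
    }
    return fake_answers[question]

# Step 1 (AI as the Scanner): the model only *finds* numbers.
companies = ["Acme Co", "Beta Ltd"]
ratios = [
    llm_lookup(f"debt for {c}") / llm_lookup(f"ebitda for {c}")
    for c in companies
]

# Step 2 (Computer as the Calculator): deterministic code does the math.
avg_ratio = sum(ratios) / len(ratios)
print(avg_ratio)  # (3.0 + 4.0) / 2 = 3.5
```

The division, sorting, and averaging never touch the model, so the "complexity cliff" on math-heavy questions is sidestepped entirely: the model is only ever asked the simple lookups it is already good at.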
The Analogy:
Think of the AI as a super-fast intern and the computer as a calculator.
- Current Approach: You ask the intern to read the ledger, find the numbers, and do the long division in their head. They get tired and make mistakes.
- Proposed Approach: You ask the intern to just find the numbers and write them down. Then, you hand the list to the calculator to do the math. The intern never has to do the arithmetic, so that whole class of mistakes disappears.
The Bottom Line
AI is getting incredibly good at reading financial documents, but it is not ready to work alone in a bank or investment firm yet. It's too prone to calculation errors.
To make it useful, we need to stop treating the AI as a "brain" that does everything, and start treating it as a "scanner" that finds data, which we then feed into a reliable calculator. Until we build that bridge, humans will still need to double-check every single number.