Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL

This paper demonstrates that reasoning Large Language Models significantly reduce cloud query execution costs and data scanned compared to non-reasoning models in Text-to-SQL tasks. It also shows that execution time is a poor proxy for cost efficiency, and it highlights the substantial financial risk posed by non-reasoning models' tendency to generate inefficient queries.

Saurabh Deochake, Debajyoti Mukhopadhyay

Published 2026-03-10

Here is an explanation of the paper, translated from "academic speak" into everyday language with some creative analogies.

🧠 The Big Idea: Speed vs. Price

Imagine you hire two different chefs to cook a meal for a party.

  • Chef A (The "Reasoning" Chef) takes a long time to think, taste, and plan the recipe before cooking.
  • Chef B (The "Standard" Chef) rushes to the stove immediately, throwing ingredients together as fast as possible.

In the world of AI and databases, we usually care about how fast the food is ready (Execution Time). But this paper asks a different question: "How much did the ingredients cost?"

The authors found that while the "Standard" chefs are fast, they often waste massive amounts of expensive ingredients. The "Reasoning" chefs, who take a moment to think first, actually save you a lot of money, even if they take a tiny bit longer to start cooking.


🏢 The Setting: The Cloud Kitchen

The researchers tested this in a "Cloud Kitchen" called Google BigQuery.

  • The Menu: They used a massive dataset called StackOverflow (230 GB of data), which is like a library containing every question and answer ever posted on the site.
  • The Bill: In this kitchen, you don't pay by the hour. You pay by how much food you eat. If you scan a whole table of data but only need one column, you pay for the whole table. It's like paying for a whole pizza just because you wanted one slice.
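That billing model fits in a few lines of code. The sketch below is illustrative, not the paper's code; the $6.25-per-TiB rate is an assumption based on BigQuery's published on-demand pricing, which varies by region and over time.

```python
# Sketch of BigQuery-style on-demand billing: you pay for bytes
# scanned, not for wall-clock time. The rate is an assumed example
# ($6.25 per TiB); real pricing varies by region and edition.
PRICE_PER_TIB_USD = 6.25
TIB = 2**40  # bytes in one tebibyte
GIB = 2**30  # bytes in one gibibyte

def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimated query cost from bytes scanned alone."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD

# A query that scans 36 GiB costs the same whether it finishes in
# 2 seconds or 2 minutes -- speed never enters the formula.
print(f"${scan_cost_usd(36 * GIB):.2f} for a 36 GiB scan")  # $0.22
```

Notice that runtime appears nowhere in the formula, which is exactly why the paper's later findings about speed are so counterintuitive.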

🧪 The Experiment

The team asked 6 different AI chefs (3 "Reasoning" models and 3 "Standard" models) to write 30 different SQL queries (recipes) to get specific information from that 230 GB library.

They measured:

  1. Correctness: Did the recipe actually make the dish?
  2. Speed: How fast was it served?
  3. Cost: How many bytes of data did the AI scan to get the answer? (This is the money metric).

🔑 The Big Discoveries

1. The "Thinker" Chefs Save Money

The Reasoning Models (the ones that pause to think) were 44.5% cheaper to run than the Standard Models.

  • Analogy: Imagine you need to find a specific book in a library.
    • The Standard Model runs into the library, grabs every single book off the shelves, and starts flipping through them until it finds the one. It's fast to start, but it moves a mountain of books (high cost).
    • The Reasoning Model walks to the desk, asks the librarian exactly which shelf the book is on, and walks straight there. It moves fewer books (low cost).
  • Result: The Reasoning models scanned significantly less data, saving money, while still getting the right answer 96%–100% of the time.

2. Speed is a Trap (The "Fast & Expensive" Illusion)

The biggest surprise was that speed does not equal savings.

  • The Correlation: The paper found a very weak correlation (r = 0.16) between how fast a query ran and how much it cost.
  • Analogy: Imagine a race car driver who takes a 100-mile detour at 200 mph. They still finish the race quickly, but they burned a fortune in gas.
  • Reality: A query can finish in 2 seconds because the cloud computer is super powerful (parallel processing), but if it scanned 36 GB of data to do it, it cost a fortune. Don't trust speed as a sign of efficiency.
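A tiny numerical sketch makes the trap concrete. The figures below are invented for illustration (they are not measurements from the paper), and the per-TiB rate is again an assumed example:

```python
# Why runtime is a poor proxy for cost. All numbers below are
# invented for illustration, not measurements from the paper.
PRICE_PER_TIB_USD = 6.25  # assumed on-demand rate

def cost_usd(gib_scanned: float) -> float:
    """Cost of a query that scans the given number of GiB."""
    return gib_scanned / 1024 * PRICE_PER_TIB_USD

# (description, runtime_seconds, gib_scanned)
queries = [
    ("narrow scan, single worker",     9.0,  0.5),
    ("narrow scan, warm cache",        2.0,  0.5),
    ("full scan, heavy parallelism",   2.0, 36.0),
]

for name, seconds, gib in queries:
    print(f"{name}: {seconds:>4}s  ${cost_usd(gib):.4f}")

# The last two queries tie on the clock (2 s each), yet the full
# scan costs 72x more -- the meter runs on bytes, not seconds.
```

The second and third queries are indistinguishable on a latency dashboard, which is precisely why the paper argues for watching bytes scanned instead.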

3. The "Wild Card" Problem

Some of the Standard Models were unpredictable.

  • The Outlier: One model (GPT-5.1) had a "bad day" where it generated a query that scanned 36 GB of data for a single question. That's 20 times more than the best model!
  • Why? It forgot to filter by date (missing partition filters) or grabbed unnecessary columns (like SELECT *).
  • Analogy: It's like asking a waiter, "How many people ordered pizza?" and the waiter goes to every single table in the restaurant, asks everyone what they ate, and brings you a list of 500 people, even though only 5 ordered pizza.

4. The Mistakes They Made

The researchers found common "bad habits" in the Standard models:

  • SELECT * (The "Grab Everything" habit): Instead of asking for just the "Name" column, the AI asked for the "Name," "Address," "Phone," "Email," and "Full Biography" columns.
  • Missing Filters: Asking for "All questions ever" instead of "Questions from 2023."
  • Cartesian Products: Accidentally creating a "Cross Join," which is like taking every user and pairing them with every single post, creating a massive, useless list of combinations.
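All three habits are easy to catch with a crude static check before a generated query ever runs. Here is a minimal sketch; the regexes are illustrative heuristics (a real guard would use a SQL parser), and the partition column name `creation_date` is a hypothetical example:

```python
import re

def lint_generated_sql(sql: str, partition_column: str = "creation_date") -> list[str]:
    """Flag the three common cost anti-patterns in a generated query.
    Heuristic string checks only -- a real guard would parse the SQL."""
    warnings = []
    if re.search(r"\bselect\s+\*", sql, re.IGNORECASE):
        warnings.append("SELECT * scans every column; name only the columns you need")
    if partition_column not in sql.lower():
        warnings.append(f"no filter on partition column '{partition_column}'")
    if re.search(r"\bcross\s+join\b", sql, re.IGNORECASE):
        warnings.append("CROSS JOIN can explode into a Cartesian product")
    return warnings

bad = "SELECT * FROM posts CROSS JOIN users"
print(lint_generated_sql(bad))  # flags all three habits
```

A query like `SELECT title FROM posts WHERE creation_date >= '2023-01-01'` passes cleanly, which is the shape the Reasoning models tended to produce.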

💡 What Should You Do? (The Takeaway)

If you are building an AI system that talks to databases (Text-to-SQL) for a business, here is the advice from the paper:

  1. Pick the "Thinkers": Use Reasoning Models. They might cost a tiny bit more to think (inference cost), but they save you a huge amount of money on execution (database cost).
  2. Stop Watching the Clock: Don't just look at how fast the query runs. A fast query can still bankrupt your budget. Look at how much data was scanned.
  3. Install Safety Guards: Set up automatic rules that say, "If a query tries to scan more than 10 GB, stop it!" This prevents the AI from accidentally ordering a 36 GB pizza.
  4. Check for Bad Habits: Make sure the AI isn't using SELECT * or forgetting to filter by dates.
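The "safety guard" in point 3 can be sketched as a pre-execution budget check. In BigQuery the same guard exists natively (a dry run returns the scan estimate, and the `maximum_bytes_billed` job setting hard-caps a query); the pure-Python version below just shows the shape of the check, with an assumed default cap of 10 GiB:

```python
# Sketch of takeaway #3: refuse to run any query whose estimated
# scan exceeds a budget. The 10 GiB default cap is an assumption
# for illustration; tune it to your own workload.
GIB = 2**30

class ScanBudgetExceeded(Exception):
    """Raised when a query's estimated scan is over budget."""

def enforce_scan_budget(estimated_bytes: int, cap_bytes: int = 10 * GIB) -> int:
    """Check a dry-run scan estimate against the budget before executing."""
    if estimated_bytes > cap_bytes:
        raise ScanBudgetExceeded(
            f"query would scan {estimated_bytes / GIB:.1f} GiB, "
            f"cap is {cap_bytes / GIB:.1f} GiB"
        )
    return estimated_bytes

enforce_scan_budget(2 * GIB)      # within budget, proceeds
# enforce_scan_budget(36 * GIB)   # raises ScanBudgetExceeded
```

Wiring this in front of the database is what stops a model's "bad day" from turning into a 36 GB bill.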

🏁 In a Nutshell

This paper proves that in the cloud, being smart is cheaper than being fast. The AI models that take a moment to "think" before they speak generate cleaner, more efficient code that saves companies real money, while the "fast" models often waste resources by scanning way more data than necessary.