Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

This paper introduces novel "Text-to-Big SQL" evaluation metrics to address the limitations of existing benchmarks in assessing production-level LLM agents, demonstrating that traditional Text-to-SQL metrics fail to capture critical cost, latency, and efficiency implications that arise when scaling to large datasets.

Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

Published Mon, 09 Ma

Imagine you are a chef (the AI Agent) trying to cook a meal based on a customer's order written in plain English (the Text-to-SQL problem).

For years, researchers have tested chefs by giving them recipes for small, home-cooked meals. They only cared about one thing: Did the final dish taste exactly like the picture on the menu? If the chef added an extra pinch of salt or a garnish the customer didn't ask for, the dish was marked as "failed."

But in the real world, big restaurants (like Big Data systems) don't just care about taste. They care about the cost of ingredients, the time it takes to cook, and the waste generated.

This paper, titled "Both Ends Count!", argues that we've been testing AI chefs with the wrong rules. Here is the breakdown in simple terms:

1. The Problem: The "Small Pot" vs. The "Industrial Cauldron"

In the past, AI was tested on small databases (like a home pantry). If an AI wrote a SQL query (a cooking instruction) that was slightly wrong, it just took a second to fix. It was cheap and fast.

But today, companies use Big Data (massive industrial cauldrons).

  • The Cost of a Mistake: If an AI writes a query that scans the wrong data in a massive system, it doesn't just take a second; it might take hours and cost hundreds of dollars in cloud computing fees.
  • The "Extra Ingredient" Issue: If an AI adds one extra column of data that isn't needed, a small system ignores it. But in a massive system, that extra column means scanning gigabytes of unnecessary data, burning money and time.

The Analogy:
Imagine you ask a robot to "Find me the price of apples."

  • Small System: The robot accidentally lists the price of oranges too. You just ignore the oranges. No big deal.
  • Big System: The robot has to physically drive a truck to a warehouse 1,000 miles away to check the price. If it checks the oranges and the apples, it burns double the gas. The mistake isn't just "wrong"; it's expensive.
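The "extra ingredient" effect is easy to quantify in a columnar engine, where the bytes a query scans grow with every column it touches. Here is a minimal sketch; the table size, column widths, and per-terabyte price are illustrative assumptions, not figures from the paper:

```python
def scan_cost_usd(rows, column_bytes, price_per_tb_usd=5.0):
    """Estimate query cost in a columnar store: you pay for every
    byte of every column the query reads, across all rows."""
    bytes_scanned = rows * sum(column_bytes)
    return bytes_scanned / 1e12 * price_per_tb_usd

ROWS = 2_000_000_000  # hypothetical 2-billion-row sales table

# "Find me the price of apples": one 8-byte price column.
lean = scan_cost_usd(ROWS, [8])

# Same query, but the AI also pulled a 200-byte description column.
wasteful = scan_cost_usd(ROWS, [8, 200])

print(f"lean: ${lean:.2f}, wasteful: ${wasteful:.2f}")
```

On a home-pantry-sized table both numbers round to zero; at billions of rows the single unneeded column multiplies the bill, which is exactly the asymmetry the paper is pointing at.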

2. The New Solution: "Text-to-Big SQL"

The authors say we need a new way to grade these AI chefs. They call it "Text-to-Big SQL." Instead of just asking "Is the answer right?", they ask:

  • How much did it cost? (Did the robot drive the truck unnecessarily?)
  • How long did it take? (Did the robot spend 10 minutes thinking before it started cooking?)
  • Was it efficient? (Did it bring back extra stuff we didn't need?)

They created new "report cards" (metrics) called VES* and VCES.

  • Old Report Card: "Pass/Fail." (Did you get the right answer?)
  • New Report Card: "Pass/Fail + Cost + Speed." (Did you get the right answer, and did you do it without burning the budget?)
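The paper's exact VES* and VCES formulas are not reproduced here, but the idea of a "Pass/Fail + Cost + Speed" report card can be sketched in the spirit of efficiency-weighted scoring: a wrong answer earns zero, and a correct answer is discounted by how much slower or costlier it was than a reference query. The function name and scoring shape below are illustrative assumptions:

```python
import math

def efficiency_weighted_score(runs):
    """runs: list of (is_correct, reference_cost, actual_cost).
    A correct query earns sqrt(reference / actual): full credit at
    the reference cost, less if slower or costlier, more if it
    beats the reference. Wrong answers earn nothing."""
    total = 0.0
    for correct, ref, actual in runs:
        if correct:
            total += math.sqrt(ref / actual)
    return total / len(runs)

runs = [
    (True, 1.0, 1.0),   # right answer at reference cost -> 1.0
    (True, 1.0, 4.0),   # right answer, 4x the cost      -> 0.5
    (False, 1.0, 1.0),  # wrong answer                   -> 0.0
]
print(efficiency_weighted_score(runs))  # 0.5
```

Notice that a model which is always "right" but always 4x over budget scores the same here as a model that is right only half the time at reference cost: accuracy alone no longer dominates the grade.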

3. What They Found (The Plot Twist)

The researchers tested the world's smartest AI models (like GPT-4o, Claude Opus, and Gemini) using these new rules. Here is what they discovered:

  • Accuracy is a Trap: Some models were "perfect" at getting the right answer but took 10 times longer to think about it. In a big data world, being 100% accurate but slow is actually a failure because you wasted money waiting.
  • The "Fast and Cheap" Winner: Some models were slightly less accurate but much faster and cheaper. In the real world, these might actually be the better choice because you can afford to run them twice if they get it wrong, rather than paying for one slow, expensive run.
  • The "Thinking" Bottleneck: The AI spends a lot of time "thinking" (reasoning) and talking to its tools before it even runs the query. In Big Data, if the AI takes longer to think than the computer takes to run the query, the whole system is broken.
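The "afford to run them twice" argument becomes concrete as a back-of-the-envelope expected-cost comparison. The accuracies and per-run prices below are made-up illustrations, not the paper's measurements, and the sketch assumes failures are detectable and retries are independent:

```python
def expected_cost(accuracy, cost_per_run, max_attempts=2):
    """Expected spend per query when a failed run is retried:
    each attempt is paid for only if all earlier attempts failed."""
    cost, p_reached = 0.0, 1.0
    for _ in range(max_attempts):
        cost += p_reached * cost_per_run
        p_reached *= (1 - accuracy)
    return cost

slow_accurate = expected_cost(accuracy=1.0, cost_per_run=10.0)  # $10.00
fast_cheap    = expected_cost(accuracy=0.9, cost_per_run=2.0)   # $2.20
print(slow_accurate, fast_cheap)
```

Under these toy numbers, the 90%-accurate model costs $2.20 per query even after paying for its occasional retry, less than a quarter of the always-correct-but-expensive model.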

4. The "Scale" Effect

The paper shows that as data gets bigger, the stakes get higher.

  • Small Data: A 10% error rate in the AI is annoying.
  • Huge Data: That same 10% error rate becomes a financial disaster. If the AI fails 1 out of 10 times and each failed run costs $100, a thousand queries burn $10,000 on failures alone.

The authors found that standard tests completely miss this. They are like judging a Formula 1 car by how well it drives in a parking lot. Just because it works in the lot doesn't mean it won't crash on the highway.

Summary: The Takeaway

This paper is a wake-up call. We have been treating AI like a student taking a math test (Right or Wrong). But in the real world of Big Data, AI is more like a logistics manager.

  • Old Way: "Did you deliver the package?" (Yes/No)
  • New Way: "Did you deliver the package, did you use the most fuel-efficient route, and did you arrive on time without burning the budget?"

The authors argue that to make AI useful for big businesses, we need to stop grading them on "perfect answers" and start grading them on efficiency, cost, and speed. If an AI is 90% accurate but saves you 50% of your money, it's a better chef than the one who is 100% accurate but bankrupts the kitchen.