Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

This paper introduces novel "Text-to-Big SQL" evaluation metrics to address the limitations of existing benchmarks in assessing production-level LLM agents, demonstrating that traditional Text-to-SQL metrics fail to capture critical cost, latency, and efficiency implications that arise when scaling to large datasets.

Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

Published Mon, 09 Ma

Imagine you are a chef (the AI Agent) trying to cook a meal based on a customer's order written in plain English (the Text-to-SQL problem).

For years, researchers have tested chefs by giving them recipes for small, home-cooked meals. They only cared about one thing: Did the final dish taste exactly like the picture on the menu? If the chef added an extra pinch of salt or a garnish the customer didn't ask for, the dish was marked as "failed."

But in the real world, big restaurants (like Big Data systems) don't just care about taste. They care about the cost of ingredients, the time it takes to cook, and the waste generated.

This paper, titled "Both Ends Count!", argues that we've been testing AI chefs with the wrong rules. Here is the breakdown in simple terms:

1. The Problem: The "Small Pot" vs. The "Industrial Cauldron"

In the past, AI was tested on small databases (like a home pantry). If an AI wrote a SQL query (a cooking instruction) that was slightly wrong, it just took a second to fix. It was cheap and fast.

But today, companies use Big Data (massive industrial cauldrons).

  • The Cost of a Mistake: If an AI writes a query that scans the wrong data in a massive system, it doesn't just take a second; it might take hours and cost hundreds of dollars in cloud computing fees.
  • The "Extra Ingredient" Issue: If an AI adds one extra column of data that isn't needed, a small system ignores it. But in a massive system, that extra column means scanning gigabytes of unnecessary data, burning money and time.

The Analogy:
Imagine you ask a robot to "Find me the price of apples."

  • Small System: The robot accidentally lists the price of oranges too. You just ignore the oranges. No big deal.
  • Big System: The robot has to physically drive a truck to a warehouse 1,000 miles away to check the price. If it checks the oranges and the apples, it burns double the gas. The mistake isn't just "wrong"; it's expensive.
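The "extra ingredient" effect is easy to quantify in a columnar engine, where the bytes a query scans grow with every column it touches. Here is a minimal sketch; the table size, column widths, and per-terabyte price are illustrative assumptions, not figures from the paper:

```python
def scan_cost_usd(rows, column_bytes, price_per_tb_usd=5.0):
    """Estimate query cost in a columnar store: you pay for every
    byte of every column the query reads, across all rows."""
    bytes_scanned = rows * sum(column_bytes)
    return bytes_scanned / 1e12 * price_per_tb_usd

ROWS = 2_000_000_000  # hypothetical 2-billion-row sales table

# "Find me the price of apples": one 8-byte price column.
lean = scan_cost_usd(ROWS, [8])

# Same query, but the AI also pulled a 200-byte description column.
wasteful = scan_cost_usd(ROWS, [8, 200])

print(f"lean: ${lean:.2f}, wasteful: ${wasteful:.2f}")
```

On a home-pantry-sized table both numbers round to zero; at billions of rows the single unneeded column multiplies the bill, which is exactly the asymmetry the paper is pointing at.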

2. The New Solution: "Text-to-Big SQL"

The authors say we need a new way to grade these AI chefs. They call it "Text-to-Big SQL." Instead of just asking "Is the answer right?", they ask:

  • How much did it cost? (Did the robot drive the truck unnecessarily?)
  • How long did it take? (Did the robot spend 10 minutes thinking before it started cooking?)
  • Was it efficient? (Did it bring back extra stuff we didn't need?)

They created new "report cards" (metrics) called VES* and VCES.

  • Old Report Card: "Pass/Fail." (Did you get the right answer?)
  • New Report Card: "Pass/Fail + Cost + Speed." (Did you get the right answer, and did you do it without burning the budget?)
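The paper's exact VES* and VCES formulas are not reproduced here, but the idea of a "Pass/Fail + Cost + Speed" report card can be sketched in the spirit of efficiency-weighted scoring: a wrong answer earns zero, and a correct answer is discounted by how much slower or costlier it was than a reference query. The function name and scoring shape below are illustrative assumptions:

```python
import math

def efficiency_weighted_score(runs):
    """runs: list of (is_correct, reference_cost, actual_cost).
    A correct query earns sqrt(reference / actual): full credit at
    the reference cost, less if slower or costlier, more if it
    beats the reference. Wrong answers earn nothing."""
    total = 0.0
    for correct, ref, actual in runs:
        if correct:
            total += math.sqrt(ref / actual)
    return total / len(runs)

runs = [
    (True, 1.0, 1.0),   # right answer at reference cost -> 1.0
    (True, 1.0, 4.0),   # right answer, 4x the cost      -> 0.5
    (False, 1.0, 1.0),  # wrong answer                   -> 0.0
]
print(efficiency_weighted_score(runs))  # 0.5
```

Notice that a model which is always "right" but always 4x over budget scores the same here as a model that is right only half the time at reference cost: accuracy alone no longer dominates the grade.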

3. What They Found (The Plot Twist)

The researchers tested the world's smartest AI models (like GPT-4o, Claude Opus, and Gemini) using these new rules. Here is what they discovered:

  • Accuracy is a Trap: Some models were "perfect" at getting the right answer but took 10 times longer to think about it. In a big data world, being 100% accurate but slow is actually a failure because you wasted money waiting.
  • The "Fast and Cheap" Winner: Some models were slightly less accurate but much faster and cheaper. In the real world, these might actually be the better choice because you can afford to run them twice if they get it wrong, rather than paying for one slow, expensive run.
  • The "Thinking" Bottleneck: The AI spends a lot of time "thinking" (reasoning) and talking to its tools before it even runs the query. In Big Data, if the AI takes longer to think than the computer takes to run the query, the whole system is broken.
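The "afford to run them twice" argument becomes concrete as a back-of-the-envelope expected-cost comparison. The accuracies and per-run prices below are made-up illustrations, not the paper's measurements, and the sketch assumes failures are detectable and retries are independent:

```python
def expected_cost(accuracy, cost_per_run, max_attempts=2):
    """Expected spend per query when a failed run is retried:
    each attempt is paid for only if all earlier attempts failed."""
    cost, p_reached = 0.0, 1.0
    for _ in range(max_attempts):
        cost += p_reached * cost_per_run
        p_reached *= (1 - accuracy)
    return cost

slow_accurate = expected_cost(accuracy=1.0, cost_per_run=10.0)  # $10.00
fast_cheap    = expected_cost(accuracy=0.9, cost_per_run=2.0)   # $2.20
print(slow_accurate, fast_cheap)
```

Under these toy numbers, the 90%-accurate model costs $2.20 per query even after paying for its occasional retry, less than a quarter of the always-correct-but-expensive model.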

4. The "Scale" Effect

The paper shows that as data gets bigger, the stakes get higher.

  • Small Data: A 10% error rate in the AI is annoying.
  • Huge Data: That same 10% error rate becomes a financial disaster. If the AI fails 1 out of 10 times and each failed run costs $100, a thousand queries burn $10,000 on failures alone.

The authors found that standard tests completely miss this. They are like judging a Formula 1 car by how well it drives in a parking lot. Just because it works in the lot doesn't mean it won't crash on the highway.

Summary: The Takeaway

This paper is a wake-up call. We have been treating AI like a student taking a math test (Right or Wrong). But in the real world of Big Data, AI is more like a logistics manager.

  • Old Way: "Did you deliver the package?" (Yes/No)
  • New Way: "Did you deliver the package, did you use the most fuel-efficient route, and did you arrive on time without burning the budget?"

The authors argue that to make AI useful for big businesses, we need to stop grading them on "perfect answers" and start grading them on efficiency, cost, and speed. If an AI is 90% accurate but saves you 50% of your money, it's a better chef than the one who is 100% accurate but bankrupts the kitchen.