Imagine you are a chef trying to teach a robot how to cook. You give the robot a recipe written in plain English (like "Make a salad with fresh tomatoes and cucumbers"). The robot translates this into a specific set of instructions for a machine (like a SQL query for a database).
To see if the robot did a good job, you usually taste the salad. If the salad looks and tastes like the "Gold Standard" salad (the one made by a human expert), you say, "Great job!" and give the robot a high score.
The Problem with Just Tasting:
The paper argues that this "tasting" method is flawed. Imagine the Gold Standard salad has 3 tomatoes and 2 cucumbers, and the robot's salad also has 3 tomatoes and 2 cucumbers, so they taste identical on that specific day. But what if the robot's recipe actually called for 300 tomatoes and 200 cucumbers, and the kitchen simply ran out at 3 and 2? The salads matched, but the recipes do not: the robot got away with a mistake because the test kitchen was too small to reveal the error.
In the world of databases, this happens all the time. Two different computer queries can produce the exact same result on a small, static test database, even though they are logically completely different. Current evaluation methods are like tasting a salad in a tiny kitchen; they miss the big mistakes.
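To make the "tiny kitchen" problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema and data are made up for illustration (they are not from the paper or the BIRD benchmark): two logically different SQL queries happen to return the same rows on a small test database, and only a slightly larger database tells them apart.

```python
import sqlite3

# Hypothetical toy table; any resemblance to a real benchmark is accidental.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
con.executemany("INSERT INTO staff VALUES (?, ?)",
                [("Ann", 90000), ("Bob", 60000), ("Cy", 40000)])

# Two logically different queries: one filters by a salary threshold,
# the other simply takes the two highest earners.
q1 = "SELECT name FROM staff WHERE salary > 50000"
q2 = "SELECT name FROM staff ORDER BY salary DESC LIMIT 2"

r1 = sorted(con.execute(q1).fetchall())
r2 = sorted(con.execute(q2).fetchall())
print(r1 == r2)  # True: on this tiny database the queries coincide

# A slightly bigger "kitchen" exposes the difference.
con.execute("INSERT INTO staff VALUES ('Dee', 70000)")
r1 = sorted(con.execute(q1).fetchall())
r2 = sorted(con.execute(q2).fetchall())
print(r1 == r2)  # False: three people earn > 50000, but LIMIT 2 returns two
```

An evaluator that only compares results on the first database would mark these two queries "equivalent", which is exactly the failure mode the paper describes.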
The Solution: SPOTIT (The "What-If" Detective)
The authors created a new tool called SPOTIT. Instead of just tasting the salad, SPOTIT is like a super-smart detective who asks, "Is there any possible kitchen in the universe where these two recipes would produce different results?"
- The Search: SPOTIT uses a powerful mathematical engine (called formal verification) to actively search for a "counter-example." It tries to build a hypothetical database (a different kitchen) where the robot's recipe and the human's recipe would fail to match.
- The Verdict:
- If SPOTIT finds a kitchen where they differ, it proves the robot is wrong (or the human's recipe is wrong).
- If SPOTIT searches every possible small kitchen and can't find a difference, it gives the robot a "Gold Star" of true equivalence.
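The search-and-verdict idea above can be sketched with a deliberately naive stand-in: instead of the paper's formal-verification engine, this toy version simply enumerates every small database (every subset of some candidate rows) and looks for one where two queries disagree. All names and data here are hypothetical; SPOTIT itself does not work by brute-force enumeration.

```python
from itertools import chain, combinations

# Candidate rows from which every "small kitchen" (database) is built.
candidates = [("Ann", 90000), ("Bob", 60000), ("Cy", 40000), ("Dee", 70000)]

def q_filter(rows):   # "names with salary above 50000"
    return sorted(n for n, s in rows if s > 50000)

def q_top_two(rows):  # "names of the two highest earners"
    return sorted(n for n, s in sorted(rows, key=lambda r: -r[1])[:2])

def find_counterexample(q1, q2, candidates):
    # Try every subset of candidate rows, smallest databases first.
    for rows in chain.from_iterable(
            combinations(candidates, k) for k in range(len(candidates) + 1)):
        if q1(rows) != q2(rows):
            return rows   # a database where the queries differ: not equivalent
    return None           # no difference found in any database searched

print(find_counterexample(q_filter, q_top_two, candidates))
# → (('Cy', 40000),): a single low earner is enough to tell the queries apart
```

A real verifier reasons symbolically over all databases up to a bound rather than enumerating them, but the verdict structure is the same: a returned counter-example proves non-equivalence, while an exhausted search certifies equivalence within the bound.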
What They Discovered (The Plot Twist)
When the researchers used SPOTIT to re-evaluate the top 10 robot chefs (Text-to-SQL models) on a famous dataset called BIRD, they found some shocking things:
- The Scores Dropped: The robots looked much less impressive. Many queries that were previously marked "Correct" were actually "Wrong" because they only worked by luck on the small test data. The accuracy of these robots dropped by about 11–14%.
- The Human Was Wrong Too: This is the biggest surprise. SPOTIT found that in many cases, the robot was actually right, and the "Gold Standard" human recipe was wrong! The human annotators had made mistakes in their own SQL code, but because the test data was small, no one noticed.
- The Question Was Confusing: Sometimes the natural language question was ambiguous (e.g., "How many members?" could mean "everyone in the club" or "only people with the title 'Member'"). The robot and the human wrote different queries, and both could be right depending on how you interpreted the question.
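The "How many members?" ambiguity in the last bullet can be shown directly in SQL. The table and data below are invented for illustration (not taken from BIRD); the two queries are two defensible readings of the same question.

```python
import sqlite3

# Hypothetical club roster: everyone belongs, but only some hold the
# literal title 'Member'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE club (name TEXT, title TEXT)")
con.executemany("INSERT INTO club VALUES (?, ?)",
                [("Ann", "President"), ("Bob", "Member"),
                 ("Cy", "Member"), ("Dee", "Treasurer")])

# Reading 1: "members" = everyone in the club.
everyone = con.execute("SELECT COUNT(*) FROM club").fetchone()[0]

# Reading 2: "members" = only rows whose title is exactly 'Member'.
titled = con.execute(
    "SELECT COUNT(*) FROM club WHERE title = 'Member'").fetchone()[0]

print(everyone, titled)  # 4 2 -- two different, equally defensible answers
```

If the Gold Standard annotator chose one reading and the model chose the other, result comparison marks the model wrong even though the question, not the query, is the problem.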
Why This Matters
Think of the current way we test AI as a multiple-choice quiz graded against a fixed answer key: if the AI picks the right letter, it gets a point, even if it guessed. SPOTIT is like asking the AI to prove why its answer is right, or to find a scenario where it would be wrong.
The Takeaway:
The paper tells us that we can't just trust the current "taste tests" for AI. We need to use "What-If" detectives like SPOTIT to ensure that AI is truly understanding the logic, not just memorizing the answers for a specific, tiny test. It turns out, the "Gold Standard" answers we've been using for years are full of hidden errors, and the AI might actually be smarter than we thought in some cases!