SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

SpotIt+ is an open-source tool that evaluates Text-to-SQL systems by actively generating database counterexamples to verify query equivalence. A novel constraint-mining pipeline ensures these differentiating instances reflect practically relevant discrepancies that standard testing misses.

Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu

Published 2026-03-05

Imagine you are a teacher grading a student's homework. The assignment is to write a computer program (specifically, a SQL query) that answers a question about a database, like "Show me all customers who live in New York."

The student submits their answer. To grade it, you have two main ways to check if they are right:

1. The Old Way: The "Sample Test" (Test-Based Evaluation)

This is how most people currently check SQL answers. You take the student's code and run it on one specific, small list of fake data that you prepared in advance.

  • The Analogy: Imagine you ask a student to sort a deck of cards. You hand them a specific deck of 52 cards. They sort it, and it looks perfect. You give them an A.
  • The Problem: What if the student used a weird trick that only works on that specific deck? If you gave them a different deck (maybe one with a missing card or a duplicate), their trick would fail, but you'd never know because you only tested the one deck. In the world of databases, this means a student might get an "A" for a query that is actually broken, just because it got lucky with the specific data you gave them.
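The "lucky on one deck" failure mode can be sketched in a few lines of Python with an in-memory SQLite database. The table, rows, and queries below are hypothetical illustrations, not from the paper's benchmark: a buggy query agrees with the gold query on the sample data, so test-based grading passes it, until a new row exposes the difference.

```python
import sqlite3

# Toy stand-in for test-based evaluation: one small, fixed sample database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ann", "New York"), ("Bob", "Boston")])

gold = "SELECT name FROM customers WHERE city = 'New York'"
# Buggy "student" query: matches ANY city starting with 'New'.
student = "SELECT name FROM customers WHERE city LIKE 'New%'"

# On the sample data both return [('Ann',)], so the student gets an "A".
print(conn.execute(gold).fetchall() == conn.execute(student).fetchall())

# But one extra row reveals the bug.
conn.execute("INSERT INTO customers VALUES ('Cat', 'Newark')")
print(conn.execute(gold).fetchall() == conn.execute(student).fetchall())
```

The first comparison prints `True` and the second `False`: the grade depended entirely on which "deck" the teacher happened to hand out.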

2. The New Way: The "Logic Detective" (Verification-Based Evaluation)

This is what the paper's authors are doing with their new tool, SpotIt+. Instead of just running the code on one list of data, they ask a logic engine: "Is there ANY possible list of data in the universe where this student's answer would be different from the correct answer?"

  • The Analogy: Instead of just checking the one deck of cards, the teacher asks a detective, "Can you imagine any deck of cards where this sorting method fails?" If the detective finds even one weird deck where the student's code breaks, the student gets a failing grade. This is much stricter and fairer.
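To make the "logic detective" idea concrete, here is a deliberately tiny stand-in: the two queries are modeled as Python predicates over a single row, and instead of a real solver we brute-force a small search space looking for any row where they disagree. The column names, values, and search strategy are all illustrative assumptions; the actual tool uses a symbolic logic engine, not enumeration.

```python
from itertools import product

# Two queries modeled as predicates over a hypothetical customer row.
def gold(row):
    return row["city"] == "New York"

def student(row):                       # buggy: prefix match, not equality
    return row["city"].startswith("New")

# Brute-force "detective": search many candidate rows for a disagreement.
# (A real verifier explores the infinite space symbolically with a solver.)
cities = ["New York", "Boston", "Newark"]
ages = range(0, 3)
counterexample = next(
    (row for row in ({"city": c, "age": a} for c, a in product(cities, ages))
     if gold(row) != student(row)),
    None,
)
print(counterexample)  # a row where the two queries give different answers
```

If any such row exists, the student's query is not equivalent to the gold query, no matter how many sample tests it passed.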

The Big Problem with the "Logic Detective"

The "Logic Detective" is so smart that it can find any difference, even differences that would never happen in real life.

  • The Analogy: Imagine the detective finds a counterexample where a person's age is -500 years or their name is a string of random symbols like @#$%. The detective says, "Aha! Your code fails if someone is -500 years old!"
  • The Issue: While technically true, that's a silly scenario. In the real world, people aren't -500 years old. If we use these "silly" counterexamples to grade students, we might unfairly punish them for not handling impossible situations. We want the detective to find realistic mistakes, not impossible ones.

The Solution: SpotIt+ (The "Real-World Filter")

This is where SpotIt+ comes in. It adds a special step to the process: Constraint Mining.

  1. Learning the Rules: Before the detective starts searching, SpotIt+ looks at the example data (the "training deck") and learns the unwritten rules of the real world.
    • It learns: "Hearts are always red, spades are always black."
    • It learns: "People's ages are between 0 and 120."
    • It learns: "If a customer has an ID, they must have a name."
  2. The LLM Validator (The "Sense Check"): Sometimes, the rules SpotIt+ finds are too strict (e.g., "Everyone in the database is exactly 30 years old"). A Large Language Model (LLM)—basically a very smart AI—steps in to say, "Wait, that rule is too specific to this one list of data. In the real world, people can be 20 or 40 too. Let's relax that rule."
  3. The Final Check: Now, the Logic Detective searches for mistakes, but it is only allowed to imagine realistic scenarios. It can't use -500-year-olds or impossible names.
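The three steps above can be sketched in miniature. This is a hypothetical illustration of the idea, not the paper's implementation: we "mine" a simple range rule from sample data, relax it with some slack (playing the role of the LLM validator loosening an over-specific rule), and then use it to reject unrealistic counterexamples before they reach the detective.

```python
# Step 1 -- Learning the Rules: mine a range constraint from sample data.
sample_rows = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 61}]

def mine_range(rows, col, slack=20):
    # Step 2 -- Sense Check: widen the observed range by `slack`, standing in
    # for the LLM validator that relaxes rules too specific to one sample.
    lo = min(r[col] for r in rows) - slack
    hi = max(r[col] for r in rows) + slack
    return lambda row: lo <= row[col] <= hi

constraints = [mine_range(sample_rows, "age")]

# Step 3 -- Final Check: only constraint-satisfying rows count as realistic,
# so the detective may no longer propose a -500-year-old customer.
def realistic(row):
    return all(check(row) for check in constraints)

print(realistic({"name": "Eve", "age": 40}))    # plausible row passes
print(realistic({"name": "Eve", "age": -500}))  # impossible row is rejected
```

The first check prints `True` and the second `False`: the constraint filter keeps the search strict, but only over scenarios that could actually occur.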

Why This Matters

The paper tested this new tool on the BIRD dataset (a huge collection of real-world database questions).

  • The Result: SpotIt+ found that many "A" grades given by the old "Sample Test" method were actually wrong. The students' code worked on the sample data but would fail in the real world.
  • The Benefit: By using realistic constraints, SpotIt+ catches the real bugs without wasting time on impossible bugs. It's like a teacher who doesn't just check if the answer is right on one test, but checks if the student truly understands the subject so they can handle any real-world situation.

Summary

  • Old Method: "Does it work on this one specific list?" (Too easy, misses hidden bugs).
  • Basic Verification: "Does it work on any list, even impossible ones?" (Too hard, punishes students for silly edge cases).
  • SpotIt+: "Does it work on any realistic list?" (Just right. It finds the real mistakes that matter).

SpotIt+ is essentially a smarter, fairer grading system for AI that writes database code, ensuring that when we say an AI is "correct," it actually works in the messy, real world.