SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

SpotIt+ is an open-source tool that evaluates Text-to-SQL systems by actively generating database counterexamples to verify query equivalence. A novel constraint-mining pipeline ensures these differentiating instances reflect practically relevant discrepancies that standard testing misses.

Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu

Published 2026-03-05

Imagine you are a teacher grading a student's homework. The assignment is to write a computer program (specifically, a SQL query) that answers a question about a database, like "Show me all customers who live in New York."

The student submits their answer. To grade it, you have two main ways to check if they are right:

1. The Old Way: The "Sample Test" (Test-Based Evaluation)

This is how most people currently check SQL answers. You take the student's code and run it on one specific, small list of fake data that you prepared in advance.

  • The Analogy: Imagine you ask a student to sort a deck of cards. You hand them a specific deck of 52 cards. They sort it, and it looks perfect. You give them an A.
  • The Problem: What if the student used a weird trick that only works on that specific deck? If you gave them a different deck (maybe one with a missing card or a duplicate), their trick would fail, but you'd never know because you only tested the one deck. In the world of databases, this means a student might get an "A" for a query that is actually broken, just because it got lucky with the specific data you gave them.
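The "lucky on one deck" failure mode can be sketched in a few lines of Python with an in-memory SQLite database. The table, rows, and queries below are hypothetical illustrations, not from the paper's benchmark: a buggy query agrees with the gold query on the sample data, so test-based grading passes it, until a new row exposes the difference.

```python
import sqlite3

# Toy stand-in for test-based evaluation: one small, fixed sample database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ann", "New York"), ("Bob", "Boston")])

gold = "SELECT name FROM customers WHERE city = 'New York'"
# Buggy "student" query: matches ANY city starting with 'New'.
student = "SELECT name FROM customers WHERE city LIKE 'New%'"

# On the sample data both return [('Ann',)], so the student gets an "A".
print(conn.execute(gold).fetchall() == conn.execute(student).fetchall())

# But one extra row reveals the bug.
conn.execute("INSERT INTO customers VALUES ('Cat', 'Newark')")
print(conn.execute(gold).fetchall() == conn.execute(student).fetchall())
```

The first comparison prints `True` and the second `False`: the grade depended entirely on which "deck" the teacher happened to hand out.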

2. The New Way: The "Logic Detective" (Verification-Based Evaluation)

This is what the paper's authors are doing with their new tool, SpotIt+. Instead of just running the code on one list of data, they ask a logic engine: "Is there ANY possible list of data in the universe where this student's answer would be different from the correct answer?"

  • The Analogy: Instead of just checking the one deck of cards, the teacher asks a detective, "Can you imagine any deck of cards where this sorting method fails?" If the detective finds even one weird deck where the student's code breaks, the student gets a failing grade. This is much stricter and fairer.
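To make the "logic detective" idea concrete, here is a deliberately tiny stand-in: the two queries are modeled as Python predicates over a single row, and instead of a real solver we brute-force a small search space looking for any row where they disagree. The column names, values, and search strategy are all illustrative assumptions; the actual tool uses a symbolic logic engine, not enumeration.

```python
from itertools import product

# Two queries modeled as predicates over a hypothetical customer row.
def gold(row):
    return row["city"] == "New York"

def student(row):                       # buggy: prefix match, not equality
    return row["city"].startswith("New")

# Brute-force "detective": search many candidate rows for a disagreement.
# (A real verifier explores the infinite space symbolically with a solver.)
cities = ["New York", "Boston", "Newark"]
ages = range(0, 3)
counterexample = next(
    (row for row in ({"city": c, "age": a} for c, a in product(cities, ages))
     if gold(row) != student(row)),
    None,
)
print(counterexample)  # a row where the two queries give different answers
```

If any such row exists, the student's query is not equivalent to the gold query, no matter how many sample tests it passed.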

The Big Problem with the "Logic Detective"

The "Logic Detective" is so smart that it can find any difference, even differences that would never happen in real life.

  • The Analogy: Imagine the detective finds a counterexample where a person's age is -500 years or their name is a string of random symbols like @#$%. The detective says, "Aha! Your code fails if someone is -500 years old!"
  • The Issue: While technically true, that's a silly scenario. In the real world, people aren't -500 years old. If we use these "silly" counterexamples to grade students, we might unfairly punish them for not handling impossible situations. We want the detective to find realistic mistakes, not impossible ones.

The Solution: SpotIt+ (The "Real-World Filter")

This is where SpotIt+ comes in. It adds a special step to the process: Constraint Mining.

  1. Learning the Rules: Before the detective starts searching, SpotIt+ looks at the example data (the "training deck") and learns the unwritten rules of the real world.
    • It learns: "Hearts are always red, spades are always black."
    • It learns: "People's ages are between 0 and 120."
    • It learns: "If a customer has an ID, they must have a name."
  2. The LLM Validator (The "Sense Check"): Sometimes, the rules SpotIt+ finds are too strict (e.g., "Everyone in the database is exactly 30 years old"). A Large Language Model (LLM)—basically a very smart AI—steps in to say, "Wait, that rule is too specific to this one list of data. In the real world, people can be 20 or 40 too. Let's relax that rule."
  3. The Final Check: Now, the Logic Detective searches for mistakes, but it is only allowed to imagine realistic scenarios. It can't use -500-year-olds or impossible names.
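The three steps above can be sketched in miniature. This is a hypothetical illustration of the idea, not the paper's implementation: we "mine" a simple range rule from sample data, relax it with some slack (playing the role of the LLM validator loosening an over-specific rule), and then use it to reject unrealistic counterexamples before they reach the detective.

```python
# Step 1 -- Learning the Rules: mine a range constraint from sample data.
sample_rows = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 61}]

def mine_range(rows, col, slack=20):
    # Step 2 -- Sense Check: widen the observed range by `slack`, standing in
    # for the LLM validator that relaxes rules too specific to one sample.
    lo = min(r[col] for r in rows) - slack
    hi = max(r[col] for r in rows) + slack
    return lambda row: lo <= row[col] <= hi

constraints = [mine_range(sample_rows, "age")]

# Step 3 -- Final Check: only constraint-satisfying rows count as realistic,
# so the detective may no longer propose a -500-year-old customer.
def realistic(row):
    return all(check(row) for check in constraints)

print(realistic({"name": "Eve", "age": 40}))    # plausible row passes
print(realistic({"name": "Eve", "age": -500}))  # impossible row is rejected
```

The first check prints `True` and the second `False`: the constraint filter keeps the search strict, but only over scenarios that could actually occur.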

Why This Matters

The paper tested this new tool on the BIRD dataset (a huge collection of real-world database questions).

  • The Result: SpotIt+ found that many "A" grades given by the old "Sample Test" method were actually wrong. The students' code worked on the sample data but would fail in the real world.
  • The Benefit: By using realistic constraints, SpotIt+ catches the real bugs without wasting time on impossible bugs. It's like a teacher who doesn't just check if the answer is right on one test, but checks if the student truly understands the subject so they can handle any real-world situation.

Summary

  • Old Method: "Does it work on this one specific list?" (Too easy, misses hidden bugs).
  • Basic Verification: "Does it work on any list, even impossible ones?" (Too hard, punishes students for silly edge cases).
  • SpotIt+: "Does it work on any realistic list?" (Just right. It finds the real mistakes that matter).

SpotIt+ is essentially a smarter, fairer grading system for AI that writes database code, ensuring that when we say an AI is "correct," it actually works in the messy, real world.