SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas

This paper introduces SQaLe, a large-scale semi-synthetic text-to-SQL dataset containing over 517,000 high-quality triples derived from 135,875 real-world database schemas, designed to overcome current data limitations and advance model generalization through realistic schema variability and diverse query patterns.

Cornelius Wolff, Daniel Gomm, Madelon Hulsebos

Published 2026-02-27

Imagine you have a massive library of books, but instead of reading them, you want to ask a librarian specific questions about the data inside them, like "How many books were published in 1990?" or "Which author has the most bestsellers?"

In the world of computers, this is called Text-to-SQL. You type a question in plain English, and the computer translates it into a special code (SQL) to dig up the answer from a database.
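
To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. The books table, its columns, and the sample rows are made up for illustration and do not come from the paper; the two SQL queries are what a Text-to-SQL model would be expected to produce for the librarian questions above.

```python
import sqlite3

# Hypothetical toy "library"; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author TEXT NOT NULL,
        year INTEGER NOT NULL,
        is_bestseller INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO books (title, author, year, is_bestseller) VALUES
        ('Book A', 'Ann', 1990, 1),
        ('Book B', 'Ann', 1990, 1),
        ('Book C', 'Bob', 1990, 0),
        ('Book D', 'Bob', 1985, 1);
""")

# Question: "How many books were published in 1990?"
count_1990 = conn.execute(
    "SELECT COUNT(*) FROM books WHERE year = 1990"
).fetchone()[0]

# Question: "Which author has the most bestsellers?"
top_author = conn.execute("""
    SELECT author
    FROM books
    WHERE is_bestseller = 1
    GROUP BY author
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()[0]

print(count_1990, top_author)
```

The hard part for the model is not the SQL syntax itself but picking the right tables and columns from a schema it has never seen before.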

The problem? The computers (AI models) we have today are like brilliant students who have only read a few small storybooks. They struggle when asked to navigate a massive, messy, real-world library with thousands of interconnected shelves and confusing labels. They need more practice, but the "practice books" available so far are too simple, too small, or too fake.

Enter SQaLe.

What is SQaLe?

Think of SQaLe as a massive, super-realistic training gym for these AI librarians.

The researchers built a dataset containing over 517,000 practice questions, each paired with its SQL answer and grounded in one of 135,875 database blueprints (schemas). But here's the magic: they didn't just make up fake libraries. They started with a collection of real-world database blueprints (called "SchemaPile") and used a smart AI assistant to expand them, creating new, complex, and realistic scenarios.

How did they build it? (The Construction Site Analogy)

Imagine you are an architect who wants to train a robot to build houses. You don't just give the robot a drawing of a cardboard box. You give it a pile of real blueprints for garages, then ask the robot to imagine adding a second floor, a guest wing, and a pool, while keeping the style consistent.

The SQaLe team did exactly this in three steps:

  1. The Foundation (Real Blueprints): They took real database structures from the wild. These weren't perfect, clean diagrams; they were messy, like real life, with weird naming conventions and missing connections.
  2. The Expansion (The "What If" Game): They used a powerful AI (Qwen3) to act as an "Architect." They told the AI: "Take this small garage blueprint and expand it into a massive mansion with 100 rooms, but keep the style the same. Add some tricky connections between rooms." This created huge, complex databases that didn't exist before but felt exactly like real ones.
  3. The Questions (The Tour Guide): Once the "mansion" was built, they asked the AI to act as a confused tourist. "How do I get from the kitchen to the attic?" or "Show me all the rooms with blue doors." The AI then generated the correct "map" (SQL code) to answer that question.
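
The end product of these three steps is a (question, schema, query) triple. Below is a minimal sketch of what one such triple might look like, plus one plausible sanity check: running the generated SQL against the empty schema to see whether it even refers to tables and columns that exist. The Triple class, its field names, the rooms table, and the check itself are assumptions for illustration; the summary does not describe how the authors actually filtered their data.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """One (question, schema, query) example; field names are hypothetical."""
    question: str
    schema: str  # CREATE TABLE statements
    query: str   # SQL that should answer the question


def query_runs_on_schema(triple: Triple) -> bool:
    """Return True if the query compiles and runs against the empty schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(triple.schema)
        conn.execute(triple.query).fetchall()
        return True
    except sqlite3.Error:
        # Unknown tables/columns or syntax errors end up here.
        return False
    finally:
        conn.close()


schema = """
CREATE TABLE rooms (id INTEGER PRIMARY KEY, name TEXT, door_color TEXT);
"""
good = Triple("Show me all the rooms with blue doors.", schema,
              "SELECT name FROM rooms WHERE door_color = 'blue'")
bad = Triple("Show me all the rooms with blue doors.", schema,
             "SELECT name FROM rooms WHERE colour = 'blue'")

print(query_runs_on_schema(good), query_runs_on_schema(bad))
```

A check like this only catches queries that cannot run at all. It says nothing about whether a runnable query really answers the question.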

Why is this a big deal?

Previous training sets were like toy train sets. They had a few tracks and a few trains. They were clean and predictable.

  • Spider 2.0 (a previous benchmark) was like a model village. Better, but still small.
  • SQaLe is like a real, sprawling metropolis.

Here is why SQaLe is special:

  • It's Messy (in a good way): Real databases have missing links and weird names. SQaLe includes these "imperfections" so the AI learns to handle real-world confusion, not just perfect textbook examples.
  • It's Huge: It has over 13 million "connections" (foreign keys) between tables. This forces the AI to learn how to navigate complex relationships, not just simple lookups.
  • It's Diverse: Some questions are easy ("What's the weather?"), and some are incredibly hard ("Find the average salary of employees who worked on projects that failed, grouped by department, excluding those hired before 2010").
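
To show what "incredibly hard" means in practice, here is a sketch of the failed-projects question written out as SQL over a small invented schema. Every table, column, and row below is hypothetical; the point is only that answering the question means following several foreign keys at once, which is the kind of navigation the bullets above describe.

```python
import sqlite3

# Hypothetical company schema; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        department_id INTEGER REFERENCES departments(id),
        salary REAL,
        hire_year INTEGER
    );
    CREATE TABLE projects (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE assignments (
        employee_id INTEGER REFERENCES employees(id),
        project_id INTEGER REFERENCES projects(id)
    );
    INSERT INTO departments VALUES (1, 'Eng'), (2, 'Ops');
    INSERT INTO employees VALUES
        (1, 1, 100.0, 2012), (2, 1, 200.0, 2015), (3, 1, 300.0, 2005),
        (4, 2, 120.0, 2011), (5, 2, 999.0, 2013);
    INSERT INTO projects VALUES (1, 'failed'), (2, 'succeeded'), (3, 'failed');
    INSERT INTO assignments VALUES (1, 1), (1, 3), (2, 1), (3, 1), (4, 3), (5, 2);
""")

# "Find the average salary of employees who worked on projects that failed,
#  grouped by department, excluding those hired before 2010."
# EXISTS keeps an employee with several failed projects from counting twice.
rows = conn.execute("""
    SELECT d.name, AVG(e.salary)
    FROM employees AS e
    JOIN departments AS d ON d.id = e.department_id
    WHERE e.hire_year >= 2010
      AND EXISTS (
          SELECT 1
          FROM assignments AS a
          JOIN projects AS p ON p.id = a.project_id
          WHERE a.employee_id = e.id AND p.status = 'failed'
      )
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

# Count the foreign keys ("connections") declared in this toy schema.
fk_count = sum(
    len(conn.execute(f"PRAGMA foreign_key_list({table})").fetchall())
    for table in ("departments", "employees", "projects", "assignments")
)

print(rows, fk_count)
```

Even this toy version touches four tables and three foreign keys. A naive join without the EXISTS clause would double-count the employee assigned to two failed projects, which is exactly the sort of subtle mistake a model has to learn to avoid.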

The Goal

The researchers believe that AI models tend to keep improving as they are trained on more data (the idea behind "scaling laws"). By giving these AI models a "university education" based on SQaLe instead of a "high school education" based on older datasets, they hope to create models that can finally handle the complex, messy databases used by actual businesses, hospitals, and governments.

In short: SQaLe is the ultimate "flight simulator" for AI. Instead of practicing in a calm, empty sky, it's training in a stormy, crowded airspace with real traffic, so when it finally flies a real plane (answers a real business question), it won't crash.
