SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas

Imagine you have a massive library of books, but instead of reading them, you want to ask a librarian specific questions about the data inside them, like "How many books were published in 1990?" or "Which author has the most bestsellers?"

In the world of computers, this is called Text-to-SQL. You type a question in plain English, and the computer translates it into a special code (SQL) to dig up the answer from a database.
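To make that concrete, here is a minimal sketch of a single Text-to-SQL example, using Python's built-in sqlite3 module. The `books` table, its columns, and its rows are invented for illustration and do not come from the SQALE dataset.

```python
import sqlite3

# Hypothetical toy database, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author TEXT NOT NULL,
        year_published INTEGER NOT NULL
    );
    INSERT INTO books (title, author, year_published) VALUES
        ('Book A', 'Author X', 1990),
        ('Book B', 'Author Y', 1990),
        ('Book C', 'Author X', 1985);
""")

# The question a person types, and the SQL a Text-to-SQL model should produce for it.
question = "How many books were published in 1990?"
sql = "SELECT COUNT(*) FROM books WHERE year_published = 1990;"

# Running the SQL against the database yields the answer to the question.
count_1990 = conn.execute(sql).fetchone()[0]
print(question, "->", count_1990)
```

The model's whole job is the step from `question` to `sql`; the database does the rest.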

The problem? The computers (AI models) we have today are like brilliant students who have only read a few small storybooks. They struggle when asked to navigate a massive, messy, real-world library with thousands of interconnected shelves and confusing labels. They need more practice, but the "practice books" available so far are too simple, too small, or too fake.

Enter SQALE.

What is SQALE?

Think of SQALE as a massive, super-realistic training gym for these AI librarians.

The researchers built a dataset containing over 517,000 practice questions, each paired with the SQL that answers it, spread across more than 135,000 different database blueprints (schemas). But here's the magic: they didn't just make up fake libraries. They started with a pile of real database blueprints from the real world (called "SchemaPile") and used a smart AI assistant to expand them, creating new, complex, and realistic scenarios.

How did they build it? (The Construction Site Analogy)

Imagine you are an architect who wants to train a robot to build houses. You don't just give the robot a drawing of a cardboard box. You give it a pile of real blueprints for garages, then ask the robot to imagine adding a second floor, extra rooms, and a pool, while keeping the style consistent.

The SQALE team did exactly this in three steps:

The Foundation (Real Blueprints): They took real database structures from the wild. These weren't perfect, clean diagrams; they were messy, like real life, with weird naming conventions and missing connections.
The Expansion (The "What If" Game): They used a powerful AI (Qwen3) to act as an "Architect." They told the AI: "Take this small garage blueprint and expand it into a massive mansion with 100 rooms, but keep the style the same. Add some tricky connections between rooms." This created huge, complex databases that didn't exist before but felt exactly like real ones.
The Questions (The Tour Guide): Once the "mansion" was built, they asked the AI to act as a confused tourist. "How do I get from the kitchen to the attic?" or "Show me all the rooms with blue doors." The AI then generated the correct "map" (SQL code) to answer that question.
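The three steps above can be sketched roughly as the following pipeline. This is a simplified illustration, not the authors' code: `extend_schema` and `synthesize_pairs` are hypothetical stand-ins for the AI calls, and the check that each query actually runs against the schema is an assumption added here so the sketch has something to verify.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    schema_ddl: str  # CREATE TABLE statements of the extended schema
    question: str    # natural-language question
    sql: str         # SQL query meant to answer the question


def runs_on_schema(schema_ddl: str, sql: str) -> bool:
    """Return True if the query executes against an empty SQLite copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


def build_examples(
    seed_ddl: str,
    extend_schema: Callable[[str], str],
    synthesize_pairs: Callable[[str], List[Tuple[str, str]]],
) -> List[Example]:
    extended = extend_schema(seed_ddl)   # step 2: grow the real schema
    pairs = synthesize_pairs(extended)   # step 3: questions plus SQL
    # Assumed filter: keep only pairs whose SQL runs on the extended schema.
    return [Example(extended, q, s) for q, s in pairs if runs_on_schema(extended, s)]


# Hypothetical stand-ins for the AI model, hard-coded for illustration.
def fake_extend(seed_ddl: str) -> str:
    return seed_ddl + (
        "\nCREATE TABLE loans ("
        "id INTEGER PRIMARY KEY, "
        "book_id INTEGER REFERENCES books(id), "
        "due_date TEXT);"
    )


def fake_synthesize(schema_ddl: str) -> List[Tuple[str, str]]:
    return [
        ("How many loans does each book have?",
         "SELECT b.title, COUNT(l.id) FROM books b "
         "LEFT JOIN loans l ON l.book_id = b.id GROUP BY b.title;"),
        # This query references a table that does not exist, so it is dropped.
        ("Which members have overdue loans?", "SELECT name FROM members;"),
    ]


seed = "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);"  # step 1: a "real" seed
examples = build_examples(seed, fake_extend, fake_synthesize)
print(len(examples), "example(s) kept")
```

Passing the two AI steps in as functions keeps the sketch runnable without any model: swap the stand-ins for real model calls and the rest stays the same.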

Why is this a big deal?

Previous training sets were like toy train sets. They had a few tracks and a few trains. They were clean and predictable.

Spider 2.0 (a previous benchmark) was like a model village. Better, but still small.
SQALE is like a real, sprawling metropolis.

Here is why SQALE is special:

It's Messy (in a good way): Real databases have missing links and weird names. SQALE includes these "imperfections" so the AI learns to handle real-world confusion, not just perfect textbook examples.
It's Huge: It has over 13 million "connections" (foreign keys) between tables. This forces the AI to learn how to navigate complex relationships, not just simple lookups.
It's Diverse: Some questions are easy ("What's the weather?"), and some are incredibly hard ("Find the average salary of employees who worked on projects that failed, grouped by department, excluding those hired before 2010").
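To show what the hard question above might look like as SQL, here is a small sketch using Python's built-in sqlite3 module. The tables `employees`, `projects`, and `assignments`, along with every row in them, are assumptions made up for this example; they are not taken from SQALE.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary REAL,
        hire_date TEXT  -- ISO dates, so string comparison sorts correctly
    );
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE assignments (
        employee_id INTEGER REFERENCES employees(id),
        project_id INTEGER REFERENCES projects(id)
    );
    INSERT INTO employees VALUES
        (1, 'Ana', 'Engineering', 100000, '2012-05-01'),
        (2, 'Ben', 'Engineering',  80000, '2015-03-10'),
        (3, 'Cy',  'Engineering', 120000, '2008-01-15'),
        (4, 'Di',  'Sales',        60000, '2011-07-20'),
        (5, 'Ed',  'Sales',        70000, '2013-09-01');
    INSERT INTO projects VALUES (1, 'Apollo', 'failed'), (2, 'Borealis', 'succeeded');
    INSERT INTO assignments VALUES (1, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 2);
""")

# "Find the average salary of employees who worked on projects that failed,
#  grouped by department, excluding those hired before 2010."
sql = """
    SELECT e.department, AVG(e.salary) AS avg_salary
    FROM employees AS e
    WHERE e.hire_date >= '2010-01-01'
      AND EXISTS (
          SELECT 1
          FROM assignments AS a
          JOIN projects AS p ON p.id = a.project_id
          WHERE a.employee_id = e.id AND p.status = 'failed'
      )
    GROUP BY e.department
    ORDER BY e.department;
"""
rows = conn.execute(sql).fetchall()
for department, avg_salary in rows:
    print(department, avg_salary)
```

The EXISTS subquery is one design choice among several: it counts each employee once even if they worked on more than one failed project, which a plain join would not.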

The Goal

The researchers believe that if you feed an AI more data, it gets smarter. This idea comes from "scaling laws," the observation that a model's performance tends to improve in a predictable way as it is given more training data, more parameters, and more computing power. By giving these AI models a "university education" based on SQALE instead of a "high school education" based on older datasets, they hope to create models that can finally handle the complex, messy databases used by actual businesses, hospitals, and governments.

In short: SQALE is the ultimate "flight simulator" for AI. Instead of practicing in a calm, empty sky, it's training in a stormy, crowded airspace with real traffic, so when it finally flies a real plane (answers a real business question), it won't crash.

Metric	SQALE	Spider 2.0	BIRD	SynSQL
# Schemas	135,875	236	80	16,575
Median Tables/Schema	91.0	7.0	5.0	10.0
Median Columns/Schema	435.0	89	39	72
Total Foreign Keys	13.2 million	0	526	159,547
Queries Using Joins	76.1%	72.0%	76.2%	89.4%
Query Diversity	High (Aggregations, Nested, Set Ops)	High	High	Moderate
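For readers curious how numbers like the ones in the table could be measured, here is a rough sketch that counts tables, columns, and foreign keys for a few schemas with Python's built-in sqlite3 module. The three schemas are invented toy examples, and this is only one plausible way to compute such statistics, not necessarily how the researchers did it.

```python
import sqlite3
from statistics import median


def schema_stats(ddl: str) -> dict:
    """Count tables, columns, and foreign keys in one schema given as SQL DDL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    columns = 0
    foreign_keys = 0
    for table in tables:
        columns += len(conn.execute(f'PRAGMA table_info("{table}")').fetchall())
        # foreign_key_list returns one row per column of each key; the first
        # field is the key id, so distinct ids give the number of foreign keys.
        fk_rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        foreign_keys += len({row[0] for row in fk_rows})
    conn.close()
    return {"tables": len(tables), "columns": columns, "foreign_keys": foreign_keys}


# Hypothetical toy schemas, invented for illustration only.
schemas = [
    """CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
       CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT,
                           author_id INTEGER REFERENCES authors(id));""",
    """CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, created_at TEXT);""",
    """CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
       CREATE TABLE orders (id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(id), total REAL);
       CREATE TABLE items (id INTEGER PRIMARY KEY,
                           order_id INTEGER REFERENCES orders(id), sku TEXT);""",
]

stats = [schema_stats(ddl) for ddl in schemas]
median_tables = median(s["tables"] for s in stats)
median_columns = median(s["columns"] for s in stats)
total_foreign_keys = sum(s["foreign_keys"] for s in stats)
print(median_tables, median_columns, total_foreign_keys)
```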
