Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

🧠 The Big Problem: The "Enterprise Trilemma"

Imagine you run a massive company with a huge library of data (a database). You want your employees to ask questions in plain English (like "Show me the top 5 movies from 2023") and have a computer instantly write the complex code (SQL) to find that answer.

This is the Text-to-SQL problem.

Currently, companies face a "Trilemma" (a three-way impossible choice):

The Giant Brain (High Cost): Use a massive, super-smart AI (like GPT-4). It's brilliant, but it costs a fortune to run and requires sending your secret data to a third party (a security risk).
The Local Brain (Low Performance): Run a smaller, cheaper AI on your own servers. It's cheap and secure, but it's often "dumb" and makes silly mistakes with complex questions.
The Middle Ground: Try to make the small brain smarter without breaking the bank.

The Paper's Goal: How do we make the small, local brain as smart as the giant brain without paying the giant's price?

🎓 The Solution: Teaching a Student with a Blueprint

The authors tried a technique called Knowledge Distillation. Think of this as a Master Chef (Teacher) teaching a Junior Chef (Student).

The Old Way (Unstructured CoT): The Master Chef says, "First, I think about the ingredients, then I chop them, then I fry them..." while rambling in a free-flowing, conversational way. The Junior Chef listens but gets confused by the rambling. They might forget to check if the pan is hot or grab the wrong spice.
The New Way (Struct-SQL): The Master Chef provides a strict, step-by-step Blueprint (like a recipe card or a construction plan).
- Step 1: Check the "Movie" table.
- Step 2: Scan for the "Popularity" column.
- Step 3: Filter for the top 5.
- Step 4: Write the code.

The paper argues that for complex tasks like writing database code, rambling thoughts aren't enough. The student needs a formal logical blueprint (called a Query Execution Plan) to follow.

🛠 How It Works (The "Struct-SQL" Method)

The Teacher: A massive AI (GPT-4o) is asked to solve a problem. Instead of just giving the answer, it is forced to write out a formal plan first, like a database engineer would. It breaks the problem down into: Scan Table → Join Data → Filter Results → Group Data.
The Training: The small AI (the Student) is trained to copy both the plan and the final answer. It learns to say, "Okay, first I must scan the table, then I must join..." before it even thinks about writing the code.
The Result: The small AI learns to "think" like a database engine, not just like a chatterbox.
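
To make the blueprint idea concrete, here is a small Python sketch of the movie question from the start of this summary. The table, its columns (title, year, popularity), the sample rows, and the wording of the plan steps are all made up for illustration; none of them come from the paper.

```python
import sqlite3

# Hypothetical toy database; the schema and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER, popularity REAL)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [
        ("A", 2023, 9.1),
        ("B", 2023, 8.7),
        ("C", 2022, 9.9),  # wrong year, must be filtered out
        ("D", 2023, 7.2),
        ("E", 2023, 6.5),
        ("F", 2023, 5.0),
        ("G", 2023, 4.2),  # sixth-best 2023 movie, cut by the top-5 limit
    ],
)

# The "blueprint": one line per step, written before any SQL.
plan = [
    "Scan: movie table",
    "Filter: year = 2023",
    "Sort: popularity, highest first",
    "Limit: keep 5 rows",
]

# The SQL that follows the blueprint step by step.
sql = """
SELECT title
FROM movie
WHERE year = 2023
ORDER BY popularity DESC
LIMIT 5
"""

top5 = [row[0] for row in conn.execute(sql)]
print("\n".join(plan))
print(top5)
```

Each line of the plan maps onto one clause of the query, which is the sense in which the Student is asked to "follow the blueprint" rather than jump straight to the code.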

📊 The Results: Why It Matters

The researchers tested this on a famous database challenge called BIRD. Here is what happened:

The Small Brain (Untuned): Got about 17% of questions right. It was hallucinating (making up table names that didn't exist).
The Small Brain (Old Way): Got about 37% right. It was better, but still made grammar mistakes.
The Small Brain (Struct-SQL): Got 45% right.

The "Aha!" Moment:
The biggest improvement wasn't that the AI got smarter at logic; it was that it stopped making silly mistakes.

Analogy: Imagine a student taking a math test.
- Old Way: The student knows the math but forgets to write the "plus" sign or writes the wrong number. (Syntactic errors).
- New Way (Struct-SQL): The student follows a checklist. "Did I write the plus sign? Yes. Did I pick the right numbers? Yes."
- Result: The Struct-SQL model made far fewer "typos" and "hallucinations" because the blueprint forced it to be precise.

🚀 Why This is a Game Changer

Security & Privacy: You can run a capable small model on your own company server (private), so your data never has to leave for a third-party cloud.
Cost: It's much cheaper to run a small model than a giant one.
Reliability: By forcing the AI to follow a "blueprint," it stops making up fake database tables. It acts more like a professional engineer and less like a creative writer.

⚠️ The Catch (Limitations)

There is one trade-off. Because the AI has to write out the "blueprint" before giving the answer, it takes more time and computer power (about 3.6x more "tokens" or words) than the old method. However, the authors argue this extra cost is still much lower than hiring the "Giant Brain" (GPT-4) and is worth it for the accuracy boost.

🏁 The Bottom Line

This paper presents evidence that if you want a small, cheap, private AI to do complex database work, you shouldn't just let it "chat" its way to an answer. You need to teach it to follow a strict, structured plan.

In short: Don't just teach the student what the answer is; teach them how to build the ladder to get there. That's how you turn a small, clumsy robot into a reliable database expert.

1. Problem Statement: The Enterprise Adoption Trilemma

The paper addresses a critical bottleneck in deploying Text-to-SQL systems at the enterprise level, described as an "Adoption Trilemma" involving three conflicting factors:

Cost: High-performance Large Language Models (LLMs) require significant computational resources or expensive API usage.
Security: Using external APIs for proprietary LLMs poses risks regarding sensitive database schemas and data transmission.
Performance: Small Language Models (SLMs) suitable for private, low-cost deployment often lack the zero-shot accuracy required for complex queries.

While advanced reasoning techniques like Chain-of-Thought (CoT) and Query Plan CoT (QP-CoT) have significantly improved LLM performance, these methods fail to transfer effectively to SLMs. SLMs struggle to internalize unstructured reasoning traces, leading to high rates of schema hallucination (generating non-existent tables/columns) and syntactic errors, even when prompted with structured logic.

2. Methodology: The Struct-SQL Framework

The authors propose Struct-SQL, a Knowledge Distillation (KD) framework designed to transfer reasoning capabilities from a powerful Teacher LLM to a smaller Student SLM using a structured reasoning signal rather than unstructured text.

Core Hypothesis

The paper hypothesizes that a formal, structured representation of reasoning (specifically a Query Execution Plan) provides a clearer, less ambiguous supervisory signal for SLMs than natural language CoT traces.

Technical Workflow

Teacher Model (Oracle): A state-of-the-art LLM (GPT-4o) generates the supervision used for distillation. Instead of just outputting SQL, it is prompted to generate a QP-CoT trace. This trace acts as a formal blueprint, decomposing the query into logical steps: Schema Linking $\rightarrow$ Table/Index Scan $\rightarrow$ Selection $\rightarrow$ Joining $\rightarrow$ Filtering $\rightarrow$ Grouping.
Distillation Signal: The supervisory signal ( $Z_T$ ) consists of the concatenation of the Structured Query Plan ( $R_{QP-CoT}$ ) and the final SQL Query ( $Y_T$ ).
Student Model Training: A smaller model (Qwen3-4B-Instruct-2507) is fine-tuned using Parameter-Efficient Fine-Tuning (QLoRA) to replicate the entire sequence (Plan + SQL).
Inference: During inference, the Student Model is prompted to autonomously generate the structured Query Plan first, then synthesize the final SQL query based on that plan.
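
The workflow above can be sketched in a few lines of Python: one helper builds the training target by concatenating the plan and the SQL, and another splits a generated sequence back into its two parts at inference time. This is a minimal sketch under stated assumptions. The delimiters `### Query Plan` and `### SQL`, the helper names, and the example plan steps are hypothetical; the summary does not specify the serialization format the authors use.

```python
PLAN_TAG = "### Query Plan"  # hypothetical delimiter, not from the paper
SQL_TAG = "### SQL"          # hypothetical delimiter, not from the paper


def build_target(plan_steps, sql):
    """Concatenate the structured plan and the final SQL into one target string."""
    numbered = [f"{i}. {step}" for i, step in enumerate(plan_steps, start=1)]
    return "\n".join([PLAN_TAG, *numbered, SQL_TAG, sql.strip()])


def split_output(generated):
    """Split a generated plan-then-SQL sequence into (plan_text, sql).

    Returns (text, None) when the SQL delimiter is missing, which a caller
    could count as a generation failure.
    """
    plan_part, sep, sql_part = generated.partition(SQL_TAG)
    if not sep:
        return generated.strip(), None
    plan_text = plan_part.replace(PLAN_TAG, "", 1).strip()
    return plan_text, sql_part.strip()


# Illustrative example; the schema and steps are made up.
target = build_target(
    [
        "Schema linking: movie(title, year, popularity)",
        "Scan: movie",
        "Filter: year = 2023",
    ],
    "SELECT title FROM movie WHERE year = 2023;",
)
plan_text, sql = split_output(target)
print(target)
```

Keeping the plan and the SQL in a single sequence means the Student is supervised on every plan token as well as the query, while the explicit delimiter lets downstream code execute only the SQL part.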

Experimental Setup

Dataset: BIRD benchmark (mini-dev for evaluation, full training set for distillation).
Baselines:
- FN-Gold: Fine-tuning only on the final SQL query.
- ReasonSQL: Distillation using unstructured natural language CoT traces (the standard KD approach).
- Student Model (Zero-shot): The base SLM without tuning.
Data Construction: A stratified sampling method ensured the training set covered various SQL complexities (single-table, subqueries, joins, set operations) while filtering for execution-correct SQL.
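
A minimal sketch of this kind of data construction, using SQLite from the Python standard library, might look as follows. The keyword-based bucketing rule, the function names, and the toy schema are assumptions made for illustration, not the authors' procedure. Comparing result rows as sets is one common way to check execution correctness and is also an assumption here.

```python
import random
import sqlite3
from collections import defaultdict


def run(conn, sql):
    """Return all rows, or None if the query fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_correct(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same set of rows."""
    pred, gold = run(conn, predicted_sql), run(conn, gold_sql)
    return pred is not None and gold is not None and set(pred) == set(gold)


def complexity_bucket(sql):
    """Crude keyword heuristic (an assumption, not the paper's rule)."""
    s = " " + " ".join(sql.upper().split()) + " "
    if any(op in s for op in (" UNION ", " INTERSECT ", " EXCEPT ")):
        return "set_operation"
    if s.count("SELECT ") > 1:
        return "subquery"
    if " JOIN " in s:
        return "join"
    return "single_table"


def stratified_sample(conn, pairs, per_bucket, seed=0):
    """Keep execution-correct teacher queries, then draw up to per_bucket per stratum."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for teacher_sql, gold_sql in pairs:
        if execution_correct(conn, teacher_sql, gold_sql):
            buckets[complexity_bucket(teacher_sql)].append(teacher_sql)
    sample = {}
    for name, items in buckets.items():
        rng.shuffle(items)
        sample[name] = items[:per_bucket]
    return sample


# Hypothetical toy database and (teacher SQL, gold SQL) pairs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movie (title TEXT, year INTEGER, director_id INTEGER);
INSERT INTO director VALUES (1, 'Ann'), (2, 'Bo');
INSERT INTO movie VALUES ('A', 2023, 1), ('B', 2022, 2);
""")
pairs = [
    ("SELECT title FROM movie", "SELECT title FROM movie"),
    ("SELECT m.title FROM movie m JOIN director d ON m.director_id = d.id",
     "SELECT title FROM movie"),
    ("SELECT title FROM movie WHERE year = (SELECT MAX(year) FROM movie)",
     "SELECT title FROM movie WHERE year = 2023"),
    ("SELECT title FROM movie UNION SELECT name FROM director",
     "SELECT title FROM movie UNION SELECT name FROM director"),
    ("SELECT titel FROM movie", "SELECT title FROM movie"),  # hallucinated column
    ("SELECT title FROM movie WHERE year = 2022",
     "SELECT title FROM movie WHERE year = 2023"),  # runs, but wrong rows
]
sample = stratified_sample(conn, pairs, per_bucket=1)
print(sample)
```

Filtering before bucketing means a stratum is filled only with traces whose SQL was verified against the gold query, so the Student is not trained on plans that lead to wrong answers.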

3. Key Contributions

Systematic Evaluation of Structured KD: The first work to systematically evaluate the impact of distilling structured reasoning signals (Query Plans) versus unstructured CoT for Text-to-SQL.
Error Analysis & Curriculum Learning: Demonstrates that structured signals act as a superior "curriculum," specifically reducing syntactic errors (including schema hallucinations), which are the primary failure mode of SLMs.
Generalization: Validated the framework across two different SLM architectures (Qwen3-4B and Mistral-7B), suggesting that the method is not tied to a single architecture.
Open Source: Released code, models, and datasets to facilitate reproducible research.

4. Results

The experiments on the BIRD mini-dev benchmark yielded significant performance gains:

Execution Accuracy (EX):
- Struct-SQL: 45.00%
- ReasonSQL (Unstructured KD): 36.90%
- FN-Gold (Standard Fine-tuning): 34.30%
- Base Student Model: 17.00%
- Teacher (GPT-4o): 53.60%
- Result: Struct-SQL achieved an absolute improvement of 8.1 percentage points over the unstructured KD baseline (ReasonSQL) and recovers about 84% of the Teacher's execution accuracy.
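
For reference, execution accuracy is commonly defined as the fraction of examples whose predicted query returns the same result as the gold query. The notation below ($N$ examples, predicted query $\hat{y}_i$, gold query $y_i$, and $V(\cdot)$ for the result of executing a query) is introduced here for illustration and is not taken from the paper.

$$
\mathrm{EX} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ V(\hat{y}_i) = V(y_i) \right]
$$

The headline comparisons follow directly from the figures listed above:

$$
45.00 - 36.90 = 8.10 \ \text{points}, \qquad \frac{8.10}{36.90} \approx 0.22, \qquad \frac{45.00}{53.60} \approx 0.84
$$

That is, the gain over ReasonSQL is 8.1 points in absolute terms (roughly 22% in relative terms), and the Student reaches about 84% of the Teacher's EX.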
Error Reduction Analysis:
- Syntactic Errors: Struct-SQL reduced total syntactic errors from 21.2% (ReasonSQL) to 16.8%. It specifically eliminated "Keyword Issues" and reduced "No Such Column" hallucinations significantly.
- Generation Failures: Reduced from 2.2% to 0.4%.
- Semantic Errors: Improved logical reliability, reducing "Empty Output" errors.
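
An error breakdown of this kind can be approximated by executing each predicted query and bucketing the outcome. The sketch below uses SQLite's error messages; the label names and the mapping from messages to categories are assumptions for illustration and may not match the paper's taxonomy exactly.

```python
import sqlite3
from collections import Counter


def classify(conn, sql):
    """Map one predicted query to a coarse outcome label (labels are illustrative)."""
    if not sql or not sql.strip():
        return "generation_failure"
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        msg = str(exc).lower()
        if "no such column" in msg:
            return "no_such_column"
        if "no such table" in msg:
            return "no_such_table"
        if "syntax error" in msg:
            return "syntax_or_keyword_issue"
        return "other_execution_error"
    # "executed" only means the query ran and returned rows, not that it is correct.
    return "empty_output" if not rows else "executed"


# Hypothetical toy database and predictions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.execute("INSERT INTO movie VALUES ('A', 2023)")

predictions = [
    "SELECT title FROM movie",
    "SELECT titel FROM movie",                    # hallucinated column
    "SELECT title FROM films",                    # hallucinated table
    "SELECT title FORM movie",                    # misspelled keyword
    "SELECT title FROM movie WHERE year = 1900",  # runs, returns nothing
    "",                                           # no SQL produced
]
labels = [classify(conn, p) for p in predictions]
print(Counter(labels))
```

Dividing each count by the number of predictions gives per-category rates comparable in spirit to the percentages reported above.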
Ablation Study (Prompt vs. Training):
- When the ReasonSQL model (trained on unstructured CoT) was tested with the structured QP-CoT prompt, performance dropped to 29.2%. This indicates that the gain comes from internalizing the logical decomposition during training, not just from the prompt format at inference.
Generalization: On Mistral-7B, Struct-SQL achieved 29.31% EX compared to 25.10% for ReasonSQL, suggesting that the gain carries over to a different base model.
Official Benchmark: On the non-public BIRD test set, Struct-SQL (4B parameters) achieved 60.42% EX, ranking 1st globally among models with $\le$ 4B parameters (as of Jan 2026).

5. Significance and Implications

Addressing the Trilemma: Struct-SQL demonstrates that enterprises can deploy private, low-cost SLMs that recover most of the Teacher's accuracy (about 84% of GPT-4o's EX on BIRD mini-dev), easing the trade-off between cost/security and performance.
Shift in Distillation Paradigm: The paper argues that for complex tasks like Text-to-SQL, the structure of the reasoning is as important as the reasoning content itself. Unstructured CoT is insufficient for SLMs to learn strict logical constraints; a formal blueprint is required.
Efficiency: Training is computationally cheap, converging in ~29 minutes on a single H200 GPU using only 1,000 high-quality samples.
Limitations: The approach introduces inference overhead (3.6x more tokens due to plan generation) and is bounded by the Teacher's ability to solve edge cases. However, these costs are deemed acceptable compared to the operational costs of running large proprietary models.

In conclusion, Struct-SQL provides evidence that transferring structured logical blueprints (Query Execution Plans) is a more effective strategy than unstructured CoT for distilling Text-to-SQL capabilities into Small Language Models, significantly reducing syntactic failures and enabling reliable, private enterprise deployment.