TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Imagine you are a detective trying to solve a mystery, but you've been dropped into a massive, dark library with millions of books. You don't have a catalog, a map, or even a list of what's on the shelves. All you know is the question you need to answer: "Who bought the most expensive item last year?"

The Old Way: The "Full Library" Problem

Most current AI detectives (Text-to-SQL models) work under a strange rule: They are only allowed to solve the case if someone first dumps the entire library's catalog into their brain.

The Problem: In the real world, databases are like giant, messy warehouses with thousands of tables (shelves) and noisy, outdated labels. Trying to stuff the entire catalog into the detective's brain is impossible (it's too big) and actually harmful (it's too much noise, making them forget the important clues).
The Result: If the detective guesses a book title that doesn't exist, they hallucinate (make things up) and fail.

The New Way: TRUST-SQL (The Active Detective)

The paper introduces TRUST-SQL, a new kind of detective that doesn't wait for a catalog. Instead, it learns to actively explore the library, find the right books, and verify them before solving the case.

Here is how it works, using a simple 4-step routine:

Explore (The Scouting Mission): The detective walks up to a shelf and asks, "What books are here?" It queries the database to see what tables and columns actually exist.
Propose (The "Wait, Let's Check" Moment): This is the most important step. Before writing the final answer, the detective stops and says, "Okay, I've checked the shelves. I am 100% sure the 'Customers' table exists and has a 'Spent' column. I will now commit to this list."
- Why this matters: This stops the detective from making up fake book titles. It forces them to stick to what they actually saw.
Generate (Writing the Report): Now that they have a verified list of books, they write the SQL query (the report) based only on those confirmed facts.
Confirm (The Final Check): They run the query to see if it works. If it fails, they go back to the Explore step and try again.

The Secret Sauce: "Dual-Track" Training

The hardest part of teaching a detective is grading.

If the detective finds the right books but writes a bad report, did they fail?
If they write a great report but used a book that doesn't exist, did they fail?

Old methods gave a single grade at the very end, which confused the detective. TRUST-SQL uses a "Dual-Track" grading system:

Track A (The Explorer): Grades the detective only on how well they found the right books.
Track B (The Writer): Grades the detective only on how well they wrote the report using those books.

This way, the detective learns to be a great explorer and a great writer separately, without one mistake ruining the lesson for the other.

The Results: Why It's a Big Deal

The researchers tested this on 5 different "libraries" (benchmarks).

The Surprise: Even though TRUST-SQL started with zero knowledge of the library (no pre-loaded catalog), it performed just as well as, or even better than, the top detectives who were given the full catalog upfront.
The Efficiency: It didn't waste time reading irrelevant books. It only looked for what it needed.
The Improvement: For the smaller 4B model, this method raised the success rate by over 30 percentage points on average compared with the untrained base model.

The Takeaway

TRUST-SQL teaches AI to stop being a passive reader who memorizes a list and start being an active investigator. In a world where data is messy, huge, and constantly changing, the ability to "look before you leap" and verify facts in real-time is the key to solving complex problems.

In short: Instead of giving the AI a giant, confusing map, we taught it how to use a flashlight to find its own way through the dark.

1. Problem Definition: The Unknown Schema Setting

Current Text-to-SQL research largely operates under the Full Schema Assumption, where the entire database schema (tables, columns, relationships) is pre-loaded into the model's context. While effective for standard benchmarks, this approach fails in real-world enterprise environments where:

Databases contain hundreds of tables with massive, noisy metadata.
Schemas evolve frequently (additions, deletions, restructuring).
Injecting the full schema exceeds context window limits and distracts models with irrelevant information.

The Challenge: The paper formalizes the Unknown Schema setting, where an agent must autonomously explore a hidden database environment to identify and verify only the relevant metadata required to answer a natural language query. This requires moving from passive translation to active, multi-turn tool-integrated decision-making.

2. Methodology: TRUST-SQL Framework

The authors propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools), which addresses the problem through a structured interaction protocol and a novel reinforcement learning (RL) strategy.

A. Four-Phase Interaction Protocol

The task is formulated as a Partially Observable Markov Decision Process (POMDP). The agent follows a strict four-phase workflow to prevent hallucinations:

Explore: The agent queries database metadata (e.g., sqlite_master) to discover tables and columns.
Propose (Cognitive Checkpoint): The agent must commit to a verified schema subset before generating SQL. This acts as a mandatory checkpoint to ground reasoning in observed facts, preventing the model from fabricating non-existent structures.
Generate: The agent generates a candidate SQL query based only on the verified schema from the Propose phase.
Confirm: The agent submits the final SQL. If execution fails, the agent can loop back to refine the schema or SQL.
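
The four phases above can be sketched as a small loop against an in-memory SQLite database. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (explore, propose, generate, confirm, run_episode), the toy customers table, and the hard-coded generate step that stands in for the language model are all hypothetical. Only the use of sqlite_master for metadata discovery comes from the description above.

```python
import sqlite3


def explore(conn):
    """Explore: read table and column names from SQLite metadata."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in cols]
    return schema


def propose(observed, wanted):
    """Propose: keep only the requested tables and columns that were observed."""
    verified = {}
    for table, columns in wanted.items():
        if table in observed:
            kept = [c for c in columns if c in observed[table]]
            if kept:
                verified[table] = kept
    return verified


def generate(verified):
    """Generate: hypothetical stand-in for the model; uses verified names only."""
    if "spent" in verified.get("customers", []):
        return "SELECT name FROM customers ORDER BY spent DESC LIMIT 1"
    return None


def confirm(conn, sql):
    """Confirm: execute the candidate and report (ok, rows_or_error_message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)


def run_episode(conn, wanted, max_rounds=3):
    """Loop over the four phases until a query executes or the budget runs out."""
    for _ in range(max_rounds):
        observed = explore(conn)
        verified = propose(observed, wanted)
        sql = generate(verified)
        if sql is None:
            continue
        ok, result = confirm(conn, sql)
        if ok:
            return sql, result
    return None, None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, spent REAL)")
conn.executemany("INSERT INTO customers (name, spent) VALUES (?, ?)",
                 [("Ada", 120.0), ("Grace", 310.5), ("Linus", 45.0)])

# The agent's initial guess includes a column and a table that do not exist.
wanted = {"customers": ["name", "spent", "loyalty_tier"], "orders": ["total"]}
sql, rows = run_episode(conn, wanted)
print(sql, rows)
```

In this toy run the guessed loyalty_tier column and orders table are dropped at the Propose step because they were never observed, so the generated query can only reference names that exist.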

B. Dual-Track GRPO (Group Relative Policy Optimization)

A core innovation is the Dual-Track GRPO strategy, designed to solve the credit assignment problem in long multi-turn trajectories. Standard RL often conflates exploration errors with generation errors, making it hard to learn which specific action caused a failure.

Track Decomposition: Each interaction trajectory is split into two optimization tracks:
- Schema Track ($\tau_{schema}$): Ends at the Propose checkpoint. Optimized using a Schema Reward ($R_{schema}$) based on the overlap between the proposed schema and the ground truth schema.
- Full Track ($\tau_{full}$): Spans the entire interaction. Optimized using an Execution Reward ($R_{exec}$) and a Format Reward ($R_{fmt}$).
Token-Level Masked Advantages: Advantages are strictly masked at the token level. The exploration advantage is applied only to tokens generated during the exploration phase, and the generation advantage only to tokens from the generation phase. This keeps the model from receiving undeserved credit for good SQL generation when the schema was guessed incorrectly, or vice versa.
Loss Function: The total loss combines both tracks: $L(\theta) = L_{full}(\theta) + \lambda \cdot L_{schema}(\theta)$.
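
A rough sketch of how the two tracks could produce per-token advantages is shown below. It is not the authors' code. It assumes the usual GRPO recipe of standardizing rewards within a group of rollouts (population standard deviation is used here as an assumption), and it assumes 0/1 masks marking which tokens belong to each phase. The function names, the mask layout, and the value lam=1.0 are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_token_advantages(advantage, mask):
    """Spread one trajectory-level advantage over the tokens selected by mask."""
    return [advantage * m for m in mask]


def dual_track_token_advantages(r_full, r_schema, full_masks, schema_masks, lam=1.0):
    """Per-token weights matching L_full + lam * L_schema (masks are assumed inputs)."""
    adv_full = group_relative_advantages(r_full)
    adv_schema = group_relative_advantages(r_schema)
    combined = []
    for a_f, a_s, m_f, m_s in zip(adv_full, adv_schema, full_masks, schema_masks):
        full_part = masked_token_advantages(a_f, m_f)
        schema_part = masked_token_advantages(a_s, m_s)
        combined.append([f + lam * s for f, s in zip(full_part, schema_part)])
    return combined


# Two rollouts of five tokens each: three exploration tokens, two generation tokens.
schema_masks = [[1, 1, 1, 0, 0], [1, 1, 1, 0, 0]]
full_masks = [[0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]

# Both rollouts proposed an equally good schema, but only the first query was correct.
r_schema = [1.0, 1.0]
r_full = [1.0, 0.0]

combined = dual_track_token_advantages(r_full, r_schema, full_masks, schema_masks)
print(combined)
```

Because both rollouts earn the same schema reward, their exploration tokens receive zero advantage, while the generation tokens carry the whole difference in outcome. That is the separation of credit the masking is meant to achieve.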

C. Reward Design

Execution Reward: Binary or partial credit based on whether the SQL executes and matches the ground truth result.
Schema Reward: Evaluates the quality of the schema proposed at the checkpoint. Crucially, the paper finds that coupling this reward with successful execution (Sparse + Coupled) yields the best results, ensuring the agent learns to retrieve precise schemas that lead to correct answers.
Format Reward: Enforces adherence to the four-phase protocol structure.
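
The three rewards might look roughly like the functions below. The summary does not specify the exact formulas, so every choice here is an assumption made for illustration: F1 over (table, column) pairs as the overlap measure, an order-insensitive binary match for execution, a simple phase-order check for format, and a "coupled" schema reward that is zeroed out whenever execution fails.

```python
from collections import Counter


def schema_overlap_f1(proposed, gold):
    """F1 between proposed and gold sets of (table, column) pairs (assumed metric)."""
    proposed, gold = set(proposed), set(gold)
    hits = len(proposed & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(proposed)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def execution_reward(predicted_rows, gold_rows):
    """Binary variant: 1.0 when both results hold the same rows, ignoring order."""
    same = Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))
    return 1.0 if same else 0.0


def format_reward(phases):
    """1.0 when the four phases occur in order somewhere in the trajectory."""
    expected = ["explore", "propose", "generate", "confirm"]
    remaining = iter(phases)
    # 'step in remaining' consumes the iterator, so this is an ordered-subsequence test.
    return 1.0 if all(step in remaining for step in expected) else 0.0


def coupled_schema_reward(proposed, gold, exec_reward):
    """Assumed 'sparse + coupled' reading: schema credit only after a correct result."""
    return schema_overlap_f1(proposed, gold) if exec_reward == 1.0 else 0.0


gold = {("customers", "name"), ("customers", "spent")}
proposed = {("customers", "name"), ("customers", "spent"), ("customers", "id")}

r_exec = execution_reward([("Grace",)], [("Grace",)])
r_schema = coupled_schema_reward(proposed, gold, r_exec)
r_fmt = format_reward(["explore", "explore", "propose", "generate", "confirm"])
print(r_exec, r_schema, r_fmt)
```

Here the proposed schema has one extra column, so precision is 2/3, recall is 1, and the F1 overlap is 0.8. Had the query returned the wrong rows, the coupled schema reward would be 0 regardless of overlap.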

3. Key Contributions

Autonomous Framework: The authors present TRUST-SQL as the first framework to close the loop from unconstrained exploration to grounded SQL generation without relying on static context pre-loading.
Dual-Track GRPO: A novel training strategy that disentangles schema exploration from SQL generation using token-level masked advantages. This yields a 9.9% relative improvement over standard GRPO on the BIRD-Dev benchmark.
Protocol Innovation: The introduction of the mandatory "Propose" checkpoint effectively suppresses hallucinations, reducing hallucination errors by 9.4x compared to baseline models without this checkpoint.

4. Experimental Results

The authors evaluated TRUST-SQL on five benchmarks (BIRD-Dev, Spider-Test, Spider-DK, Spider-Syn, Spider-Realistic) using Qwen3-4B and Qwen3-8B base models.

Performance Gains:
- 4B Variant: Achieved an average absolute improvement of 30.6% over the base model.
- 8B Variant: Achieved an average absolute improvement of 16.6% over the base model.
- TRUST-SQL-8B achieved 65.8% execution accuracy on BIRD-Dev (Greedy), outperforming strong baselines like OmniSQL-7B and SQL-Trail-7B, despite having no pre-loaded schema.
Robustness: The model significantly outperformed baselines on robustness benchmarks (Spider-Syn, Spider-Realistic), which suggests that active exploration generalizes better to perturbed and ambiguous scenarios than memorized schema patterns.
Efficiency: Despite multi-turn interactions, TRUST-SQL-4B consumed only 2.83K tokens per query, comparable to single-turn models with full schema access, and was 113x more token-efficient than training-free tool-augmented methods like CHESS.
Ablation Studies:
- Removing the SFT warm-up phase led to the agent "hacking" the reward by querying all tables immediately, confirming the necessity of supervised initialization.
- Injecting the full schema (Schema Prefill) into TRUST-SQL provided negligible gains and sometimes degraded performance, confirming that the active exploration policy is superior to static pre-filling.
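
The figures above mix absolute and relative improvements, which are easy to confuse. The short sketch below shows the difference using made-up accuracies (50 and 55 are hypothetical, not results from the paper), and then works out the per-query token count that the reported 2.83K tokens and 113x ratio would imply for the training-free baseline. That implied number is simple arithmetic on the two reported values, not a figure stated in the summary.

```python
def absolute_gain(base_acc, new_acc):
    """Difference in percentage points."""
    return new_acc - base_acc


def relative_gain(base_acc, new_acc):
    """Improvement as a percentage of the baseline accuracy."""
    return (new_acc - base_acc) / base_acc * 100.0


# Hypothetical accuracies, chosen only to illustrate the two definitions.
base, improved = 50.0, 55.0
abs_points = absolute_gain(base, improved)   # 5 percentage points
rel_percent = relative_gain(base, improved)  # 10 percent of the baseline

# Reported values: 2.83K tokens per query and a 113x efficiency ratio.
tokens_per_query_k = 2.83
efficiency_ratio = 113
implied_baseline_k = tokens_per_query_k * efficiency_ratio  # about 320K tokens

print(abs_points, rel_percent, round(implied_baseline_k, 2))
```

Read this way, a 30.6% absolute improvement means 30.6 percentage points of accuracy, while the 9.9% figure for Dual-Track GRPO is measured relative to the standard GRPO score.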

5. Significance and Conclusion

TRUST-SQL moves Text-to-SQL from passive schema consumption to active exploration. Its results indicate that autonomous database exploration is not only feasible but can match or outperform methods that rely on the Full Schema Assumption in complex, real-world environments.

Practical Impact: It enables LLMs to interact with large, evolving enterprise databases without hitting context limits or suffering from noise-induced hallucinations.
Methodological Advance: The Dual-Track GRPO approach provides a blueprint for resolving credit assignment in multi-turn, tool-integrated RL tasks where different phases of an agent's behavior have distinct objectives.
Future Direction: The work suggests that for complex reasoning tasks, "active discovery" of necessary information is more robust than "passive consumption" of all available information.

The paper concludes that TRUST-SQL establishes a new standard for reliable Text-to-SQL in unknown-schema environments, achieving state-of-the-art performance while operating entirely without pre-loaded metadata.