SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation

Imagine you are teaching a very smart but inexperienced apprentice (an AI) how to fix a complex plumbing system (a database) by writing instructions (SQL code).

In the old way of teaching (traditional AI), you would let the apprentice write a single instruction. If the pipe burst, you'd say, "Fail." If the water flowed perfectly, you'd say, "Success." You wouldn't tell them why they failed or which specific wrench they held wrong. This is like giving a student a final exam grade of "F" without showing them which math problems they got wrong. The apprentice gets frustrated, doesn't know what to improve, and learning is slow.

The paper SQL-ASTRA introduces a new, much smarter way to teach this apprentice. It treats the process not as a single test, but as a multi-turn conversation where the apprentice can try, check the result, fix mistakes, and try again. To make this work, they invented two special "coaching tools."

1. The "Partial Credit" Coach (CSMR)

The Problem: In the old days, if the apprentice got 9 out of 10 pipes connected correctly but missed one, the teacher would still give them a "Fail" (0 points). This is unfair and unhelpful. It's like failing a driving test because you parked 2 inches too far from the curb, ignoring that you drove perfectly the rest of the way.

The Solution (Column-Set Matching):
The new coach, CSMR, looks at the ingredients of the answer, not just the final dish.

Analogy: Imagine the apprentice is making a salad. The goal is to have lettuce, tomatoes, and cucumbers.
- Old Coach: "You put the tomatoes and cucumbers in the wrong bowl order. Fail." (0 points).
- CSMR Coach: "Hey, you got the tomatoes and cucumbers! That's great! But you missed the lettuce. You get 0.6 points."
Why it helps: By giving "partial credit" (dense feedback) for getting some parts right, the apprentice learns exactly what to keep and what to fix. It turns a scary "All or Nothing" game into a helpful step-by-step guide.

2. The "Energy Meter" Coach (ATR)

The Problem: Even with partial credit, the apprentice might get stuck in a loop. They might try a fix, get a little better, then try a different fix and go back to being worse, then try the first fix again. They are running in circles (a "limit cycle") without ever actually solving the problem.

The Solution (Aggregated Trajectory Reward):
The second coach, ATR, looks at the entire journey of the apprentice's attempts, not just the current step.

Analogy: Imagine the apprentice is hiking up a mountain to find a treasure (the correct SQL query).
- The Trap: Sometimes, hikers get stuck in a valley, walking up a small hill, then sliding back down, then walking up the same hill again. They are moving, but not getting closer to the peak.
- The Energy Meter: ATR acts like a strict energy meter based on physics (Lyapunov stability). It says: "Every time you take a step, you must use up some energy. If you walk in a circle, you lose more energy than you gain."
- The Result: Because the "energy cost" of looping is too high, the apprentice is mathematically forced to stop circling and keep moving uphill toward the solution. It guarantees they won't get stuck in an infinite loop of mistakes.

The Grand Result

By combining these two coaches:

CSMR gives the apprentice a detailed map of where they are right now (even if they aren't perfect yet).
ATR ensures the apprentice keeps moving forward and never gets stuck in a loop of bad habits.

The Outcome:
When they tested this new method on difficult database puzzles (like the BIRD and Spider datasets), the AI didn't just get slightly better; it jumped ahead of the current "State-of-the-Art" models. It learned to think like a human data analyst: asking a question, checking the answer, realizing a mistake, and refining the query until it was perfect.

In short: SQL-ASTRA stops treating AI like a student who only gets a final grade, and starts treating it like a trainee who gets constant, detailed feedback and a strict rule against running in circles. This makes the AI smarter, faster, and much more reliable at solving real-world problems.

1. Problem Statement

The paper addresses critical bottlenecks in applying Agentic Reinforcement Learning (RL) to Text-to-SQL tasks. While Agentic RL has shown promise in complex, multi-turn domains (e.g., web search, code execution), its application to Text-to-SQL remains largely restricted to single-turn paradigms due to three core challenges:

Paradigm Constraint: Existing methods often force a single-turn generation, failing to mimic the dynamic, iterative process of human data analysts who use multiple tentative queries to refine strategies.
Credit Assignment Problem: In multi-turn trajectories, traditional RL relies solely on final-turn feedback (binary 0/1 success). This "all-or-nothing" approach treats the interaction as a black box, making it impossible for the agent to determine which intermediate steps contributed to the final success or failure.
Micro-level Reward Sparsity: Even when step-level feedback is attempted, it is typically coarse and binary. This ignores "partially correct" queries (e.g., correct columns but wrong rows), providing insufficient granular guidance for efficient training.
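
To make the sparsity concrete, the following minimal sketch shows an all-or-nothing execution reward. The function name `binary_execution_reward` and the toy rows are illustrative assumptions, not code from the paper. A query that returns the right values with the wrong row pairing scores the same as a query that returns nothing relevant.

```python
from typing import List, Tuple

Row = Tuple[object, ...]

def binary_execution_reward(predicted: List[Row], gold: List[Row]) -> float:
    """Return 1.0 only when both results contain the same rows (order ignored)."""
    same_rows = sorted(map(repr, predicted)) == sorted(map(repr, gold))
    return 1.0 if same_rows else 0.0

gold = [("Alice", 30), ("Bob", 25)]
near_miss = [("Alice", 25), ("Bob", 30)]  # right values per column, wrong row pairing
unrelated = [("x",)]                      # nothing in common with the gold result

print(binary_execution_reward(near_miss, gold))  # 0.0
print(binary_execution_reward(unrelated, gold))  # 0.0
```

Both failures collapse to the same score, so the signal cannot tell the agent that the first attempt was much closer.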

2. Methodology: The Agentic SQL Framework

The authors propose Agentic SQL, a framework that models Text-to-SQL as a Finite-Horizon Markov Decision Process (MDP). The core innovation is a universal two-tiered reward mechanism designed to provide dense, process-oriented signals.
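
A finite-horizon, multi-turn episode of this kind might look roughly like the sketch below. It assumes an SQLite database and a hypothetical `policy` callable that returns the next query plus a flag marking the final answer. The names `run_episode`, `scripted_policy`, and `max_turns` are illustrative and are not taken from the paper.

```python
import sqlite3
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (SQL that was executed, textual feedback shown to the agent)
Policy = Callable[[str, List[Turn]], Tuple[str, bool]]

def run_episode(conn: sqlite3.Connection, question: str,
                policy: Policy, max_turns: int = 4) -> List[Turn]:
    """Roll out at most max_turns query/feedback steps (the finite horizon)."""
    history: List[Turn] = []
    for _ in range(max_turns):
        sql, is_final = policy(question, history)
        try:
            rows = conn.execute(sql).fetchall()
            feedback = repr(rows[:5])      # show only a preview of the result
        except sqlite3.Error as exc:
            feedback = f"ERROR: {exc}"     # execution errors are feedback too
        history.append((sql, feedback))
        if is_final:
            break
    return history

def scripted_policy(question: str, history: List[Turn]) -> Tuple[str, bool]:
    """Stand-in for the LLM: a first attempt with a typo, then a corrected query."""
    if not history:
        return "SELECT nme FROM users", False
    return "SELECT name FROM users ORDER BY name", True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("Alice", 30), ("Bob", 25)])

history = run_episode(conn, "List all user names.", scripted_policy)
for sql, feedback in history:
    print(sql, "->", feedback)
```

In a real system the scripted policy would be replaced by the language model, and each turn's result would be scored by the reward functions described next.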

A. Column-Set Matching Reward (CSMR)

To solve the sparsity of binary rewards, the authors introduce CSMR, a dense, step-level reward function.

Mechanism: Instead of comparing execution result rows (tuples) directly, CSMR compares the sets of unique values within each column of the predicted result ($P$) versus the ground truth ($G$).
Partial Correctness: It calculates a score based on the overlap of column value-sets. Even if row ordering or composition is slightly off, matching column values yield a positive reward in the range $[0, 1]$.
Scaling Factor ($\alpha$): A scaling factor (e.g., $\alpha=0.8$) is applied to cap the reward for "pseudo-perfect" matches (where column values match but row combinations are wrong), distinguishing them from truly perfect matches.
Impact: This converts sparse binary signals into dense, granular feedback, capturing rich information from error cases that traditional methods discard.
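
The idea can be sketched as follows. The summary does not give the exact formula, so this version makes several assumptions: overlap is measured with Jaccard similarity, each ground-truth column is matched to its best-overlapping predicted column, the per-column scores are averaged, and any result that is not an exact row-level match is scaled by $\alpha$. The name `csmr_sketch` and its helpers are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Row = Tuple[object, ...]

def column_value_sets(rows: Sequence[Row]) -> List[Set[object]]:
    """Turn a list of rows into one set of unique values per column."""
    return [set(col) for col in zip(*rows)] if rows else []

def jaccard(a: Set[object], b: Set[object]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def csmr_sketch(pred_rows: Sequence[Row], gold_rows: Sequence[Row],
                alpha: float = 0.8) -> float:
    # A true match (same rows with the same multiplicity) earns the full reward.
    if Counter(pred_rows) == Counter(gold_rows):
        return 1.0
    pred_cols = column_value_sets(pred_rows)
    gold_cols = column_value_sets(gold_rows)
    if not pred_cols or not gold_cols:
        return 0.0
    # Assumption: score each gold column by its best-overlapping predicted column.
    best = [max(jaccard(g, p) for p in pred_cols) for g in gold_cols]
    # Anything short of an exact match is capped at alpha.
    return alpha * sum(best) / len(best)

gold = [("Alice", 30), ("Bob", 25)]
pseudo_perfect = [("Alice", 25), ("Bob", 30)]  # every column matches, rows do not
one_column = [("Alice",), ("Bob",)]            # only the name column was returned

print(csmr_sketch(gold, gold))                      # 1.0
print(round(csmr_sketch(pseudo_perfect, gold), 2))  # 0.8
print(round(csmr_sketch(one_column, gold), 2))      # 0.4
```

This simplified version does not penalize extra predicted columns. A fuller implementation would need to decide how to handle them.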

B. Aggregated Trajectory Reward (ATR)

To solve the credit assignment problem across multiple turns, the authors propose ATR, a trajectory-level reward.

Mechanism: ATR aggregates step-wise CSMR scores using an Asymmetric Transition Matrix. It evaluates the direction and magnitude of semantic changes between turns.
Asymmetry: The matrix imposes a strict penalty on degradation (moving from a high-reward state to a low-reward state) that is significantly larger than the reward for improvement.
- $|R_{\text{High} \to \text{Low}}| > |R_{\text{Low} \to \text{High}}|$
Theoretical Guarantee (Lyapunov Stability):
- The authors model the reasoning process as a dynamical system where the CSMR score represents "Semantic Error Energy."
- They prove that ATR acts as an energy dissipation operator.
- Result: This mathematical guarantee ensures asymptotic stability (convergence to the correct SQL) and, crucially, eliminates limit cycles (oscillations between suboptimal states) by ensuring the net reward over any cycle is negative.
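
The cycle-breaking effect of the asymmetry can be illustrated with a small sketch. It is a continuous stand-in for the transition matrix, not the paper's definition: each improvement in step score is weighted by a hypothetical `gain`, and each degradation by a larger hypothetical `penalty`. The step scores are assumed to be CSMR-like values in $[0, 1]$.

```python
from typing import Sequence

def atr_sketch(step_scores: Sequence[float],
               gain: float = 1.0, penalty: float = 1.5) -> float:
    """Sum asymmetric transition rewards over consecutive step scores."""
    total = 0.0
    for prev, curr in zip(step_scores, step_scores[1:]):
        delta = curr - prev
        # Degradations (delta < 0) are weighted more heavily than improvements.
        total += gain * delta if delta >= 0 else penalty * delta
    return total

improving = [0.2, 0.5, 1.0]              # steady progress toward the right query
oscillating = [0.2, 0.8, 0.2, 0.8, 0.2]  # a loop that returns to its start

print(round(atr_sketch(improving), 2))    # 0.8
print(round(atr_sketch(oscillating), 2))  # -0.6
```

In any trajectory that returns to its starting score, the total upward movement equals the total downward movement. With `penalty > gain` the sum over such a cycle is therefore negative, which is the property the asymmetry is meant to enforce.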

C. Training Algorithm

The framework utilizes GRPO (Group Relative Policy Optimization).

Tool Masking: A binary mask is applied during loss calculation so that the policy is optimized on the tokens it generates itself (its reasoning and SQL) rather than on the tokens returned by query execution.
Advantage Calculation: The normalized ATR serves as the advantage signal for the entire trajectory, guiding the policy $\pi_\theta$ to maximize cumulative improvement.
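
A stripped-down sketch of these two pieces is shown below. It assumes the mask is 1 for model-generated tokens and 0 for tokens returned by the SQL executor, and it uses a plain policy-gradient surrogate. GRPO's clipped probability ratio and KL term are deliberately left out. The function names and toy numbers are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_normalized_advantages(rewards: Sequence[float],
                                eps: float = 1e-6) -> List[float]:
    """Normalize trajectory rewards within one group of sampled rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def masked_policy_loss(token_logprobs: Sequence[float],
                       tool_mask: Sequence[int], advantage: float) -> float:
    """Average -advantage * logprob over the tokens the mask keeps."""
    kept = [lp for lp, m in zip(token_logprobs, tool_mask) if m == 1]
    if not kept:
        return 0.0
    return -advantage * sum(kept) / len(kept)

# Trajectory-level rewards for four rollouts of the same question (toy values).
group_rewards = [0.8, -0.6, 0.1, 0.1]
advantages = group_normalized_advantages(group_rewards)

# One rollout: the third token came back from the database, so it is masked out.
logprobs = [-0.5, -1.0, -2.0, -0.25]
mask = [1, 1, 0, 1]
loss = masked_policy_loss(logprobs, mask, advantages[0])
print([round(a, 2) for a in advantages], round(loss, 3))
```

Because the same advantage is applied to every kept token of a rollout, the trajectory-level reward is what decides whether the whole sequence of turns is reinforced or discouraged.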

3. Key Contributions

Novel Reward Mechanism: Introduction of CSMR and ATR, creating a two-tiered system that provides both immediate dense feedback and long-term trajectory guidance.
Theoretical Rigor: The first application of Lyapunov stability theory to Text-to-SQL RL reward design, mathematically proving that the asymmetric reward structure guarantees cycle-free policies and monotonic convergence.
Agentic Paradigm Shift: Successfully transitioning Text-to-SQL from a static, single-turn generation task to a dynamic, multi-turn interactive agent paradigm without requiring a cold-start phase.

4. Experimental Results

The framework was evaluated on BIRD-Dev, Spider, and the challenging enterprise-grade Spider 2.0 datasets.

Performance Gains:
- On BIRD, Agentic SQL outperformed the binary-reward GRPO baseline by 5.7% (using Qwen2.5-7B-Instruct).
- On Spider, it achieved a 3.7% gain.
- On Spider 2.0, the model achieved 17.7% accuracy, significantly outperforming SOTA models like Arctic-Text2SQL-R1-7B (15.6%) and SQL-R1, despite using identical base models.
Ablation Studies:
- CSMR: Consistently outperformed binary rewards across all settings, validating the importance of dense signals.
- ATR: Removing the asymmetric matrix (using a symmetric one) led to repetitive loops and lower efficiency, confirming the necessity of the energy dissipation mechanism.
- Trajectory Aggregation: Direct step-wise updates without group normalization performed worse, highlighting the importance of aggregating signals for credit assignment.
Efficiency: While the multi-turn rollout takes roughly twice as long as single-turn methods, the convergence quality and final accuracy justify the computational cost.

5. Significance

This paper represents a significant leap forward in Agentic RL for Text-to-SQL.

Bridging the Gap: It effectively bridges the gap between LLM reasoning capabilities and real-world database interactions by enabling iterative refinement.
Solving Credit Assignment: By mathematically guaranteeing convergence and eliminating oscillatory behaviors, it solves the fundamental credit assignment problem that has hindered multi-turn SQL agents.
Generalizability: The proposed reward mechanisms (CSMR and ATR) are domain-agnostic and could potentially be applied to other complex reasoning tasks requiring iterative tool use and sparse feedback.

In conclusion, SQL-ASTRA demonstrates that by moving beyond binary, outcome-only rewards and incorporating dense, process-oriented signals grounded in control theory, Agentic RL can achieve robust, state-of-the-art performance in complex Text-to-SQL tasks.
