Imagine you are trying to teach a very smart, but inexperienced, apprentice how to solve an incredibly difficult puzzle. This is the challenge researchers face when trying to train small Artificial Intelligence (AI) models to solve complex math problems or write code.
The paper introduces a new teaching method called Supervised Reinforcement Learning (SRL). To understand why it's special, let's look at the two old ways of teaching, and why they often fail with hard problems.
The Two Old Ways (And Why They Stumble)
1. The "Copycat" Method (Supervised Fine-Tuning / SFT)
- The Analogy: Imagine you give the apprentice a finished, perfect painting and say, "Copy this exactly, stroke by stroke."
- The Problem: The apprentice learns to mimic the brushstrokes perfectly, but they don't understand why the painter put the blue there or the red there. If you ask them to paint a slightly different scene, they freeze. They have memorized the answer but haven't learned the logic. They are rigid and can't adapt.
2. The "Lottery" Method (Reinforcement Learning with Verifiable Rewards / RLVR)
- The Analogy: Imagine you tell the apprentice, "Keep trying to solve this puzzle. If you get the final answer right, you get a gold star. If you get it wrong, you get nothing."
- The Problem: If the puzzle is too hard, the apprentice might try 1,000 times and get it wrong every single time. They never get a gold star. Without that positive feedback, they don't know what they did wrong. They just keep spinning their wheels, getting frustrated, and learning nothing.
The New Solution: Supervised Reinforcement Learning (SRL)
The authors propose a third way that combines the best of both worlds. They call it SRL.
The Analogy: The "Step-by-Step Coach"
Instead of showing the apprentice the whole painting or just waiting for the final answer, the coach breaks the problem down into tiny, manageable steps.
- The "Action" Breakdown: The coach takes the expert's solution and cuts it into logical chunks (e.g., "Step 1: Find the prime numbers," "Step 2: Group them").
- The Inner Monologue: Before the apprentice makes a move, they are allowed to whisper their thoughts to themselves (the "inner monologue"). This is like the apprentice saying, "Okay, I think I need to multiply these numbers first."
- The "Similarity" Reward: This is the secret sauce.
- The apprentice makes a move (an "action").
- The coach doesn't wait for the final answer. Instead, the coach looks at just that one step.
- The coach asks: "Does this step look and feel like what the expert would do?"
- The Reward: If the apprentice's step is similar to the expert's step, they get a partial score (a "good job" on this specific move). Even if the final answer is wrong, getting the steps right gives them a reward.
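The similarity reward above can be sketched in a few lines. This is an illustrative stand-in, not the paper's actual metric: here, `SequenceMatcher` from Python's standard library scores how closely each model step matches the corresponding expert step, on a scale from 0 to 1.

```python
from difflib import SequenceMatcher

def step_reward(model_step: str, expert_step: str) -> float:
    """Score one reasoning step by its textual similarity to the expert's step.

    Illustrative stand-in for the paper's similarity reward:
    SequenceMatcher returns a ratio in [0, 1], where 1.0 means identical text.
    """
    return SequenceMatcher(None, model_step, expert_step).ratio()

# Hypothetical expert solution, cut into logical chunks ("actions").
expert_steps = [
    "Step 1: Find the prime factors of 84.",
    "Step 2: Group the factors into pairs.",
]
model_steps = [
    "Step 1: Find the prime factors of 84.",  # matches the expert exactly
    "Step 2: Add the factors together.",      # diverges from the expert
]

# The model is rewarded per step, not only for the final answer.
rewards = [step_reward(m, e) for m, e in zip(model_steps, expert_steps)]
print(rewards)  # first reward is 1.0; the second is only partial
```

Because each step earns its own partial score, the model gets a learning signal even on problems it never fully solves.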
Why This Changes Everything
Think of it like learning to drive a car on a steep, foggy mountain road.
- Old Way (SFT): You memorize the route. If the road changes, you crash.
- Old Way (RL): You drive blind. If you don't reach the bottom, you get no feedback. You crash and don't know if you turned too early or too late.
- SRL Way: A co-pilot sits next to you. Every time you turn the wheel, they say, "Good turn! That's exactly how we should have turned here." Even if you eventually miss the exit, you learned how to steer correctly.
Because the AI gets feedback on every single step, it never gets stuck in the "fog." It learns the logic of the journey, not just the destination.
The Results: What Happened?
The researchers tested this on:
- Hard Math Problems: Like those found in national competitions (AMC, AIME).
- Software Engineering: Fixing bugs in complex code.
The Outcome:
- Small AI models (which usually fail at these hard tasks) suddenly started solving them.
- The models didn't just memorize answers; they started "thinking" in a flexible way, checking their work, and adjusting their plans mid-solution.
- The best strategy was to use SRL first to teach the steps, and then use the "Lottery" method (RL) later to polish the final answers. It's like learning the scales on a piano before trying to play a concerto.
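Why does the ordering matter? A toy sketch, with a single "skill" number standing in for the model and entirely hypothetical update rules, shows the intuition: sparse final-answer rewards only pay out once the model can already solve problems sometimes, so dense step rewards have to come first.

```python
# Toy illustration of the two-stage recipe. The "model" is just a skill
# score; these functions are made-up placeholders, not real training code.

def srl_stage(skill: float, steps: int) -> float:
    """Stage 1 (SRL): every step gives usable feedback, so skill always grows."""
    for _ in range(steps):
        skill += 0.1
    return skill

def rlvr_stage(skill: float, steps: int, threshold: float = 1.0) -> float:
    """Stage 2 (RLVR): the sparse 'gold star' only arrives once skill is
    already above the threshold needed to occasionally get answers right."""
    for _ in range(steps):
        if skill >= threshold:
            skill += 0.05
    return skill

# The "lottery" alone never pays out for a weak model:
print(rlvr_stage(0.0, steps=10))           # stays at 0.0

# SRL first, then RLVR to polish:
skill = srl_stage(0.0, steps=12)           # learn the steps
skill = rlvr_stage(skill, steps=10)        # then polish final answers
print(round(skill, 2))
```

The numbers are arbitrary; the point is the structure: a reward that fires on every step moves a weak model off the starting line, while a reward that fires only on full success cannot.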
In a Nutshell
SRL is a training framework that stops AI from guessing blindly or just copying blindly. Instead, it acts like a patient coach who breaks big, scary problems into small steps, praises the AI for getting the steps right, and lets the AI think out loud before acting. This allows even small, open-source AI models to tackle problems that were previously impossible for them.