Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

This paper demonstrates that injecting a canonical action ordering signal into the reward function during RL post-training significantly improves Transformer performance on Zebra puzzles compared to optimizing for task success alone, even when the model is fine-tuned on randomized solution sequences.

Prakhar Gupta, Vaibhav Gupta

Published 2026-03-06

Imagine you are teaching a robot how to solve a complex logic puzzle, like a "Zebra Puzzle" (where you have to figure out who owns the zebra based on a list of clues).

Usually, when we teach robots (AI models) to solve these puzzles, we do two things:

  1. Show them examples: We let them read through the puzzle and the answer.
  2. Give them a grade: If they get the final answer right, we give them a gold star. If they get it wrong, no star.

The Problem:
The paper points out a flaw in this method. If you only care about the final gold star, the robot might get lucky. It might guess the right answer by stumbling around in the dark, or it might solve the puzzle in a weird, chaotic order that doesn't make logical sense. It gets the "what" (the answer) but misses the "how" (the logical path).

The Experiment:
The researchers asked: What if we gave the robot a tiny, subtle hint about the "order" in which a human expert would solve the puzzle, without actually showing it the expert's steps?

They tried this on a robot that had been trained on random puzzle solutions (where the steps were shuffled like a deck of cards). Then, they used a special training technique called RL (Reinforcement Learning) to tweak the robot's brain.

The "Secret Sauce": The Mixed Rewards
Instead of just giving a gold star for a correct answer, they created a two-part score:

  1. The "Solved" Reward (The Gold Star): Did you get the right answer? (Yes = 1 point, No = 0 points).
  2. The "Order" Reward (The Nudge): Did you fill in the puzzle pieces in a logical, step-by-step order, similar to how a human solver would?
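The two-part score can be sketched in code. The paper's exact order metric isn't spelled out here, so `order_reward` below is a hypothetical stand-in: a Kendall-tau-style score counting how many pairs of steps appear in the same relative order as a canonical (human-expert) solve.

```python
def solved_reward(predicted_solution, true_solution):
    """Gold Star: 1.0 only if the final answer is exactly right, else 0.0."""
    return 1.0 if predicted_solution == true_solution else 0.0

def order_reward(step_order, canonical_order):
    """Nudge: how closely the model's fill-in order tracks the canonical one.

    Both arguments are lists of the same step identifiers; the score is the
    fraction of step pairs that keep the same relative order as in the
    canonical solve (a Kendall-tau-style agreement, scaled to [0, 1]).
    """
    rank = {step: i for i, step in enumerate(canonical_order)}
    pairs = agreements = 0
    for i in range(len(step_order)):
        for j in range(i + 1, len(step_order)):
            pairs += 1
            if rank[step_order[i]] < rank[step_order[j]]:
                agreements += 1
    return agreements / pairs if pairs else 1.0
```

A perfectly ordered solve scores 1.0, a fully reversed one scores 0.0, and a partially shuffled one lands somewhere in between, which gives the nudge a smooth gradient rather than an all-or-nothing signal.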

The Creative Analogy: The Maze Runner
Imagine the robot is a runner in a maze.

  • Old Method: The runner gets a prize only if they reach the exit. They might run into walls, run in circles, or take a crazy shortcut that works by accident.
  • New Method: The runner still gets the prize for reaching the exit, BUT they also get a tiny "warm glow" every time they take a step that looks like a sensible path a human would take.

Even if the runner doesn't know the exact map, that "warm glow" (the ordering hint) gently steers them away from crazy loops and toward a logical path.

The "Bootstrapped Scaling" (The Volume Knob)
Here is the tricky part: the "Gold Star" and the "Warm Glow" live on different scales. Success is all-or-nothing (1 or 0), while the order score is a fractional value (like 0.5) whose typical size depends on the puzzle. If you just add them together, the Gold Star can drown out the Warm Glow.

The researchers invented a clever "Volume Knob" (called Bootstrapped Scaling). Before the training started, they measured how loud each reward was and turned the knobs so that the Gold Star and the Warm Glow were perfectly balanced. This way, the robot could hear both signals clearly.
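A minimal sketch of the "Volume Knob" idea, under assumptions: before RL training begins, each reward's average magnitude is measured over a batch of warm-up rollouts, and each reward is divided by that magnitude so both signals have comparable scale before the mixing weights (e.g. the 99%/1% split mentioned below) are applied. The function names and the exact normalization are illustrative, not the paper's verbatim recipe.

```python
def bootstrap_scales(solved_samples, order_samples, eps=1e-8):
    """Estimate one scale factor per reward from warm-up rollouts.

    Each scale is the reciprocal of that reward's mean magnitude, so that
    after scaling, both rewards average out to roughly the same loudness.
    """
    mean_solved = sum(solved_samples) / len(solved_samples)
    mean_order = sum(order_samples) / len(order_samples)
    return 1.0 / (mean_solved + eps), 1.0 / (mean_order + eps)

def mixed_reward(solved, order, scales, w_solved=0.99, w_order=0.01):
    """Combine the two normalized rewards with fixed mixing weights."""
    s_solved, s_order = scales
    return w_solved * (solved * s_solved) + w_order * (order * s_order)
```

Because the scales are fixed once before training starts (rather than re-estimated every batch), the reward function stays stationary during RL, which keeps the optimization target stable.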

The Results
The results were surprising and exciting:

  • The robot had been trained only on random, shuffled steps, so no logical order was ever shown to it during fine-tuning.
  • When they turned on the "Order Nudge" during training, the robot got significantly better at solving the puzzles.
  • The best result came when the robot was 99% focused on getting the answer right and only 1% focused on the order.

The Big Takeaway
You don't need to rewrite the robot's entire textbook or show it perfect examples to teach it logic. You can just give it a tiny, scalar hint (a whisper) about the right order of operations while it's practicing.

It's like telling a student, "You got the math problem right, but next time, try to write the steps in the order a teacher would." Even if you don't show them the teacher's notebook, just that tiny hint helps them learn the process, not just the answer.

Why does this matter?
This suggests we can make AI smarter and more logical without needing massive new datasets or changing the AI's architecture. We just need to tweak the "reward system" to care a little bit about how the AI thinks, not just what it thinks.
