Imagine you are training a brilliant but slightly rigid student to solve complex math puzzles. You give them a problem, they try to solve it, and if they get it right, you give them a gold star. If they get it wrong, you tell them to try again. This is how Reinforcement Learning with Verifiable Rewards (RLVR) works for AI: it learns by trial and error, getting better at picking the "right" answer from the answers it already knows how to generate.
However, the authors of this paper noticed a problem: the student is hitting a ceiling.
They are getting really good at picking the best answer from the list of answers they already know, but they aren't learning to come up with new ways of thinking. It's like a chef who has mastered ten recipes perfectly but refuses to invent an eleventh one, even if the eleventh one would be delicious.
Here is the simple breakdown of their solution, PSN-RLVR, using some everyday analogies.
1. The Problem: The "Echo Chamber" Effect
Current AI training is like asking a student to solve a math problem by generating 100 different answers and picking the best one.
- The Issue: The student keeps generating answers that sound very similar to each other. They are just rearranging the same old ideas.
- The Result: If you ask for 100 answers, you get 100 slightly different versions of the same solution. You aren't discovering new strategies; you're just re-weighting old ones.
2. The Old Way of Trying New Things: "Shaking the Dice"
Previously, researchers tried to force the AI to be more creative by adding "noise" (randomness) to the words it chose.
- The Analogy: Imagine the student is writing a story. To make it creative, you tell them, "Every time you pick a word, roll a die. If it's a 6, pick a random word instead."
- The Flaw: This creates chaos. The story starts making no sense because the randomness happens word-by-word. The student forgets the plot halfway through because the "noise" broke the flow. In AI terms, this destroys the Chain of Thought (the logical flow of reasoning).
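To make the flaw concrete, here is a toy sketch of this token-level ("action-space") noise. The function name, the stand-in `step_fn`, and the perturbation probability are all illustrative assumptions for exposition, not the paper's implementation:

```python
import random

def decode_with_token_noise(step_fn, vocab, n_steps, noise_p, seed=0):
    """Toy decoder with action-space noise: at each step, with
    probability `noise_p`, the model's chosen token is replaced by a
    uniformly random one. All names here are illustrative stand-ins."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_steps):
        token = step_fn(out)           # the model's intended next token
        if rng.random() < noise_p:     # per-step perturbation ...
            token = rng.choice(vocab)  # ... injects a random token mid-thought
        out.append(token)
    return out
```

Because the perturbation is re-rolled at every step, any single step can derail the sequence, which is exactly how word-by-word noise breaks the Chain of Thought.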
3. The New Solution: "The Twin Experiment" (Parameter-Space Noise)
The authors propose a smarter way to explore. Instead of shaking the words, they shake the student's brain (the model's internal settings) before they start thinking.
- The Analogy: Imagine you have a main student (the AI) and a "Twin" student.
- Before the Twin starts solving the problem, you give them a pair of goggles with a slightly different tint.
- Because of the goggles, the Twin sees the problem slightly differently. They might think, "Oh, I should try this angle I never considered before!"
- Crucially, the goggles stay on the whole time. The Twin doesn't change their mind halfway through. They follow one consistent, slightly different strategy from start to finish.
- Why it works: This creates consistent exploration. The AI tries a whole new "way of thinking" for the entire problem, rather than just stumbling randomly word-by-word. This preserves the logical flow (Chain of Thought) while still finding new paths.
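The "Twin" idea can be sketched in a few lines: perturb the parameters once, before generation, and then decode the entire answer with that single fixed perturbation. The flat-list parameter layout, function names, and `sigma` are simplifying assumptions for illustration:

```python
import random

def perturb_params(params, sigma, seed=0):
    """One Gaussian perturbation applied to the policy's parameters
    BEFORE generation, then held fixed for the whole rollout.
    The flat-list weight layout is an illustrative simplification."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, sigma) for w in params]

def rollout(params, prompt, n_steps, policy_step):
    """Generate a full chain of thought with ONE consistent set of
    (possibly perturbed) parameters -- no per-token randomness."""
    out = list(prompt)
    for _ in range(n_steps):
        out.append(policy_step(params, out))
    return out
```

The key contrast with token-level noise: here the randomness is sampled once (the "goggles" go on), so every step of the rollout is internally consistent with every other step.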
4. The Two "Safety Nets"
Since the AI is now learning from the "Twin" (who sees things differently) but needs to update the "Main Student," there are two technical challenges. The authors added two clever fixes:
Safety Net #1: The "Translator" (Truncated Importance Sampling)
- The Problem: The Main Student might get confused if the Twin's answers are too weird. "Wait, why did you do it that way?"
- The Fix: The system acts like a translator. It says, "Okay, that answer was weird, but it was actually correct. Let's give it credit, but not too much credit, so we don't get confused." This keeps the training stable.
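The "not too much credit" idea is truncated importance sampling: the sample's importance weight (how much more likely the main policy finds it than the perturbed policy did) is capped at a fixed maximum. A minimal sketch, where the cap value and log-probability inputs are illustrative placeholders:

```python
import math

def truncated_is_weight(logp_main, logp_perturbed, cap=2.0):
    """Truncated importance weight: the ratio between the main policy's
    and the perturbed (Twin) policy's probability of a trajectory,
    clipped at `cap` so one 'weird' sample can't dominate the update.
    The cap value is an illustrative assumption."""
    ratio = math.exp(logp_main - logp_perturbed)
    return min(ratio, cap)
```

Clipping the ratio keeps gradient updates bounded, which is the stability the "translator" analogy is pointing at.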
Safety Net #2: The "Smart Coach" (Adaptive Noise Scheduler)
- The Problem: How much tint should the goggles have? Too little, and the Twin isn't creative. Too much, and the Twin goes off the rails.
- The Fix: Instead of a human coach guessing, they built a Smart Coach.
- If the AI is confident and boring (generating the same old answers), the Coach says, "Put on darker goggles! We need more exploration!"
- If the AI is already struggling and confused, the Coach says, "Take the goggles off! Let's stick to what we know."
- This happens in real-time, automatically adjusting the "creativity level" based on how the AI is doing.
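As a rough sketch of such a feedback loop: widen the noise when rollouts look too similar, shrink it when training degrades. The diversity thresholds, the multiplicative update rule, and the signals themselves are all assumptions for illustration, not the paper's exact schedule:

```python
def update_sigma(sigma, diversity, struggling, lo=0.2, hi=0.8, factor=1.1):
    """Toy adaptive noise scheduler. `diversity` is some measure of how
    varied the rollouts are (0 = identical, 1 = maximally varied);
    `struggling` flags degrading performance. All thresholds and the
    multiplicative rule are illustrative assumptions."""
    if struggling:          # model is confused: take the goggles off
        return sigma / factor
    if diversity < lo:      # answers too similar: darker goggles
        return sigma * factor
    if diversity > hi:      # already plenty of variety: ease off
        return sigma / factor
    return sigma            # in the sweet spot: leave it alone
```

Running this once per training step gives the "real-time" adjustment described above: the noise scale tracks how the model is actually behaving instead of being hand-tuned.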
5. The Result: Breaking the Ceiling
When they tested this new method (called PSN-GRPO) on hard math problems:
- Standard AI: Could solve a problem if you gave it 10 tries.
- Old "Shaking" AI: Got confused and did worse with 10 tries.
- New "Twin" AI: When you gave it 256 tries, it didn't just pick the best of the old answers; it actually discovered new ways to solve the problem that the original AI never thought of.
Summary
The paper is about teaching AI to be a better explorer. Instead of randomly stumbling around (which breaks logic), they give the AI a "different perspective" for the whole journey. This allows the AI to find new, high-quality solutions that were previously hidden, especially when you give it plenty of time and attempts to solve a problem.
It's the difference between a student who memorizes the textbook and one who learns how to think outside the box, consistently and logically.