Imagine you are training a team of student detectives (AI models) to solve a complex mystery. You give them a single clue at the end of the day: "Did you solve the case? Yes or No." This is a sparse reward—you don't tell them which specific step they got right or wrong, just the final result.
To help them learn, you use a method called Intra-Group Learning. You send out a group of 8 detectives to solve the same case. At the end, you compare them:
- The 4 who solved it get a "Good Job" signal.
- The 4 who failed get a "Try Again" signal.
- The team learns by comparing the winners to the losers.
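In RL terms, this detective setup is a group-relative advantage: each response's reward is compared against the mean (and spread) of its group. A minimal sketch of how such an advantage might be computed — the function name and normalization details here are illustrative, not taken from the paper:

```python
# Group-relative advantage: score each rollout's reward against the
# group's mean and standard deviation.
def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        std = 1.0  # all rewards tied: no preference within the group
    return [(r - mean) / std for r in rewards]

# 8 detectives, sparse 0/1 rewards: 4 solved the case, 4 did not.
rewards = [1, 1, 1, 1, 0, 0, 0, 0]
advantages = group_relative_advantages(rewards)
# Winners get a positive advantage ("Good Job"), losers a negative one
# ("Try Again"); the group mean itself carries no signal.
```

Note that the sparse end-of-episode reward is all the method ever sees; everything else is inferred from the winner-versus-loser comparison.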
The Problem: The "Ghost Tax" and the "Echo Chamber"
The paper argues that while this works well at first, long-term training breaks down due to two invisible bugs:
The "Ghost Tax" (Learning Tax):
Imagine the detectives all say the same boring phrase at the start of their report: "The investigation began..." This phrase has nothing to do with solving the case. However, because the "winning" detectives happened to say it, and the "losing" ones didn't (or said it differently), the AI gets confused. It thinks, "Maybe saying 'The investigation began' is the secret to winning!"
The model starts wasting energy updating these boring, irrelevant words. It's like paying a "tax" on useless information. Over time, the model gets worse at solving the actual problem because it's too busy polishing the boring parts of its speech.
The "Echo Chamber" (Entropy Collapse):
Imagine there are two perfectly correct ways to solve the case:
- Solution A: "The butler did it."

- Solution B: "The butler committed the crime."
Both are right. But because of how the math works, the AI might accidentally start favoring Solution A slightly more than Solution B. Next time, it favors A even more. Eventually, it stops generating Solution B entirely. It collapses into a single, repetitive way of speaking, losing its creativity and ability to explore different solutions.
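This rich-get-richer dynamic can be shown with a toy feedback loop. The setup below is a deliberately simplified illustration, not the paper's model: two interchangeable correct answers, where each update is weighted by the current probability of emitting that answer.

```python
import math

def prob_a(logit_a, logit_b):
    """Softmax probability of choosing answer A over answer B."""
    ea, eb = math.exp(logit_a), math.exp(logit_b)
    return ea / (ea + eb)

logit_a, logit_b = 0.1, 0.0  # A starts with a tiny accidental edge
lr = 0.5
for _ in range(50):
    p = prob_a(logit_a, logit_b)
    # Both answers are correct, so both are reinforced -- but each
    # update is weighted by how often the model currently emits it.
    logit_a += lr * p
    logit_b += lr * (1 - p)

p = prob_a(logit_a, logit_b)
# p drifts toward 1.0: answer B is effectively never generated again.
```

The initial 0.1 edge is arbitrary; any asymmetry, however small, gets amplified until the distribution collapses onto a single phrasing.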
The Root Cause: The "Coupled Chain"
Why does this happen? The paper uses a metaphor of a linked chain.
In many current AI methods, the "score" for a specific word is tied to the entire length of the story.
- If Detective 1 wrote a long, rambling story that happened to be correct, their entire chain of words gets a high score.
- If Detective 2 wrote a short, punchy story that was also correct, their chain gets a lower score.
Even though both solved the case, the math treats them differently. When the AI tries to cancel out the "noise" (the boring words) by comparing the two detectives, the math fails because the "chains" are linked. The noise from the long story doesn't perfectly cancel out the noise from the short story. The "Ghost Tax" accumulates, and the model drifts off course.
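The imperfect cancellation is easy to see in a toy calculation. Assume a length-coupled weighting scheme where each token's update is scaled by its own sequence's length — the 1/length scaling below is an illustrative stand-in for the coupling described above, not the paper's exact formula:

```python
# Two correct rollouts share the boring prefix "The investigation began",
# but one is long and one is short.
long_reply  = ["The", "investigation", "began", "rambling", "butler", "did", "it"]
short_reply = ["The", "investigation", "began", "butler", "guilty"]

def token_updates(tokens, advantage):
    # Token weight coupled to its own sequence's length.
    weight = advantage / len(tokens)
    return {tok: weight for tok in tokens}

u_long = token_updates(long_reply, +1.0)   # relative winner
u_short = token_updates(short_reply, -1.0) # relative loser

residual = u_long["The"] + u_short["The"]
# +1/7 - 1/5 != 0: the shared, irrelevant token keeps a leftover
# spurious update -- the "Ghost Tax".
```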
The Solution: "Decoupling the Chain"
The authors propose a simple but powerful fix: Break the chain.
Instead of letting each story's own score and length dictate the weight of every single word in it, they force the whole group to agree on one shared weight before calculating the updates.
The Analogy:
Imagine the detectives are in a meeting room.
- Old Way: Each detective shouts their score based on their own unique story length. The teacher tries to average them, but the math gets messy, and the teacher accidentally rewards the word "The" because Detective 1 said it 50 times.
- New Way (The Paper's Fix): Before anyone speaks, the teacher says, "For this round, we are all using the same volume knob." If the group's average performance is good, everyone turns their volume up by the exact same amount. If it's bad, everyone turns it down.
By forcing the group to use the same "volume" (weight) for the shared parts of the story, the boring, irrelevant words (like "The" or "The investigation began") cancel each other out perfectly.
- If Detective 1 and Detective 2 both said "The investigation began," and they are in the same group, the math now ensures that the "Good Job" signal from one cancels the "Try Again" signal from the other.
- The result? The model stops wasting energy on the boring words. It only learns from the parts that actually matter: the clues and the solution.
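Under the decoupled scheme, a toy calculation shows the cancellation becoming exact. The particular shared-weight value below is an arbitrary illustrative choice, not a quantity from the paper:

```python
# Same two rollouts, but every token in the group now gets the same
# weight, decoupled from each sequence's individual length.
long_reply  = ["The", "investigation", "began", "rambling", "butler", "did", "it"]
short_reply = ["The", "investigation", "began", "butler", "guilty"]

def token_updates(tokens, advantage, shared_weight):
    return {tok: advantage * shared_weight for tok in tokens}

w = 1.0 / 6  # one "volume knob" for the whole group (illustrative value)
u_long = token_updates(long_reply, +1.0, w)   # relative winner
u_short = token_updates(short_reply, -1.0, w) # relative loser

residual = u_long["The"] + u_short["The"]
# +w - w == 0: the shared "boring" tokens cancel exactly, and only
# tokens unique to winners or losers carry any net update.
```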
The Results
When the researchers applied this "Decoupled" method:
- Less Waste: The model stopped paying the "Ghost Tax." It learned faster because it wasn't distracted by irrelevant words.
- More Stability: The training didn't crash or get jittery.
- Better Performance: The model became smarter at math and coding tasks because it wasn't collapsing into a repetitive echo chamber.
In a Nutshell
Current AI training methods are like a teacher who accidentally rewards students for saying "Hello" because the smartest student happened to say it. This paper says, "Stop that!"
They found a mathematical rule: If you want an AI to learn from group comparisons, you must ensure that the "boring" parts of the conversation cancel each other out perfectly. If you don't, the AI gets confused, wastes time, and eventually stops thinking creatively. Their fix is a simple mathematical tweak that forces the AI to ignore the noise and focus only on the signal.