Original authors: Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu

Published 2026-05-07

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Teaching AI to Solve Puzzles

Imagine you are training a robot to solve math problems or write code. You give it a prompt, and it tries to generate an answer. To teach it, you use a method called Reinforcement Learning with Verifiable Rewards (RLVR).

Think of this like a game show. The robot (the AI) generates several different answers (responses) to a single question. A referee (a simple computer program) checks them:

If the answer is correct, the robot gets a "thumbs up" (positive reward).
If it's wrong, the robot gets a "thumbs down" (no reward, which becomes a negative signal once it is compared with the rest of the group).

The goal is to teach the robot to generate more "thumbs up" answers and fewer "thumbs down" ones. The paper focuses on a specific training method called Group Relative Policy Optimization (GRPO), which is popular because it is simple and does not need a separate critic model.
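To make the "thumbs up / thumbs down" signal concrete, here is a minimal sketch of the group normalization commonly used in GRPO-style training: each answer's reward is compared with the mean of its group and scaled by the group's standard deviation. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of one sampled group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# One question, eight sampled answers, two of them correct (reward 1).
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advantages = group_normalized_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r}  advantage={a:+.3f}")
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage, which is the sign that the rest of the article refers to.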

The Problem: How to Count the Votes

The core issue the paper tackles is a subtle but critical question: When the robot generates a group of answers, how do we calculate the "average lesson" to learn from?

The robot might generate 16 answers at once. Some are short (5 words), and some are long (500 words). Some are correct, and some are wrong. The training algorithm needs to combine all these individual words into one big "update" to improve the robot's brain.

There are two main ways people have been doing this, and the paper argues both have a hidden flaw:

1. The "Word-Count" Method (Token Aggregation)

How it works: You count every single word (token) from every answer and average them all together.
The Flaw (The "Long-Winded Villain"): Imagine a group of students taking a test.
- Student A gets the answer right but writes a very short, concise explanation (10 words).
- Student B gets the answer wrong but writes a massive, rambling essay (500 words).
- If you just count words, Student B's wrong answer has 50 times more "weight" in the average than Student A's correct answer.
- The Result: The lesson gets lopsided. The long, wrong answers dominate the average simply because they take up more space. This is called "Sign-Length Coupling": the balance between the positive lessons (from correct answers) and the negative lessons (from wrong answers) ends up depending on how long each kind of answer is, not just on how good it is.
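The student example can be checked with a few lines of arithmetic. This is a sketch under the assumption that "weight" means the share of the group's tokens that belongs to each answer; the function and variable names are hypothetical.

```python
def token_aggregation_weights(lengths):
    """Share of the group average each response gets when every token counts equally."""
    total_tokens = sum(lengths)
    return [n / total_tokens for n in lengths]

lengths = {"student_a_correct": 10, "student_b_wrong": 500}
weights = token_aggregation_weights(list(lengths.values()))

for name, w in zip(lengths, weights):
    print(f"{name}: {w:.1%} of the average")
```

With these lengths, the wrong answer accounts for roughly 98% of the average and the correct one for roughly 2%, which is the 50-to-1 imbalance described above.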

2. The "Per-Person" Method (Sequence Aggregation)

How it works: You first calculate the average lesson for each answer individually, and then you average those per-answer results together.
The Flaw (The "Lazy Voter"): Using the same student example:
- Student A (Short, Correct) gets 1 vote.
- Student B (Long, Wrong) gets 1 vote.
- The Result: This fixes the "long-winded villain" problem. But now a 10-word answer and a 500-word answer carry exactly the same total weight, so each word in the long answer counts 50 times less than each word in the short one. If there is a lot to learn from a long, detailed explanation, this method dilutes it. In other words, it "downweights" long responses.
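The same two students can be used to see what "one vote each" does at the level of individual words. In this sketch (hypothetical names, illustrative only), every response receives weight 1/G, and that weight is then spread evenly over its tokens.

```python
def sequence_aggregation_token_weights(lengths):
    """Weight of a single token in each response under per-sequence averaging."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]

lengths = [10, 500]  # Student A (short, correct), Student B (long, wrong)
per_token = sequence_aggregation_token_weights(lengths)
per_response = [w * n for w, n in zip(per_token, lengths)]

print("per-token weights:   ", per_token)
print("per-response weights:", per_response)
```

Both responses carry the same total weight, but a token in the 10-word answer weighs 50 times as much as a token in the 500-word answer.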

The Solution: "Balanced Aggregation" (BA)

The authors propose a new method called Balanced Aggregation (BA). It's like a smart referee who fixes the flaws of both previous methods.

How it works:

Sort the Answers: First, the referee separates the answers into two piles: the "Good" pile (thumbs up) and the "Bad" pile (thumbs down).
Count Words Inside the Piles: Inside the "Good" pile, the referee averages over every word in the pile. The "Bad" pile is averaged the same way, separately.
Balance the Piles: Finally, the referee combines the two piles. But here's the trick: each pile is weighted by how many answers it contains, not by how many words. With GRPO's usual scoring, this gives the "Good" pile and the "Bad" pile equal influence on the final decision, regardless of how many words are in each pile.
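The three steps above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `token_terms` stands for the per-token training terms of each answer, the pile weights follow the k/G and (G-k)/G rule stated in the technical summary, and skipping answers whose advantage is exactly zero is an assumption.

```python
def balanced_aggregation(token_terms, advantages):
    """Combine per-token terms pile by pile (hypothetical helper).

    token_terms: one list of per-token values per response.
    advantages:  one number per response; only its sign is used here.
    """
    group_size = len(token_terms)
    total = 0.0
    for sign in (1, -1):
        # Step 1: sort responses into the "good" or "bad" pile.
        pile = [toks for toks, a in zip(token_terms, advantages) if a * sign > 0]
        if not pile:
            continue  # an empty pile contributes nothing (assumption)
        # Step 2: average over every token inside the pile.
        flat = [t for toks in pile for t in toks]
        pile_mean = sum(flat) / len(flat)
        # Step 3: weight the pile by its share of responses, not of tokens.
        total += (len(pile) / group_size) * pile_mean
    return total

token_terms = [[0.2, 0.4], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0]]
advantages = [1.0, -0.5, -0.5]  # one good answer, two bad ones
result = balanced_aggregation(token_terms, advantages)
print(result)
```

In this toy group, the good pile has mean 0.3 and weight 1/3, and the bad pile has mean 1.0 and weight 2/3, no matter how the six "bad" tokens are split between the two wrong answers.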

The Analogy:
Imagine a town council voting on a new park.

Old Method 1 (Word Count): People who talk the longest get the most votes, even if they are wrong.
Old Method 2 (Per-Person): Every person gets one vote, even if one person wrote a 50-page report and another just said "Yes."
Balanced Aggregation: The council splits into "Pro-Park" and "Anti-Park" groups. They average the arguments within each group. Then, they give the "Pro" group and the "Anti" group equal weight in the final decision, ensuring that the length of the arguments doesn't skew the result.

What Did They Find?

The researchers tested this new method on two different AI models (Qwen2.5-Math-7B and Qwen3-1.7B) using math and coding datasets.

Stability is Key: The old methods often worked well at the start but then crashed or became unstable later in training. The "Word-Count" method was especially unstable when the AI started writing very long, wrong answers.
Better Results: The Balanced Aggregation method consistently produced better final scores. It was more stable, meaning the AI learned steadily without wild swings in performance.
Why It Matters: The paper shows that the "best" way to train an AI depends on how much the length of the answers varies.
- If answers vary wildly in length, the "Per-Person" method can be unfair, because it dilutes the long answers.
- If the "Good" and "Bad" answers differ a lot in length, the "Word-Count" method can be risky, because the longer side dominates the update.
- Balanced Aggregation works well in both situations because it fixes the specific bias of each method.

The Takeaway

The paper concludes that how you "mix the ingredients" (aggregate the data) in AI training is not just a tiny technical detail; it's a major design choice that determines whether the AI learns effectively or gets confused. By simply separating the "good" and "bad" examples before averaging them, the authors created a method that is more robust, stable, and effective for teaching AI to reason and code.

Technical Summary: Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Problem Statement

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing reasoning and code generation in Large Language Models (LLMs), with Group Relative Policy Optimization (GRPO) being a widely adopted method due to its simplicity and lack of a separate critic. However, a critical design choice within GRPO remains underexplored: the aggregation rule for token-level policy gradient terms within a sampled group.

Current practices generally fall into two categories:

Sequence Aggregation: The default in standard GRPO, which averages token contributions within each response first, then averages across responses. This implicitly downweights longer responses because each sequence contributes equally regardless of token count.
Token Aggregation: Advocated by recent works like DAPO and Dr.GRPO, which averages the clipped objective directly over all tokens in the sampled group.

The paper identifies that these two rules induce systematically different optimization biases:

Token Aggregation introduces a sign-length coupling bias. The relative contribution of positive (advantage > 0) and negative (advantage < 0) samples depends not only on their normalized advantages but also on their average response lengths. If positive and negative responses have different length distributions, token aggregation can systematically amplify one side of the update, leading to unstable training dynamics.
Sequence Aggregation removes the sign-length coupling by assigning equal weight to each response. However, it introduces a sequence equal-weighting bias, where longer responses are implicitly downweighted because the loss is averaged per sequence rather than per token.

Neither approach is universally optimal; the effectiveness of each depends on the variance in response lengths and the gap in lengths between positive and negative samples.
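For concreteness, the two rules can be written side by side. The notation here is an assumption introduced for illustration and may differ from the paper's: $G$ is the group size, $|o_i|$ is the token count of response $i$, and $\ell_{i,t}$ is the clipped per-token objective term for token $t$ of response $i$.

$$
J_{\text{seq}} = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\ell_{i,t},
\qquad
J_{\text{tok}} = \frac{1}{\sum_{j=1}^{G}|o_j|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\ell_{i,t}
$$

Under this notation, a single token of response $i$ carries weight $1/(G\,|o_i|)$ in $J_{\text{seq}}$, which shrinks as the response gets longer. In $J_{\text{tok}}$ every token carries the same weight $1/\sum_j |o_j|$, so a response's total weight grows in proportion to its length.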

Methodology: Balanced Aggregation (BA)

To address the tension between these biases, the authors propose Balanced Aggregation (BA), a simple drop-in replacement for the aggregation step in GRPO-style RLVR.

The core mechanism of BA involves a three-step process:

Partitioning: The sampled group of responses is split into two subsets based on the sign of their normalized advantages: a positive subset ($S_+$) and a negative subset ($S_-$).
Intra-Subset Averaging: Token-level means are computed separately within each subset. This retains the token-level averaging property within sign groups, avoiding the strong per-sequence equal weighting of standard sequence aggregation.
Inter-Subset Combination: The two subset losses are combined using weights proportional to the number of sequences in each subset ($k/G$ for positive and $(G-k)/G$ for negative, where $k$ is the count of positive sequences).
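Written as a single expression, the three steps might look as follows. This is a sketch in assumed notation rather than the paper's exact formula: $|o_i|$ is the token count of response $i$ and $\ell_{i,t}$ is the clipped per-token objective term for token $t$ of response $i$.

$$
J_{\text{BA}} = \frac{k}{G}\cdot\frac{\sum_{i\in S_+}\sum_{t=1}^{|o_i|}\ell_{i,t}}{\sum_{i\in S_+}|o_i|}
\;+\;
\frac{G-k}{G}\cdot\frac{\sum_{i\in S_-}\sum_{t=1}^{|o_i|}\ell_{i,t}}{\sum_{i\in S_-}|o_i|}
$$

Each fraction is a token-level mean inside one sign subset, and the prefactors $k/G$ and $(G-k)/G$ depend only on how many sequences fall in each subset, not on how many tokens they contain.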

Theoretical Justification:
In the standard binary-reward GRPO setting, this specific weighting scheme ensures that BA induces the same inter-sign balancing prefactor as sequence aggregation ($\sqrt{k(G-k)}/G$). Consequently, BA preserves the sign-balance property of sequence aggregation (removing sign-length coupling) while avoiding the strong sequence equal-weighting effect that penalizes long responses. The paper also provides a generalized formulation for non-binary rewards where weights are determined by advantage mass rather than sequence count.
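The prefactor can be recovered with a short calculation. The derivation below is a sketch that assumes rewards in $\{0, 1\}$, $k$ correct responses with $0 < k < G$, and advantages normalized by the group mean $k/G$ and the population standard deviation $\sqrt{k(G-k)}/G$ with no stabilizing constant.

$$
A_+ = \frac{1 - k/G}{\sqrt{k(G-k)}/G} = \sqrt{\frac{G-k}{k}},
\qquad
A_- = \frac{0 - k/G}{\sqrt{k(G-k)}/G} = -\sqrt{\frac{k}{G-k}}
$$

$$
\frac{k}{G}\,A_+ = \frac{\sqrt{k(G-k)}}{G},
\qquad
\frac{G-k}{G}\,|A_-| = \frac{\sqrt{k(G-k)}}{G}
$$

Under these assumptions, both sign subsets enter the update with the same magnitude. Sequence aggregation gives the same totals, since its $k$ positive responses each carry weight $1/G$ and its $G-k$ negative responses each carry weight $1/G$.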

Key Contributions

Unified Analysis of Aggregation Bias: The paper provides a formal analysis demonstrating that loss aggregation in GRPO is not a benign implementation detail. It characterizes the specific "sign-length coupling" bias in token aggregation and the "sequence equal-weighting" bias in sequence aggregation.
Balanced Aggregation (BA): The proposal of BA as a simple, drop-in alternative that decouples sign and length biases. It performs token-level averaging within sign groups but balances the groups based on sequence counts.
Empirical Validation and Diagnostic Criteria: Extensive experiments showing that the relative effectiveness of token vs. sequence aggregation is governed by response length variance and the positive-negative length gap. The paper demonstrates that BA consistently outperforms both baselines across different models and datasets.

Experimental Results

The authors evaluated BA using Qwen2.5-Math-7B and Qwen3-1.7B on two training datasets (DAPO-17k and Polaris). Performance was measured across six benchmarks: Math-500, AIME 2024, AIME 2025, OlympiadBench, Minerva-MATH, and LiveCodeBench.

Key Findings:

Training Stability: Token aggregation often leads to severe performance degradation in later training stages (high peak-to-last-step drop), whereas BA maintains robust last-step accuracy.
Model-Dependent Dynamics:
- On Qwen2.5-Math-7B (which exhibited larger response-length variation), token aggregation initially outperformed sequence aggregation, but BA surpassed both in peak and last-step performance.
- On Qwen3-1.7B (which exhibited a larger positive-negative length gap), sequence aggregation was more stable than token aggregation, but BA again achieved the highest peak and last-step metrics.
Loss Dynamics: Analysis of policy-gradient loss trajectories revealed that token aggregation causes massive drifts away from zero due to sign-length coupling, while BA and sequence aggregation remain stable near zero.
Overall Performance: BA consistently delivered stronger final performance and better training stability compared to standard token and sequence aggregation across all tested regimes.
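One way to see why the token-aggregated value can drift away from zero is a toy calculation. The sketch below assumes binary rewards, population-std normalization, and an importance ratio of exactly 1, so that every token's term equals its response's advantage; the function names and the example lengths are hypothetical, and the numbers are not taken from the paper's experiments.

```python
from math import sqrt

def binary_advantages(rewards):
    """Group-normalized advantages for 0/1 rewards (needs a mixed group)."""
    p = sum(rewards) / len(rewards)
    std = sqrt(p * (1 - p))
    return [(r - p) / std for r in rewards]

def token_agg(adv, lengths):
    # Every token counts equally across the whole group.
    return sum(a * n for a, n in zip(adv, lengths)) / sum(lengths)

def sequence_agg(adv, lengths):
    # Per-response mean of a constant is that constant; then average responses.
    return sum(adv) / len(adv)

def balanced_agg(adv, lengths):
    group_size = len(adv)
    total = 0.0
    for sign in (1, -1):
        idx = [i for i, a in enumerate(adv) if a * sign > 0]
        if not idx:
            continue
        subset_tokens = sum(lengths[i] for i in idx)
        subset_mean = sum(adv[i] * lengths[i] for i in idx) / subset_tokens
        total += (len(idx) / group_size) * subset_mean
    return total

rewards = [1, 1, 0, 0]
lengths = [100, 120, 400, 600]  # wrong answers are much longer here
adv = binary_advantages(rewards)

token_value = token_agg(adv, lengths)
sequence_value = sequence_agg(adv, lengths)
balanced_value = balanced_agg(adv, lengths)
print(token_value, sequence_value, balanced_value)
```

In this toy group the sequence and balanced values are zero, while the token-aggregated value is pulled toward the negative side because the wrong answers hold most of the tokens.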

Significance and Claims

The paper claims that aggregation is a first-class design choice in GRPO-style RLVR, rather than a minor implementation detail. The significance of the work lies in:

Stability: BA provides a more robust optimization signal that prevents the training collapse often observed with token aggregation in later stages.
Universality: Unlike token or sequence aggregation, which perform well only under specific length distribution conditions, BA remained robust across the two models and two training datasets tested.
Design Principle: The work highlights that effective RLVR requires balancing inter-sign weighting (to prevent bias) without discarding within-sign token information (to preserve signal from long responses).

The authors conclude that Balanced Aggregation offers a simple yet effective solution to the inherent trade-offs in GRPO, leading to more stable optimization and improved final model performance in reasoning and coding tasks.
