Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Imagine you have a very smart, but overly chatty student named LLM (Large Language Model). When you ask this student a hard math problem, they don't just give you the answer. Instead, they write out a massive, 10-page essay explaining every single thought they had, including all the wrong turns, the "wait, let me check that again" moments, and the repetitive double-checking.

This is called Chain-of-Thought (CoT) reasoning. While it helps the student get the right answer, it's incredibly slow, expensive to run (like burning a lot of fuel to drive a car), and often includes so much "fluff" that the student starts to confuse themselves.

The paper introduces a new training method called FGO (Fine-grained Group Policy Optimization) to teach this student how to be concise without losing their smarts.

Here is the breakdown using simple analogies:

1. The Problem: The "Over-Thinker" and the "Boring Teacher"

Currently, if you ask the student a question, they generate a group of possible answers (let's say 8 different drafts).

The Old Method (GRPO): The teacher looks at these 8 drafts. If a draft gets the right answer, the teacher gives it a generic "Good job!" sticker. If it gets the answer wrong, it gets a "Try again" sticker.
- The Flaw: This is like a teacher who doesn't care how you got the answer. If you wrote a 10-page essay to get a "Good job," you get the same reward as someone who wrote a 1-sentence answer. So, the student keeps writing long essays. Also, if all 8 drafts are wrong (or all 8 are right), every draft gets the same sticker, so the student learns nothing about which draft was better or worse. This is called inefficient data utilization.
- The Second Flaw: Over time, the student gets scared to take risks and starts writing the exact same boring, safe sentences in every draft. This is called entropy collapse (the student stops being creative or exploring new ways to think).

2. The Solution: The "Smart Coach" (FGO)

FGO is like a new, very observant coach who watches the student's 8 drafts and gives specific, tailored feedback based on two things: Length and Confidence.

The coach splits the 8 drafts into two teams:

Team Correct: The drafts that got the right answer.
Team Incorrect: The drafts that got the wrong answer.

How the Coach treats Team Correct (The Winners):

The coach says: "Great job getting the answer right! But, I want you to be faster."

The Rule: If you got it right, but you wrote a short, confident essay, you get a big gold star. If you got it right but wrote a 10-page ramble, you get a smaller star.
The Analogy: Imagine a race. If two runners finish first, the one who ran the most efficient path gets the bigger trophy. This encourages the student to cut out the fluff and "over-thinking" while keeping the correct logic.

How the Coach treats Team Incorrect (The Losers):

The coach says: "You got it wrong. But I want to see you try different things next time."

The Rule: If you got it wrong, the coach actually rewards you for being shorter and more exploratory (trying wild, different ideas).
The Analogy: Imagine a detective solving a crime. If the first 5 suspects are innocent, the detective shouldn't just keep asking the same 5 questions. They should try a new angle. The coach encourages the student to try short, different, "out-of-the-box" paths to find the right answer, rather than just repeating the same long, wrong path.

3. The Results: Shorter, Smarter, and Safer

By using this "Smart Coach" method, the paper shows that:

The essays get shorter: The student learns to cut out the "Wait, let me check that again" fluff. The answers are 30% to 50% shorter.
The answers stay correct: Because the coach still rewards the correct logic, the student doesn't lose their smarts. In fact, they often get better at math because they aren't getting confused by their own rambling.
No more "Boring" students: The student keeps exploring new ideas (high entropy) instead of just copying the same safe answer every time.
No wasted effort: Every single draft the student writes is used to teach them something, even the wrong ones.

Summary

Think of FGO as a personal trainer for a brain.

Old Training: "Good job, here is a cookie." (No matter how you did it).
FGO Training: "Good job! But next time, try to solve it in 3 steps instead of 10. And if you get it wrong, try a completely different shortcut."

The result is a model that solves complex math problems quickly and efficiently, without the annoying habit of over-explaining everything.

1. Problem Statement

Large Language Models (LLMs) equipped with Chain-of-Thought (CoT) reasoning often generate excessively verbose outputs. This "overthinking" leads to:

Increased Computational Costs: Higher latency and token usage without proportional performance gains.
Performance Degradation: Excessive length can introduce redundancy and errors (e.g., double-checking that leads to confusion).
Limitations of Existing Methods: Current CoT compression techniques fall into three categories, each with flaws:
- Token-level: Filters tokens but often breaks logical consistency.
- Instance-level: Relies on an auxiliary compressor model, making performance dependent on a second model.
- Chunk-level: Preserves reflection but incurs high computational overhead due to repeated segmentation.

Furthermore, the standard Reinforcement Learning (RL) algorithm used for post-training, Group Relative Policy Optimization (GRPO), suffers from two critical limitations in this context:

Inefficient Data Utilization: When all responses in a group receive the same reward (e.g., all correct or all incorrect), the advantage function becomes zero, rendering the data useless for gradient updates.
Entropy Collapse: During training, response entropy decreases sharply, causing the model to generate nearly identical, repetitive outputs, which exacerbates the data utilization issue.
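The zero-advantage problem can be seen with a few lines of arithmetic. The sketch below assumes the commonly described GRPO normalization, in which each reward is centered on the group mean and divided by the group standard deviation; the function name `grpo_advantages` and the small `eps` constant are illustrative choices, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    eps is a hypothetical safeguard against division by zero.
    """
    mu = sum(rewards) / len(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group gives a usable learning signal.
mixed = grpo_advantages([1, 0, 0, 1])

# Uniform groups give none: every advantage is exactly zero.
all_wrong = grpo_advantages([0, 0, 0, 0])
all_right = grpo_advantages([1, 1, 1, 1])

print(mixed)
print(all_wrong)
print(all_right)
```

With binary rewards, any group that is entirely correct or entirely incorrect contributes nothing to the gradient, which is the inefficiency described above.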

2. Methodology: Fine-grained Group Policy Optimization (FGO)

The authors propose FGO, an enhanced RL algorithm that refines GRPO by subdividing response groups and applying fine-grained reward shaping based on length and entropy.

Core Mechanism

Given a question $s$ and a group of $G$ responses $\{o_i\}$, FGO first verifies each response against the ground truth to assign a binary reward ($1$ for correct, $0$ for incorrect). It then splits the group into two subgroups:

Correct-Response Subgroup ($G^+$): Responses where $r_i = 1$.
Incorrect-Response Subgroup ($G^-$): Responses where $r_i = 0$.

Reward Shaping

FGO applies distinct reward shaping strategies to each subgroup to optimize for both accuracy and efficiency:

For Correct Responses ($G^+$):
- Goal: Maintain accuracy while encouraging shorter, more confident reasoning.
- Mechanism: Assigns a base reward of $1$. It then calculates a fine-grained weight $W^+$ based on **Length ($L$)** and **Entropy ($H$)**.
- Formula: $W^+ = \text{Softmax}\left[ \left(\frac{\text{mean}(L^+)}{L^+}\right)^\alpha \times \left(\frac{\text{mean}(H^+)}{H^+}\right)^\beta \right]$.
- Logic: Shorter responses (lower $L$) and lower-entropy (more confident) responses receive higher weights. The hyperparameter $\alpha$ controls the degree of length compression.
For Incorrect Responses ($G^-$):
- Goal: Penalize errors but encourage exploration to find correct paths.
- Mechanism: Assigns a base reward of $-1$ (instead of $0$), because a base reward of $0$ would cancel any weight multiplied into it.
- Formula: $W^- = \text{Softmax}\left[ \left(\frac{L^-}{\text{mean}(L^-)}\right)^\alpha \times \left(\frac{\text{mean}(H^-)}{H^-}\right)^\beta \right]$.
- Logic: The length ratio is inverted relative to the correct group, and the weight multiplies a negative base reward, so a larger weight means a larger penalty. Longer responses and lower-entropy (confidently wrong) responses are therefore penalized most, while shorter, higher-entropy (more exploratory) incorrect responses are penalized least.
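The two weighting formulas can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fgo_rewards`, the default value of `beta`, and the choice to multiply each softmax weight directly into the base reward are all assumptions made for the example. Lengths and entropies are assumed to be strictly positive. Because the incorrect subgroup's weight multiplies a base reward of $-1$, a larger weight there means a larger penalty.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean(xs):
    return sum(xs) / len(xs)

def fgo_rewards(correct, lengths, entropies, alpha=0.01, beta=0.01):
    """Shaped reward per response (hypothetical helper).

    correct:   list of bools from the answer verifier
    lengths:   token count of each response (must be > 0)
    entropies: one entropy value per response (must be > 0)
    """
    rewards = [0.0] * len(correct)
    pos = [i for i, c in enumerate(correct) if c]
    neg = [i for i, c in enumerate(correct) if not c]

    if pos:
        m_len = mean([lengths[i] for i in pos])
        m_ent = mean([entropies[i] for i in pos])
        # W+: shorter and lower-entropy correct responses score higher.
        scores = [(m_len / lengths[i]) ** alpha * (m_ent / entropies[i]) ** beta
                  for i in pos]
        for i, w in zip(pos, softmax(scores)):
            rewards[i] = 1.0 * w      # base reward +1

    if neg:
        m_len = mean([lengths[i] for i in neg])
        m_ent = mean([entropies[i] for i in neg])
        # W-: longer and lower-entropy incorrect responses score higher,
        # which means a larger penalty once multiplied by -1.
        scores = [(lengths[i] / m_len) ** alpha * (m_ent / entropies[i]) ** beta
                  for i in neg]
        for i, w in zip(neg, softmax(scores)):
            rewards[i] = -1.0 * w     # base reward -1

    return rewards

# Two correct and two incorrect responses; entropy is held equal within
# each subgroup so that only length drives the difference.
demo = fgo_rewards(
    correct=[True, True, False, False],
    lengths=[100, 400, 200, 800],
    entropies=[0.5, 0.5, 1.0, 1.0],
    alpha=1.0,
    beta=1.0,
)
print(demo)
```

In this toy group the short correct response earns the largest positive reward, and the long incorrect response receives the most negative one.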

Optimization

The final reward set $R = \{R^+, R^-\}$ is used to compute the advantage function $A_{i,t}$, omitting the standard deviation term for stability (following Dr. GRPO). The policy is then updated using the standard clipped objective function of GRPO, with these refined rewards in place of the binary ones.
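As a rough sketch of this step, the snippet below centers the shaped rewards on the group mean without dividing by the standard deviation, then applies a PPO-style clipped surrogate to a single token's probability ratio. The function names and the clip range of 0.2 are assumptions chosen for illustration; the summary does not state the value used in the paper.

```python
def mean_centered_advantages(rewards):
    """Advantage = reward minus group mean, with no std normalization."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped objective for one token.

    ratio is pi_new(token) / pi_old(token); every token of response i
    shares that response's advantage. clip_eps=0.2 is an assumed value.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# An all-correct group: binary rewards would be [1, 1, 1] and give zero
# advantages, but shaped rewards still separate the responses.
shaped = [0.40, 0.35, 0.25]
advantages = mean_centered_advantages(shaped)
print(advantages)

# A large ratio on a positive advantage is capped at 1 + clip_eps.
print(clipped_surrogate(1.5, advantages[0]))
```

The point of the example is that the advantages are non-zero even though every response in the group is correct.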

3. Key Contributions

FGO Algorithm: A novel RL-based approach that effectively compresses long CoTs while preserving or improving reasoning performance.
Solving GRPO Limitations:
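- Data Utilization: By splitting groups and applying differential weighting, FGO gives responses distinct rewards even when a group is entirely correct or entirely incorrect, which removes the "zero advantage" problem of standard GRPO and lets every sampled group contribute to training.
- Entropy Collapse: Entropy is weighted in opposite directions for the two subgroups (confident correct responses are rewarded more, confident incorrect responses are penalized more), which keeps the model from collapsing into repetitive, low-entropy outputs.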
Fine-Grained Reward Shaping: The integration of length and entropy signals allows for precise control over the trade-off between reasoning depth and brevity.
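The summary does not say how the per-response entropy $H$ is computed. One common choice, assumed here purely for illustration, is the Shannon entropy of the model's next-token distribution averaged over the tokens of a response. The function names below are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def response_entropy(token_distributions):
    """Mean per-token entropy over a response (assumed definition of H)."""
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)

# A confident response: nearly all mass on one token at each step.
confident = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]

# An exploratory response: mass spread evenly across tokens.
exploratory = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

print(response_entropy(confident))
print(response_entropy(exploratory))
```

Under this definition, a "confident" response is one whose token distributions are sharply peaked, and an "exploratory" one has flatter distributions.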

4. Experimental Results

The authors evaluated FGO on four reasoning LLMs (Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, ZR1-1.5B, and Qwen2.5-Math-1.5B-Instruct) across four benchmarks: MATH500, AIME24, AMC23, and Minerva.

Compression Efficiency: FGO achieved substantial reductions in token length (e.g., reducing average length from ~700 to ~300 tokens on MATH500) while maintaining or improving accuracy.
Accuracy Contribution per Token (ACT): FGO consistently outperformed Vanilla, GRPO, and TLDR methods in ACT, indicating higher efficiency in token usage.
Self-Reflection Preservation: Despite compression, FGO models retained the ability to self-reflect (identified by keywords like "wait" and "hmm"), suggesting that compression did not strip the model of its reasoning capabilities.
GRPO Limitations Addressed:
- Data Utilization: Table 3 shows GRPO had thousands of invalid samples (zero advantage) across training sets, whereas FGO had 0 invalid samples.
- Entropy Dynamics: Training curves (Fig. 3) showed that FGO maintained higher entropy levels throughout training compared to GRPO, which suffered from rapid entropy collapse.
Hyperparameter Sensitivity: Ablation studies on $\alpha$ (length control) showed that $\alpha = 0.01$ offered the best balance between compression and accuracy, while $\alpha=1$ (aggressive compression) hurt accuracy.
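To get a feel for why $\alpha$ matters, the sketch below evaluates only the length term of the correct-group weight, with the entropy term dropped (equivalent to $\beta = 0$), for three hypothetical response lengths. The function name `length_weights` and the example lengths are assumptions for illustration; this does not reproduce the paper's ablation.

```python
import math

def length_weights(lengths, alpha):
    """Softmax over (mean(L) / L) ** alpha, the length term of W+."""
    mean_len = sum(lengths) / len(lengths)
    scores = [(mean_len / n) ** alpha for n in lengths]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

lengths = [200, 400, 800]  # hypothetical token counts of correct responses
gentle = length_weights(lengths, alpha=0.01)
aggressive = length_weights(lengths, alpha=1.0)

print("alpha=0.01:", [round(w, 4) for w in gentle])
print("alpha=1.0: ", [round(w, 4) for w in aggressive])
```

With a small $\alpha$ the weights stay close to uniform, so length acts as a mild tie-breaker among correct responses. With $\alpha = 1$ the shortest response takes most of the weight, which is consistent with the more aggressive compression the ablation reports.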

5. Significance

This paper presents a significant advancement in making LLM reasoning more efficient and cost-effective. By addressing the fundamental flaws of GRPO (data inefficiency and entropy collapse), FGO enables the deployment of reasoning models that are:

Faster: Significantly reduced inference latency due to shorter CoT.
Cheaper: Lower token consumption reduces API costs and computational load.
Robust: Maintains high accuracy and self-correction capabilities, avoiding the pitfalls of "overthinking."

The work suggests that reasoning ability does not scale linearly with length; rather, quality and efficiency of reasoning are paramount. FGO provides a scalable framework to achieve this, making it highly relevant for deploying LLMs in resource-constrained environments.