Imagine you are teaching a brilliant but slightly confused robot how to solve complex math problems. You want it to be smart, but you also want it to be efficient.
The paper introduces a new teaching method called T2T (Thickening-to-Thinning). It's based on a very human idea: how we actually learn.
The Core Idea: "Reading the Book Thick, Then Thin"
The authors use a famous Chinese metaphor from the mathematician Hua Luogeng:
- Thickening (The "Messy" Phase): When you first encounter a difficult, unfamiliar problem, you don't just guess the answer. You read the book "thick": you explore every angle, jot down messy notes, and try one approach after another. You need space to figure it out.
- Thinning (The "Polished" Phase): Once you finally understand the solution, you read the book "thin." You summarize the key points, throw away the messy drafts, and create a clean, concise cheat sheet so you can recall it instantly next time.
The Problem with Current AI:
Most AI training methods treat every correct answer the same, regardless of how long it took to reach. Worse, they apply a flat length penalty that punishes all long answers, even when the model was struggling and genuinely needed the extra steps to find the solution. It's like telling a student: "Whether you solved the hard problem after 10 minutes of thinking or guessed the easy one in 1 second, you get the same grade. But if you write too much, you lose points." That signal is contradictory, and it confuses the AI.
How T2T Works: The Two-Phase Reward System
T2T changes the rules of the game to mimic human learning. It uses a dynamic reward system that changes based on whether the AI is struggling or succeeding.
Phase 1: Thickening (When the AI is Wrong)
- The Situation: The AI tries to solve a hard problem and gets it wrong.
- The Old Way: The AI gets a "zero" score and tries again, maybe making the same mistake.
- The T2T Way: The AI gets a special reward for being long and detailed.
- Analogy: Imagine the teacher says, "You got it wrong, but that's okay! Since you were struggling, I want you to write a longer explanation next time. Explore more paths! Don't be afraid to be messy."
- Result: This encourages the AI to "think harder" and explore more possibilities (search space) when it doesn't know the answer.
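The thickening rule above can be sketched as a reward function. Everything here is illustrative: the function name, the 0.1 scale, and the token cap are assumptions for the sake of the example, not the paper's actual formula.

```python
def thickening_reward(is_correct: bool, num_tokens: int,
                      max_tokens: int = 4096) -> float:
    """Hypothetical thickening reward: when the answer is wrong,
    longer (more exploratory) responses score higher, up to a cap."""
    if is_correct:
        # Correct answers get the usual full reward in this sketch.
        return 1.0
    # Wrong answer: a small positive reward that grows with response
    # length, so the model is nudged to explore rather than give up.
    return 0.1 * min(num_tokens / max_tokens, 1.0)
```

Under this rule, a wrong answer that used 2,048 of 4,096 tokens earns 0.05, while an equally wrong one-line guess earns almost nothing, so "struggling harder" is no longer treated the same as quitting early.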
Phase 2: Thinning (When the AI is Right)
- The Situation: The AI finally solves the problem correctly.
- The Old Way: The AI gets a "perfect" score.
- The T2T Way: The AI gets a bonus for being short and concise.
- Analogy: Now the teacher says, "Great job! You solved it. But you wrote a whole novel to do it. Next time, try to solve it in a few sentences. Cut out the fluff."
- Result: This forces the AI to refine its thinking, removing redundant words and creating a "crystallized" version of the solution.
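Putting the two phases side by side, the dynamic reward can be sketched as a single piecewise function. The constants and exact shape below are assumptions chosen to make the idea concrete; the paper's real formula may differ.

```python
def t2t_reward(is_correct: bool, num_tokens: int,
               max_tokens: int = 4096) -> float:
    """Hypothetical two-phase T2T reward: pay for exploration when
    wrong (thickening), pay for brevity when right (thinning)."""
    length_ratio = min(num_tokens / max_tokens, 1.0)
    if not is_correct:
        # Thickening: a wrong but thorough attempt beats a wrong terse one.
        return 0.1 * length_ratio
    # Thinning: correct answers earn a brevity bonus on top of the
    # base reward, so shorter correct solutions score highest.
    return 1.0 + 0.5 * (1.0 - length_ratio)
```

Note how the sign of the length term flips with correctness: length helps a wrong answer but costs a right one, which is exactly the "expand when stuck, condense when done" coaching the analogies describe.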
Why This Matters
The paper tested T2T on strong math-reasoning models (such as Qwen and DeepSeek models) using hard math competition benchmarks (such as AIME and AMC).
- Better Results: The T2T models solved more problems correctly than standard models.
- Smarter Exploration: When stuck, they didn't give up; they "thickened" their thinking to find a new path.
- Efficiency: Once they knew the answer, they "thinned" their response, saving time and computing power.
The Big Picture
Think of T2T as a smart coach rather than a strict judge.
- A strict judge just says "Right or Wrong."
- A smart coach says, "When you're stuck, expand your thinking. When you've got it, condense your knowledge."
By teaching the AI to know when to be verbose and when to be brief, T2T helps the model learn faster, solve harder problems, and become a more reliable reasoning partner. It turns the chaotic process of learning into a structured journey from exploration to mastery.