Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning

The paper introduces NAT (Not All Tokens Are Needed), a token-efficient reinforcement learning framework that updates the policy on only a subset of generated tokens. By using unbiased partial-token gradient estimation via Horvitz-Thompson reweighting, it matches full-sequence performance at significantly reduced compute and memory cost.

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang

Published 2026-03-10

Here is an explanation of the paper "Not All Tokens Are Needed" (NAT) using simple language and creative analogies.

The Big Problem: The "Over-Worked" AI Teacher

Imagine you are training a brilliant student (the AI) to solve complex math problems. The student writes out their entire thought process step-by-step on a giant whiteboard (this is called a "Chain of Thought").

In traditional Reinforcement Learning (RL), the teacher (the training algorithm) does something very strict: they read every single word the student wrote, from the first letter to the last, and grade each one.

  • The Issue: As these problems get harder, the students write longer and longer essays. Reading and grading every single word takes a massive amount of time and energy (computing power).
  • The Bottleneck: Even if the student writes the answer quickly, the teacher gets stuck in the "grading phase." The teacher is so busy re-reading the boring parts (like "let's call this variable x" or "now we add 2 to both sides") that they can't move on to the next student. This makes training slow, expensive, and sometimes causes the computer to run out of memory (like a teacher trying to hold too many papers at once).

The paper asks a simple question: Do we really need to grade every single word to teach the student effectively?
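
To make "grading every word" concrete, here is a minimal sketch (a toy illustration, not the paper's exact objective) of a REINFORCE-style surrogate loss in which every generated token contributes to the update, so the backward pass must touch the entire chain of thought:

```python
def full_sequence_loss(token_logprobs, reward):
    """'Grade every word': scale every token's log-probability by the
    sequence reward, so backprop visits all generated tokens."""
    return -reward * sum(token_logprobs)

# Toy chain of thought: one log-probability per generated token
logprobs = [-0.5, -1.2, -0.3, -0.9, -0.4, -0.7]
loss = full_sequence_loss(logprobs, reward=1.0)  # work grows linearly with length
```

The cost of this step scales with the length of the response, which is exactly the bottleneck the paper targets.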

The Solution: The "Smart Grader" (NAT)

The authors introduce a new method called NAT (Not All Tokens Are Needed). Instead of reading the whole essay, the teacher uses a clever trick to grade only a random selection of sentences while still knowing if the whole essay was good or bad.

Here is how it works, broken down into two main strategies:

1. The "Scatter Shot" Method (URS)

Imagine the teacher takes the student's essay and randomly circles words with a red pen.

  • They might circle the first word, skip the next ten, circle the 15th, skip the next five, etc.
  • The Catch: If they skip a word, they have to give extra weight to the words they did circle, so the final grade stays fair. (This is called Horvitz-Thompson reweighting: a fancy way of saying "if we ignore some words, we must weight the ones we keep more heavily so the total score stays accurate.")
  • The Result: The teacher saves time on grading (backpropagation), but they still have to read the whole essay to find the words to circle, so they still use a lot of memory.
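
The "bonus point" bookkeeping can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each per-token loss is graded with probability `keep_prob` and reweighted by `1 / keep_prob`, which keeps the estimate of the full-sequence loss unbiased:

```python
import random

def ht_token_loss(per_token_losses, keep_prob):
    """Horvitz-Thompson estimate of the full-sequence loss:
    grade each token with probability keep_prob, and reweight the
    graded ones by 1/keep_prob so the estimate stays unbiased."""
    total = 0.0
    for loss in per_token_losses:
        if random.random() < keep_prob:   # circle this word?
            total += loss / keep_prob     # the "bonus point" reweighting
    return total

random.seed(0)
losses = [0.5, 1.2, 0.3, 0.9, 0.4, 0.7]  # toy per-token losses
full = sum(losses)                        # grading everything
trials = 20000
estimate = sum(ht_token_loss(losses, 0.5) for _ in range(trials)) / trials
# averaged over many trials, the subsampled estimate matches the full sum
```

Any single trial only touches about half the tokens, but on average the estimator equals the full-sequence loss, which is why the gradient stays fair.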

2. The "Cut the End" Method (RPC) - The Star of the Show

This is the paper's best idea. Instead of circling random words, the teacher decides to only read the first half of the essay and ignore the rest.

  • The Trick: To make this fair, the teacher doesn't always cut off the exact same spot. Sometimes they read the first 40%, sometimes 60%, sometimes 55%. It's a random cut.
  • Why it's genius:
    1. Saves Reading Time: The teacher literally stops reading halfway through. They don't even look at the end.
    2. Saves Memory: Because they stop reading early, they don't need to hold the whole essay in their head.
    3. Stays Fair: Because the cut point is random, over thousands of students, the teacher sees every part of the essay eventually. The "bonus point" math (reweighting) ensures the final grade is just as accurate as if they had read the whole thing.
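
Here is a toy sketch of the prefix-cut idea (illustrative only; the paper's cut distribution and exact objective may differ): sample a random cut fraction, grade only that prefix, and reweight each graded token by the probability that a random cut reaches it:

```python
import random

def prefix_keep(frac, T):
    """Number of tokens inside a prefix covering fraction `frac`."""
    return max(1, round(frac * T))

def rpc_token_loss(per_token_losses, cut_fracs):
    """Random-prefix-cut sketch: grade only a random prefix, and
    reweight each graded token by 1 / P(token lands inside the prefix)
    so the estimate of the full-sequence loss stays unbiased."""
    T = len(per_token_losses)
    keep = prefix_keep(random.choice(cut_fracs), T)  # random cut point
    total = 0.0
    for t in range(keep):
        # inclusion probability: fraction of possible cuts that reach token t
        p_inc = sum(1 for f in cut_fracs if prefix_keep(f, T) > t) / len(cut_fracs)
        total += per_token_losses[t] / p_inc         # the reweighting
    return total

random.seed(0)
losses = [0.5, 1.2, 0.3, 0.9, 0.4, 0.7]  # toy per-token losses
# including 1.0 means every token has a nonzero chance of being graded
fracs = [0.4, 0.6, 1.0]
trials = 30000
estimate = sum(rpc_token_loss(losses, fracs) for _ in range(trials)) / trials
```

Note that the occasional full read (the 1.0 fraction) is what keeps the late tokens reachable; if no cut ever reached the end of the essay, the estimate would be biased against it.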

The Real-World Results

The researchers tested this on AI models (like Qwen3) solving hard math problems. Here is what happened:

  • Performance: The AI trained with the "Cut the End" method (RPC) learned just as well as the AI that read every single word. Their math scores were identical.
  • Speed: The training process became 29% faster.
  • Memory: The computer needed 18% less memory (RAM). This is huge because it means you can train bigger models on cheaper computers, or train them much faster on the same computers.

The Analogy: Learning to Drive

Imagine you are learning to drive a car.

  • Old Way: Your instructor sits in the passenger seat and critiques every single second of your drive. "You turned the wheel 2 degrees too much here," "You pressed the gas too hard there," "You blinked too slowly." It's exhausting for the instructor, and you can't drive many laps a day.
  • NAT Way: The instructor says, "I'm going to watch only the first part of your drive, and I'll pick the cutoff at random. If you crash at the end, I'll know you were driving poorly, but I'll only give you detailed feedback on the part I watched."
  • The Outcome: You still learn to drive perfectly because the feedback you do get is high-quality and statistically fair. But the instructor gets to watch 100 students a day instead of 50, because they aren't staring at the rearview mirror for the whole hour.

Why This Matters

This paper proves that efficiency doesn't have to mean sacrificing intelligence.

By realizing that not every "token" (word) in an AI's thought process is equally important for the learning step, the authors found a way to substantially cut the cost of training AI without making the AI "dumber." It's like finding a way to build a skyscraper with fewer bricks but a smarter blueprint, so the building is just as strong.

In short: We don't need to read the whole book to learn the lesson. Sometimes, reading a random chapter is enough to understand the story, and it saves us a lot of time.