Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning

The paper introduces NAT (Not All Tokens Are Needed), a token-efficient reinforcement learning framework that updates the policy on only a subset of generated tokens. By using unbiased partial-token gradient estimation via Horvitz-Thompson reweighting, it matches full-sequence performance at significantly reduced compute and memory cost.

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang

Published 2026-03-10

Here is an explanation of the paper "Not All Tokens Are Needed" (NAT) using simple language and creative analogies.

The Big Problem: The "Over-Worked" AI Teacher

Imagine you are training a brilliant student (the AI) to solve complex math problems. The student writes out their entire thought process step-by-step on a giant whiteboard (this is called a "Chain of Thought").

In traditional Reinforcement Learning (RL), the teacher (the training algorithm) does something very strict: they read every single word the student wrote, from the first letter to the last, and grade each one.

  • The Issue: As these problems get harder, the students write longer and longer essays. Reading and grading every single word takes a massive amount of time and energy (computing power).
  • The Bottleneck: Even if the student writes the answer quickly, the teacher gets stuck in the "grading phase." The teacher is so busy re-reading the boring parts (like "let's call this variable x" or "now we add 2 to both sides") that they can't move on to the next student. This makes training slow, expensive, and sometimes causes the computer to run out of memory (like a teacher trying to hold too many papers at once).

The paper asks a simple question: Do we really need to grade every single word to teach the student effectively?
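
To make "grading every word" concrete, here is a minimal sketch (a toy illustration, not the paper's exact objective) of a REINFORCE-style surrogate loss in which every generated token contributes to the update, so the backward pass must touch the entire chain of thought:

```python
def full_sequence_loss(token_logprobs, reward):
    """'Grade every word': scale every token's log-probability by the
    sequence reward, so backprop visits all generated tokens."""
    return -reward * sum(token_logprobs)

# Toy chain of thought: one log-probability per generated token
logprobs = [-0.5, -1.2, -0.3, -0.9, -0.4, -0.7]
loss = full_sequence_loss(logprobs, reward=1.0)  # work grows linearly with length
```

The cost of this step scales with the length of the response, which is exactly the bottleneck the paper targets.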

The Solution: The "Smart Grader" (NAT)

The authors introduce a new method called NAT (Not All Tokens Are Needed). Instead of reading the whole essay, the teacher uses a clever trick to grade only a random selection of sentences while still knowing if the whole essay was good or bad.

Here is how it works, broken down into two main strategies:

1. The "Scatter Shot" Method (URS)

Imagine the teacher takes the student's essay and randomly circles words with a red pen.

  • They might circle the first word, skip the next ten, circle the 15th, skip the next five, etc.
  • The Catch: If they skip a word, they have to give extra weight to the words they did circle, so the final grade stays fair. (This is called Horvitz-Thompson reweighting: a fancy way of saying "if we ignore some words, we must weight the ones we keep more heavily so the total score stays accurate.")
  • The Result: The teacher saves time on grading (backpropagation), but they still have to read the whole essay to find the words to circle, so they still use a lot of memory.
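
The "bonus point" bookkeeping can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each per-token loss is graded with probability `keep_prob` and reweighted by `1 / keep_prob`, which keeps the estimate of the full-sequence loss unbiased:

```python
import random

def ht_token_loss(per_token_losses, keep_prob):
    """Horvitz-Thompson estimate of the full-sequence loss:
    grade each token with probability keep_prob, and reweight the
    graded ones by 1/keep_prob so the estimate stays unbiased."""
    total = 0.0
    for loss in per_token_losses:
        if random.random() < keep_prob:   # circle this word?
            total += loss / keep_prob     # the "bonus point" reweighting
    return total

random.seed(0)
losses = [0.5, 1.2, 0.3, 0.9, 0.4, 0.7]  # toy per-token losses
full = sum(losses)                        # grading everything
trials = 20000
estimate = sum(ht_token_loss(losses, 0.5) for _ in range(trials)) / trials
# averaged over many trials, the subsampled estimate matches the full sum
```

Any single trial only touches about half the tokens, but on average the estimator equals the full-sequence loss, which is why the gradient stays fair.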

2. The "Cut the End" Method (RPC) - The Star of the Show

This is the paper's best idea. Instead of circling random words, the teacher decides to only read the first half of the essay and ignore the rest.

  • The Trick: To make this fair, the teacher doesn't always cut off the exact same spot. Sometimes they read the first 40%, sometimes 60%, sometimes 55%. It's a random cut.
  • Why it's genius:
    1. Saves Reading Time: The teacher literally stops reading halfway through. They don't even look at the end.
    2. Saves Memory: Because they stop reading early, they don't need to hold the whole essay in their head.
    3. Stays Fair: Because the cut point is random, over thousands of students, the teacher sees every part of the essay eventually. The "bonus point" math (reweighting) ensures the final grade is just as accurate as if they had read the whole thing.
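
Here is a toy sketch of the prefix-cut idea (illustrative only; the paper's cut distribution and exact objective may differ): sample a random cut fraction, grade only that prefix, and reweight each graded token by the probability that a random cut reaches it:

```python
import random

def prefix_keep(frac, T):
    """Number of tokens inside a prefix covering fraction `frac`."""
    return max(1, round(frac * T))

def rpc_token_loss(per_token_losses, cut_fracs):
    """Random-prefix-cut sketch: grade only a random prefix, and
    reweight each graded token by 1 / P(token lands inside the prefix)
    so the estimate of the full-sequence loss stays unbiased."""
    T = len(per_token_losses)
    keep = prefix_keep(random.choice(cut_fracs), T)  # random cut point
    total = 0.0
    for t in range(keep):
        # inclusion probability: fraction of possible cuts that reach token t
        p_inc = sum(1 for f in cut_fracs if prefix_keep(f, T) > t) / len(cut_fracs)
        total += per_token_losses[t] / p_inc         # the reweighting
    return total

random.seed(0)
losses = [0.5, 1.2, 0.3, 0.9, 0.4, 0.7]  # toy per-token losses
# including 1.0 means every token has a nonzero chance of being graded
fracs = [0.4, 0.6, 1.0]
trials = 30000
estimate = sum(rpc_token_loss(losses, fracs) for _ in range(trials)) / trials
```

Note that the occasional full read (the 1.0 fraction) is what keeps the late tokens reachable; if no cut ever reached the end of the essay, the estimate would be biased against it.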

The Real-World Results

The researchers tested this on AI models (like Qwen3) solving hard math problems. Here is what happened:

  • Performance: The AI trained with the "Cut the End" method (RPC) learned just as well as the AI that read every single word. Their math scores were identical.
  • Speed: The training process became 29% faster.
  • Memory: The computer needed 18% less memory (RAM). This is huge because it means you can train bigger models on cheaper computers, or train them much faster on the same computers.

The Analogy: Learning to Drive

Imagine you are learning to drive a car.

  • Old Way: Your instructor sits in the passenger seat and critiques every single second of your drive. "You turned the wheel 2 degrees too much here," "You pressed the gas too hard there," "You blinked too slowly." It's exhausting for the instructor, and you can't drive many laps a day.
  • NAT Way: The instructor says, "I'm going to watch only the first part of your drive, and I'll pick the cutoff at random. If you crash at the end, I'll know you were driving poorly, but I'll only give you detailed feedback on the part I watched."
  • The Outcome: You still learn to drive perfectly because the feedback you do get is high-quality and statistically fair. But the instructor gets to watch 100 students a day instead of 50, because they aren't staring at the rearview mirror for the whole hour.

Why This Matters

This paper proves that efficiency doesn't have to mean sacrificing intelligence.

By realizing that not every "token" (word) in an AI's thought process is equally important for the learning step, the authors found a way to substantially cut the cost of training AI without making the AI "dumber." It's like finding a way to build a skyscraper with fewer bricks but a smarter blueprint, so the building is just as strong.

In short: We don't need to read the whole book to learn the lesson. Sometimes, reading a random chapter is enough to understand the story, and it saves us a lot of time.