Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

This paper introduces Sparsity Forcing, a reinforcement learning-based post-training framework that explicitly optimizes multimodal large language models for significantly higher token reduction ratios (up to 75%) with minimal accuracy loss. It does so by treating efficiency and correctness as joint rewards during inference-consistent rollouts.

Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu

Published 2026-03-02

Imagine you have a brilliant but very chatty assistant (a Multimodal Large Language Model, or MLLM) who helps you solve problems by looking at pictures and videos. This assistant is incredibly smart, but they have a bad habit: when you show them a 10-minute video or a high-resolution photo, they try to read every single word and look at every single pixel before answering.

This is like asking a librarian to read every book in the library to find the one page that mentions "cats." It takes forever, costs a fortune in electricity, and fills up the librarian's desk (memory) until it collapses.

This paper introduces a new training method called Sparsity Forcing. Here is how it works, in simple terms:

The Problem: The "Over-Attentive" Assistant

Current AI models are great, but they are inefficient. They process too much "junk" data.

  • Existing methods try to be smart by saying, "Hey, let's only look at the top 50% of the most important words."
  • The issue: They stop there. They are too afraid to cut deeper. If you ask them to cut 80% of the data, they get confused and give wrong answers because they are just guessing which words to keep based on old habits.

The Solution: Sparsity Forcing (The "Strict Coach")

The authors created a new training technique using Reinforcement Learning (think of it as training a dog with treats and gentle corrections).

Here is the analogy:

1. The "Rollout" Game (The Simulation)
Instead of just teaching the model to be efficient once, the researchers play a game with the model.

  • They ask the model a question about a video.
  • They tell the model: "Okay, try to answer this, but this time, you are only allowed to look at 90% of the video."
  • Then they say: "Try again, but this time, only 50%."
  • Then: "Only 20%!"
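The budget-limited rollouts above boil down to keeping only a fraction of the visual tokens each time. A minimal sketch of that idea, assuming the model assigns each token an importance score (the helper name `keep_top_tokens` and the score-based selection rule are illustrative, not the paper's actual mechanism):

```python
def keep_top_tokens(tokens, scores, keep_ratio):
    """Keep only the highest-scoring fraction of tokens, mimicking a
    rollout under a token budget (e.g. keep_ratio=0.2 means 'look at 20%')."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Keep the top-k, but restore their original order in the sequence.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]
```

Running the same question through the model at several `keep_ratio` values (0.9, 0.5, 0.2) produces the group of rollouts the training game compares.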

2. The Reward System (The Scorecard)
The model gets points based on two things:

  • Did you get the answer right? (Performance)
  • How little did you look at? (Efficiency)

If the model gets the answer right while only looking at 20% of the video, it gets a huge reward.
If it gets the answer right but looked at 90% of the video, it gets a small reward (because it was wasteful).
If it looked at 20% but got the answer wrong, it gets no points (because being fast but wrong is useless).

3. The "Group" Comparison (The Team Huddle)
The model doesn't just learn from one try. It tries many different "budgets" (20%, 50%, 90%) in a single session. The system then compares them:

  • "Hey, the version that looked at 20% got it right! That's the winner!"
  • "The version that looked at 90% also got it right, but it wasted time. You lose points for that."

Over time, the model learns: "Oh, I don't need to read the whole script to know the plot. I can skip the boring parts and still get an A+."
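The team-huddle comparison resembles group-relative advantage estimation (as in GRPO-style RL): each rollout is scored against the average of its own group, so the 20%-budget winner gets a positive learning signal and the wasteful 90%-budget run a negative one. A minimal sketch, assuming per-rollout rewards have already been computed (the helper name `group_advantages` is illustrative):

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Score each rollout relative to its group: above-average rollouts
    get positive advantages, below-average ones get negative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid dividing by zero if all tie
    return [(r - mu) / sigma for r in rewards]
```

For example, rewards of [1.8, 1.1, 0.0] (right at 20%, right at 90%, wrong) yield advantages that rank the frugal correct answer highest and the wrong answer lowest, which is exactly the lesson the model is meant to internalize.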

Why This is a Big Deal

  • It's Flexible: Unlike other methods that have a rigid rule (like "always skip every 3rd word"), this method learns to adapt. It knows that for a complex math problem, it needs to read more, but for a simple "what color is the car?" question, it can skip almost everything.
  • It's Safe: The model is anchored to a "reference" model (a standard, unmodified copy of itself), so it doesn't drift too far and start hallucinating. It learns to be efficient without losing its smarts.
  • The Results:
    • They managed to cut the amount of data the model reads by 75% (from 100% down to 25%).
    • The model became 3.3 times faster.
    • It used 3 times less memory.
    • And the accuracy? It barely dropped at all.

The Bottom Line

Sparsity Forcing is like hiring a strict coach who forces your AI assistant to learn how to be a "speed reader." Instead of reading every word of a novel, the AI learns to skim the most important sentences, find the answer, and stop. It saves time, saves money, and still gets the job done with almost no loss in accuracy.

This is a massive step forward for making AI run faster on regular computers and phones, rather than needing massive supercomputers to process a simple video.
