Why Reinforcement Fine-Tuning Enables MLLMs to Preserve Prior Knowledge Better: A Data Perspective

This paper demonstrates that Reinforcement Fine-Tuning (RFT) outperforms Supervised Fine-Tuning (SFT) at preserving prior knowledge in multimodal large language models. RFT trains on data with smaller influence magnitudes that is better aligned with the base model's probability landscape, which mitigates catastrophic forgetting while still enabling effective task adaptation.

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen

Published 2026-03-06

Here is an explanation of the paper "Why Reinforcement Fine-Tuning Preserves Prior Knowledge Better: A Data Perspective" using simple language and creative analogies.

The Big Picture: Learning a New Trick Without Forgetting the Old Ones

Imagine you have a brilliant, well-read librarian (the AI model) who knows everything about history, science, and art. You want to teach this librarian a brand-new skill: solving a jigsaw puzzle.

The paper asks a critical question: How do we teach this new skill without making the librarian forget everything they already know?

The researchers compared two ways of teaching:

  1. SFT (Supervised Fine-Tuning): The "Teacher-Student" method. You give the librarian the puzzle pieces and the exact answer key. "Put piece 1 here, piece 2 there."
  2. RFT (Reinforcement Fine-Tuning): The "Trial-and-Error" method. You let the librarian try to solve the puzzle, and you only give them a "Good job!" (reward) if they get it right, or a "Try again" if they fail.
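The two teaching styles above can be sketched as update rules. The following toy illustration is my own (not the paper's setup): the "model" is a single probability p = sigmoid(theta) of solving the puzzle. SFT pushes toward the given answer key on every step, while RFT (here, a REINFORCE-style update) only learns from the model's own sampled attempts.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sft_step(theta, lr=1.0):
    # Teacher-student: ascend log p(correct answer) on every example
    p = sigmoid(theta)
    return theta + lr * (1.0 - p)  # d/dtheta of log sigmoid(theta)

def rft_step(theta, lr=1.0, rng=random):
    # Trial and error: sample an attempt; reward is 1 only if it is correct
    p = sigmoid(theta)
    attempt = 1 if rng.random() < p else 0
    reward = attempt
    # REINFORCE: reward * d/dtheta log p(attempt)
    grad = (1.0 - p) if attempt == 1 else -p
    return theta + lr * reward * grad

random.seed(0)
theta_sft = theta_rft = -2.0  # the model starts out bad at the new task
for _ in range(50):
    theta_sft = sft_step(theta_sft)
    theta_rft = rft_step(theta_rft)
# SFT improves on every step; RFT moves only on its own successful rollouts,
# which is slower but never leaves the model's own sample distribution.
```

The asymmetry in speed shows up immediately: the SFT learner climbs on every step, while the RFT learner wastes early steps on failed attempts, exactly the "fast but dangerous" vs "slow but safe" trade-off described below.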

The Shocking Discovery

The researchers found a sharp trade-off:

  • SFT is fast but dangerous. The librarian learns the puzzle quickly (in just a few hours), but immediately after, they start forgetting how to read history books or identify famous paintings. It's like they are so focused on the new puzzle that their brain "wipes" the old library shelves.
  • RFT is slow but safe. The librarian takes a long time to figure out the puzzle (weeks of practice), but once they learn it, they still remember everything about history and art. Their old knowledge remains intact.

Why does this happen? The paper argues it's not because RFT is a "smarter" algorithm. It's because of what data the librarian is studying.


The "Data Perspective": The Library Analogy

To understand why RFT is safer, the researchers looked at the "data" (the study materials) used in both methods.

1. The "Foreign Language" Problem (SFT)

When using SFT, humans often write out the answers for the AI.

  • The Analogy: Imagine the librarian is trying to learn a new language. If you force them to memorize a dictionary written in a completely foreign, complex dialect they've never heard before, their brain gets confused. To make room for this new, confusing information, their brain accidentally deletes the old, familiar stories they know.
  • In the paper: The "Non-Reasoning" data (just the answer) is like that foreign dialect. It creates a huge "shock" to the system, causing Catastrophic Forgetting.

2. The "Native Speaker" Advantage (RFT)

When using RFT, the AI generates its own attempts (rollouts) and learns from its own mistakes.

  • The Analogy: Instead of forcing a foreign dialect on the librarian, you let them practice speaking in their own natural voice. They might stumble at first, but the words they use are already part of their vocabulary. Because the new puzzle pieces are being assembled using words and logic the librarian already understands, their brain doesn't feel the need to delete old memories to make space.
  • In the paper: The data generated by RFT stays in a "low perplexity" zone. This means the AI is learning from examples that feel natural and familiar to its existing knowledge, rather than jarring, alien examples.
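Perplexity itself is easy to compute: it is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch (the token probabilities below are made-up numbers for illustration, not real model output):

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-probability per token;
    # lower perplexity = the sequence feels more "natural" to the model
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# The model's own rollout: every token feels familiar (high probability)
own_rollout = [0.9, 0.8, 0.85, 0.9]
# A human-written answer in an unfamiliar style (low-probability tokens)
foreign_answer = [0.2, 0.1, 0.15, 0.05]

assert perplexity(own_rollout) < perplexity(foreign_answer)
```

This is the "native speaker" effect in numbers: the model's own rollouts sit in the low-perplexity zone, while externally written answers can be far outside it.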

The "Jigsaw Puzzle" Test

Why use jigsaw puzzles?

  • Imagine asking a super-smart person who has read every book in the world to solve a 3x3 jigsaw puzzle. They would likely fail because no one has ever taught them how to do that specific puzzle before.
  • The researchers used this as a "clean slate" test. Since the AI has zero prior knowledge of jigsaw puzzles, any learning is brand new. This allowed them to see clearly: Did learning this new thing make the AI forget its old knowledge?

The "Magic" Solution: Copying the Winner

The most exciting part of the paper is a "hack" they discovered.

They realized that the RFT method works so well because it generates high-quality, natural-looking practice data. So, they tried this:

  1. Let the AI practice with RFT for a little while (just enough to get some good examples).
  2. Take those examples (the AI's own "thinking process" and correct answers).
  3. Use those examples to train the AI using the fast SFT method.
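The three steps above amount to a simple data pipeline. Here is a hedged sketch of that recipe (the function names are my own, not the paper's API): run rollouts briefly, keep only the correct self-generated traces, and reuse them as an SFT dataset.

```python
import random

def rollout(model_solve, problem, rng):
    # The model attempts the problem in its own words (a "rollout")
    reasoning, answer = model_solve(problem, rng)
    return {"problem": problem, "reasoning": reasoning, "answer": answer}

def collect_sft_data(model_solve, problems, check, rng, tries=8):
    # Keep the first correct rollout per problem (reward == 1: a "winner")
    dataset = []
    for problem in problems:
        for _ in range(tries):
            trace = rollout(model_solve, problem, rng)
            if check(problem, trace["answer"]):
                dataset.append(trace)
                break
    return dataset

# Toy stand-ins for a real model and verifier
def toy_solve(problem, rng):
    guess = rng.choice([problem * 2, problem * 2 + 1])
    return f"I think {problem} doubled is {guess}", guess

rng = random.Random(0)
data = collect_sft_data(toy_solve, [1, 2, 3], lambda p, a: a == p * 2, rng)
# `data` now holds the model's own correct reasoning traces, ready to be
# fed to a standard SFT trainer
```

Because every kept trace was written by the model itself, the resulting SFT dataset stays in the model's low-perplexity zone while still being fast to train on.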

The Result: The AI learned the puzzle almost as fast as standard SFT, but it didn't forget its old knowledge!

The Takeaway for Everyone

The paper teaches us a valuable lesson about learning, not just for computers, but for us too:

  • Don't just memorize the answer. If you force yourself to memorize a solution without understanding the "why" (the reasoning), you might forget what you already knew.
  • Learn from your own mistakes. When you try to solve a problem yourself, even if you fail, the process of thinking through it keeps your brain connected to what you already know.
  • Data matters more than the method. It doesn't matter if you use a "smart" algorithm or a "dumb" one; if the material you are studying feels natural and aligned with what you already know, you will learn better without losing your past.

In short: Reinforcement Fine-Tuning (RFT) is better at preserving knowledge because it forces the AI to learn using its own "voice" and logic, rather than forcing it to memorize alien answers that clash with its existing brain.