Why Reinforcement Fine-Tuning Enables MLLMs to Preserve Prior Knowledge Better: A Data Perspective

This paper demonstrates that Reinforcement Fine-Tuning (RFT) outperforms Supervised Fine-Tuning (SFT) at preserving prior knowledge in multimodal large language models. RFT trains on data with smaller influence magnitudes that is better aligned with the base model's probability landscape, which mitigates catastrophic forgetting while still enabling effective task adaptation.

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen

Published 2026-03-06

Here is an explanation of the paper "Why Reinforcement Fine-Tuning Preserves Prior Knowledge Better: A Data Perspective" using simple language and creative analogies.

The Big Picture: Learning a New Trick Without Forgetting the Old Ones

Imagine you have a brilliant, well-read librarian (the AI model) who knows everything about history, science, and art. You want to teach this librarian a brand-new skill: solving a jigsaw puzzle.

The paper asks a critical question: How do we teach this new skill without making the librarian forget everything they already know?

The researchers compared two ways of teaching:

  1. SFT (Supervised Fine-Tuning): The "Teacher-Student" method. You give the librarian the puzzle pieces and the exact answer key. "Put piece 1 here, piece 2 there."
  2. RFT (Reinforcement Fine-Tuning): The "Trial-and-Error" method. You let the librarian try to solve the puzzle, and you only give them a "Good job!" (reward) if they get it right, or a "Try again" if they fail.
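The two teaching styles above can be sketched as update rules. The following toy illustration is my own (not the paper's setup): the "model" is a single probability p = sigmoid(theta) of solving the puzzle. SFT pushes toward the given answer key on every step, while RFT (here, a REINFORCE-style update) only learns from the model's own sampled attempts.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sft_step(theta, lr=1.0):
    # Teacher-student: ascend log p(correct answer) on every example
    p = sigmoid(theta)
    return theta + lr * (1.0 - p)  # d/dtheta of log sigmoid(theta)

def rft_step(theta, lr=1.0, rng=random):
    # Trial and error: sample an attempt; reward is 1 only if it is correct
    p = sigmoid(theta)
    attempt = 1 if rng.random() < p else 0
    reward = attempt
    # REINFORCE: reward * d/dtheta log p(attempt)
    grad = (1.0 - p) if attempt == 1 else -p
    return theta + lr * reward * grad

random.seed(0)
theta_sft = theta_rft = -2.0  # the model starts out bad at the new task
for _ in range(50):
    theta_sft = sft_step(theta_sft)
    theta_rft = rft_step(theta_rft)
# SFT improves on every step; RFT moves only on its own successful rollouts,
# which is slower but never leaves the model's own sample distribution.
```

The asymmetry in speed shows up immediately: the SFT learner climbs on every step, while the RFT learner wastes early steps on failed attempts, exactly the "fast but dangerous" vs "slow but safe" trade-off described below.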

The Shocking Discovery

The researchers found a sharp trade-off:

  • SFT is fast but dangerous. The librarian learns the puzzle quickly (in just a few hours), but immediately after, they start forgetting how to read history books or identify famous paintings. It's like they are so focused on the new puzzle that their brain "wipes" the old library shelves.
  • RFT is slow but safe. The librarian takes a long time to figure out the puzzle (weeks of practice), but once they learn it, they still remember everything about history and art. Their old knowledge remains intact.

Why does this happen? The paper argues it's not because RFT is a "smarter" algorithm. It's because of what data the librarian is studying.


The "Data Perspective": The Library Analogy

To understand why RFT is safer, the researchers looked at the "data" (the study materials) used in both methods.

1. The "Foreign Language" Problem (SFT)

When using SFT, humans often write out the answers for the AI.

  • The Analogy: Imagine the librarian is trying to learn a new language. If you force them to memorize a dictionary written in a completely foreign, complex dialect they've never heard before, their brain gets confused. To make room for this new, confusing information, their brain accidentally deletes the old, familiar stories they know.
  • In the paper: The "Non-Reasoning" data (just the answer) is like that foreign dialect. It creates a huge "shock" to the system, causing Catastrophic Forgetting.

2. The "Native Speaker" Advantage (RFT)

When using RFT, the AI generates its own attempts (rollouts) and learns from its own mistakes.

  • The Analogy: Instead of forcing a foreign dialect on the librarian, you let them practice speaking in their own natural voice. They might stumble at first, but the words they use are already part of their vocabulary. Because the new puzzle pieces are being assembled using words and logic the librarian already understands, their brain doesn't feel the need to delete old memories to make space.
  • In the paper: The data generated by RFT stays in a "low perplexity" zone. This means the AI is learning from examples that feel natural and familiar to its existing knowledge, rather than jarring, alien examples.
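Perplexity itself is easy to compute: it is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch (the token probabilities below are made-up numbers for illustration, not real model output):

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-probability per token;
    # lower perplexity = the sequence feels more "natural" to the model
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# The model's own rollout: every token feels familiar (high probability)
own_rollout = [0.9, 0.8, 0.85, 0.9]
# A human-written answer in an unfamiliar style (low-probability tokens)
foreign_answer = [0.2, 0.1, 0.15, 0.05]

assert perplexity(own_rollout) < perplexity(foreign_answer)
```

This is the "native speaker" effect in numbers: the model's own rollouts sit in the low-perplexity zone, while externally written answers can be far outside it.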

The "Jigsaw Puzzle" Test

Why use jigsaw puzzles?

  • Imagine asking a super-smart person who has read every book in the world to solve a 3x3 jigsaw puzzle. They would likely fail because no one has ever taught them how to do that specific puzzle before.
  • The researchers used this as a "clean slate" test. Since the AI has zero prior knowledge of jigsaw puzzles, any learning is brand new. This allowed them to see clearly: Did learning this new thing make the AI forget its old knowledge?

The "Magic" Solution: Copying the Winner

The most exciting part of the paper is a "hack" they discovered.

They realized that the RFT method works so well because it generates high-quality, natural-looking practice data. So, they tried this:

  1. Let the AI practice with RFT for a little while (just enough to get some good examples).
  2. Take those examples (the AI's own "thinking process" and correct answers).
  3. Use those examples to train the AI using the fast SFT method.
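The three steps above amount to a simple data pipeline. Here is a hedged sketch of that recipe (the function names are my own, not the paper's API): run rollouts briefly, keep only the correct self-generated traces, and reuse them as an SFT dataset.

```python
import random

def rollout(model_solve, problem, rng):
    # The model attempts the problem in its own words (a "rollout")
    reasoning, answer = model_solve(problem, rng)
    return {"problem": problem, "reasoning": reasoning, "answer": answer}

def collect_sft_data(model_solve, problems, check, rng, tries=8):
    # Keep the first correct rollout per problem (reward == 1: a "winner")
    dataset = []
    for problem in problems:
        for _ in range(tries):
            trace = rollout(model_solve, problem, rng)
            if check(problem, trace["answer"]):
                dataset.append(trace)
                break
    return dataset

# Toy stand-ins for a real model and verifier
def toy_solve(problem, rng):
    guess = rng.choice([problem * 2, problem * 2 + 1])
    return f"I think {problem} doubled is {guess}", guess

rng = random.Random(0)
data = collect_sft_data(toy_solve, [1, 2, 3], lambda p, a: a == p * 2, rng)
# `data` now holds the model's own correct reasoning traces, ready to be
# fed to a standard SFT trainer
```

Because every kept trace was written by the model itself, the resulting SFT dataset stays in the model's low-perplexity zone while still being fast to train on.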

The Result: The AI learned the puzzle almost as fast as standard SFT, but it didn't forget its old knowledge!

The Takeaway for Everyone

The paper teaches us a valuable lesson about learning, not just for computers, but for us too:

  • Don't just memorize the answer. If you force yourself to memorize a solution without understanding the "why" (the reasoning), you might forget what you already knew.
  • Learn from your own mistakes. When you try to solve a problem yourself, even if you fail, the process of thinking through it keeps your brain connected to what you already know.
  • Data matters more than the method. It doesn't matter if you use a "smart" algorithm or a "dumb" one; if the material you are studying feels natural and aligned with what you already know, you will learn better without losing your past.

In short: Reinforcement Fine-Tuning (RFT) is better at preserving knowledge because it forces the AI to learn using its own "voice" and logic, rather than forcing it to memorize alien answers that clash with its existing brain.