When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

This paper presents a controlled study of medical Vision-Language Models showing that reinforcement learning mainly sharpens output distributions and improves sampling efficiency, and that it does so only after supervised fine-tuning has established non-trivial reasoning support. These findings motivate a boundary-aware training recipe that achieves strong performance across medical benchmarks.

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati

Published 2026-03-03

The Big Question: Is Reinforcement Learning (RL) Magic?

Imagine you are training a brilliant medical student (an AI) to diagnose diseases from X-rays and lab reports. You have three tools to teach them:

  1. Vision: Showing them thousands of pictures so they learn what a broken bone or a tumor looks like.
  2. Supervised Fine-Tuning (SFT): Giving them a textbook of "Question and Answer" pairs so they learn the right way to talk to doctors.
  3. Reinforcement Learning (RL): A "game" where the student gets a gold star for a correct answer and a red X for a wrong one, encouraging them to figure out the best way to think.

The big question this paper asks is: Does RL actually teach the student new medical knowledge, or does it just help them pick the right answer faster when they already know it?

The authors found that RL is not a magic wand that creates new knowledge. Instead, it's more like a polishing tool.


The Three Key Discoveries (The Story of the Student)

1. The Vision Check: "Can they actually see?"

First, the researchers checked if the AI could actually see the medical images.

  • The Analogy: Imagine giving the student a pair of glasses. If the glasses are blurry, no amount of studying will help them read the fine print.
  • The Finding: The base AI (the student before special training) already has decent "glasses." It can see the images well enough. However, RL does not improve the glasses. If the AI can't see the disease in the picture, RL won't fix that. Only better vision training (SFT) can fix blurry vision.

2. The "Pass@K" Secret: "Do they know the answer but say the wrong one?"

This is the most important part of the paper. The researchers looked at two ways to measure success:

  • Accuracy@1 (Greedy): The student gives a single best first guess (greedy decoding).

  • Pass@K (Latent Support): The student samples K different answers (the paper uses K = 8) and gets credit if any one of them is correct.

  • The Analogy: Imagine a student taking a multiple-choice test.

    • Scenario A: The student doesn't know the answer. No matter how many tries they get, every guess is wrong. (Low Pass@K.)
    • Scenario B: The student knows the answer deep down, but their first guess is often a careless slip because they are nervous or rambling. Given several tries, they eventually land on the right answer. (High Pass@K, low Accuracy@1.)
  • The Finding: Medical SFT (the textbook study) helps the student actually learn the material, raising both Accuracy@1 and Pass@K.

    • RL's Role: RL is like a coach who helps the student calm down and focus. It doesn't teach them new facts. Instead, it helps them stop making silly first guesses and pick the right answer they already knew was there. It "sharpens" their focus.
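The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the toy disease labels, and the exact-match scoring are all made up for clarity (real medical VQA scoring is typically more forgiving than string equality).

```python
from typing import List

def accuracy_at_1(greedy_answers: List[str], gold: List[str]) -> float:
    """Fraction of questions where the single greedy answer is correct."""
    return sum(a == g for a, g in zip(greedy_answers, gold)) / len(gold)

def pass_at_k(sampled_answers: List[List[str]], gold: List[str]) -> float:
    """Fraction of questions where ANY of the K sampled answers is correct."""
    return sum(g in samples for samples, g in zip(sampled_answers, gold)) / len(gold)

# Toy example: 3 questions, K = 4 samples each.
gold = ["pneumonia", "fracture", "normal"]
greedy = ["pneumonia", "effusion", "effusion"]          # first guesses only
samples = [
    ["pneumonia", "pneumonia", "edema", "pneumonia"],   # knows it
    ["effusion", "fracture", "effusion", "fracture"],   # knows it, greedy slips
    ["effusion", "edema", "effusion", "edema"],         # truly doesn't know it
]
print(accuracy_at_1(greedy, gold))  # 1/3: only the first guess is right
print(pass_at_k(samples, gold))     # 2/3: two questions have the answer in reach
```

The gap between the two numbers (2/3 vs 1/3 here) is exactly the headroom RL can exploit: the answer is in the model's sampling distribution, it just isn't the first thing the model says.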

3. When Does RL Actually Help?

The researchers discovered a specific rule for when RL works:

  • If the student knows the answer (High Pass@K): RL is amazing. It helps them stop hesitating and pick the right answer immediately.
  • If the student is clueless (Low Pass@K): RL is useless. You can't coach someone to pick the right answer if they don't know the material. In fact, forcing RL on a confused student might make them worse because they start guessing randomly to please the coach.

The "MedBridgeRL" Recipe: A Step-by-Step Guide

Based on these findings, the authors propose a new way to train medical AIs, which they call a "Boundary-Aware Recipe." Think of it like building a house:

  1. Step 1: Check the Foundation (Diagnose Support).
    Before you try to polish the house, check if the foundation is solid. Ask the AI: "If I let you try 10 times, can you get the right answer?"

    • If the answer is NO: The foundation is weak. Do not use RL yet. You need more textbooks (SFT) to teach the basics.
  2. Step 2: Bridge the Gap (SFT).
    If the AI is struggling, give it more medical data and supervised training. This is "bridging" the gap. This expands the AI's knowledge so it can find the right answer if it tries hard enough.

  3. Step 3: Sharpen the Edge (RL).
    Once the AI knows the material (High Pass@K), now you use RL. This is the "polishing" phase. RL helps the AI stop rambling and give the correct answer on the very first try.
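The three steps above amount to a simple decision rule, gated on measured Pass@K. The sketch below is illustrative only: `choose_training_stage`, the 0.5 support threshold, and the 0.1 sharpening gap are hypothetical knobs chosen for the example, not values from the paper.

```python
def choose_training_stage(pass_at_k: float, accuracy_at_1: float,
                          support_threshold: float = 0.5,
                          sharpen_gap: float = 0.1) -> str:
    """Boundary-aware gate (illustrative thresholds).

    - Low Pass@K: the model lacks latent support, so do more SFT first.
    - High Pass@K with a wide gap to Accuracy@1: RL can sharpen sampling.
    - Both high: little headroom left for RL on this task.
    """
    if pass_at_k < support_threshold:
        return "sft"   # Step 2: bridge the gap, teach the basics first
    if pass_at_k - accuracy_at_1 > sharpen_gap:
        return "rl"    # Step 3: sharpen, the right answer is already in reach
    return "done"      # greedy answers already match the model's best samples

print(choose_training_stage(0.30, 0.20))  # → sft
print(choose_training_stage(0.80, 0.45))  # → rl
print(choose_training_stage(0.85, 0.82))  # → done
```

The key design choice mirrors the paper's finding: the gate looks at Pass@K before ever considering RL, because applying RL below the support threshold has no correct answers to reinforce.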

The Result: MedBridgeRL

The team tested this recipe. They took a strong medical AI (OctoMed), gave it a little bit of extra "bridging" training, and then applied RL.

  • The Outcome: Their new model became the top performer across six different medical benchmarks. It didn't just get lucky; it learned to be reliable and accurate because they followed the right order: Learn first, then polish.

Summary in One Sentence

RL doesn't teach medical AIs new things; it just helps them stop making mistakes on things they already know, but only if you teach them the basics first.