When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

This paper presents a controlled study of medical Vision-Language Models showing that reinforcement learning mainly sharpens output distributions and improves sampling efficiency, and that it does so only after supervised fine-tuning has established non-trivial reasoning support. These findings motivate a boundary-aware training recipe that achieves strong performance across medical benchmarks.

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati

Published 2026-03-03

The Big Question: Is Reinforcement Learning (RL) Magic?

Imagine you are training a brilliant medical student (an AI) to diagnose diseases from X-rays and lab reports. You have three tools to teach them:

  1. Vision: Showing them thousands of pictures so they learn what a broken bone or a tumor looks like.
  2. Supervised Fine-Tuning (SFT): Giving them a textbook of "Question and Answer" pairs so they learn the right way to talk to doctors.
  3. Reinforcement Learning (RL): A "game" where the student gets a gold star for a correct answer and a red X for a wrong one, encouraging them to figure out the best way to think.

The big question this paper asks is: Does RL actually teach the student new medical knowledge, or does it just help them pick the right answer faster when they already know it?

The authors found that RL is not a magic wand that creates new knowledge. Instead, it's more like a polishing tool.


The Three Key Discoveries (The Story of the Student)

1. The Vision Check: "Can they actually see?"

First, the researchers checked if the AI could actually see the medical images.

  • The Analogy: Imagine giving the student a pair of glasses. If the glasses are blurry, no amount of studying will help them read the fine print.
  • The Finding: The base AI (the student before special training) already has decent "glasses." It can see the images well enough. However, RL does not improve the glasses. If the AI can't see the disease in the picture, RL won't fix that. Only better vision training (SFT) can fix blurry vision.

2. The "Pass@K" Secret: "Do they know the answer but say the wrong one?"

This is the most important part of the paper. The researchers looked at two ways to measure success:

  • Accuracy@1 (Greedy): The student gives a single best first guess (greedy decoding).

  • Pass@K (Latent Support): The student samples K different answers (the paper uses K = 8) and gets credit if any one of them is correct.

  • The Analogy: Imagine a student taking a multiple-choice test.

    • Scenario A: The student doesn't know the answer. No matter how many tries they get, every guess is wrong. (Low Pass@K.)
    • Scenario B: The student knows the answer deep down, but their first guess is often a careless slip because they are nervous or rambling. Given several tries, they eventually land on the right answer. (High Pass@K, low Accuracy@1.)
  • The Finding: Medical SFT (the textbook study) helps the student actually learn the material, raising both Accuracy@1 and Pass@K.

    • RL's Role: RL is like a coach who helps the student calm down and focus. It doesn't teach them new facts. Instead, it helps them stop making silly first guesses and pick the right answer they already knew was there. It "sharpens" their focus.
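The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the toy disease labels, and the exact-match scoring are all made up for clarity (real medical VQA scoring is typically more forgiving than string equality).

```python
from typing import List

def accuracy_at_1(greedy_answers: List[str], gold: List[str]) -> float:
    """Fraction of questions where the single greedy answer is correct."""
    return sum(a == g for a, g in zip(greedy_answers, gold)) / len(gold)

def pass_at_k(sampled_answers: List[List[str]], gold: List[str]) -> float:
    """Fraction of questions where ANY of the K sampled answers is correct."""
    return sum(g in samples for samples, g in zip(sampled_answers, gold)) / len(gold)

# Toy example: 3 questions, K = 4 samples each.
gold = ["pneumonia", "fracture", "normal"]
greedy = ["pneumonia", "effusion", "effusion"]          # first guesses only
samples = [
    ["pneumonia", "pneumonia", "edema", "pneumonia"],   # knows it
    ["effusion", "fracture", "effusion", "fracture"],   # knows it, greedy slips
    ["effusion", "edema", "effusion", "edema"],         # truly doesn't know it
]
print(accuracy_at_1(greedy, gold))  # 1/3: only the first guess is right
print(pass_at_k(samples, gold))     # 2/3: two questions have the answer in reach
```

The gap between the two numbers (2/3 vs 1/3 here) is exactly the headroom RL can exploit: the answer is in the model's sampling distribution, it just isn't the first thing the model says.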

3. When Does RL Actually Help?

The researchers discovered a specific rule for when RL works:

  • If the student knows the answer (High Pass@K): RL is amazing. It helps them stop hesitating and pick the right answer immediately.
  • If the student is clueless (Low Pass@K): RL is useless. You can't coach someone to pick the right answer if they don't know the material. In fact, forcing RL on a confused student might make them worse because they start guessing randomly to please the coach.

The "MedBridgeRL" Recipe: A Step-by-Step Guide

Based on these findings, the authors propose a new way to train medical AIs, which they call a "Boundary-Aware Recipe." Think of it like building a house:

  1. Step 1: Check the Foundation (Diagnose Support).
    Before you try to polish the house, check if the foundation is solid. Ask the AI: "If I let you try 10 times, can you get the right answer?"

    • If the answer is NO: The foundation is weak. Do not use RL yet. You need more textbooks (SFT) to teach the basics.
  2. Step 2: Bridge the Gap (SFT).
    If the AI is struggling, give it more medical data and supervised training. This is "bridging" the gap. This expands the AI's knowledge so it can find the right answer if it tries hard enough.

  3. Step 3: Sharpen the Edge (RL).
    Once the AI knows the material (High Pass@K), now you use RL. This is the "polishing" phase. RL helps the AI stop rambling and give the correct answer on the very first try.
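The three steps above amount to a simple decision rule, gated on measured Pass@K. The sketch below is illustrative only: `choose_training_stage`, the 0.5 support threshold, and the 0.1 sharpening gap are hypothetical knobs chosen for the example, not values from the paper.

```python
def choose_training_stage(pass_at_k: float, accuracy_at_1: float,
                          support_threshold: float = 0.5,
                          sharpen_gap: float = 0.1) -> str:
    """Boundary-aware gate (illustrative thresholds).

    - Low Pass@K: the model lacks latent support, so do more SFT first.
    - High Pass@K with a wide gap to Accuracy@1: RL can sharpen sampling.
    - Both high: little headroom left for RL on this task.
    """
    if pass_at_k < support_threshold:
        return "sft"   # Step 2: bridge the gap, teach the basics first
    if pass_at_k - accuracy_at_1 > sharpen_gap:
        return "rl"    # Step 3: sharpen, the right answer is already in reach
    return "done"      # greedy answers already match the model's best samples

print(choose_training_stage(0.30, 0.20))  # → sft
print(choose_training_stage(0.80, 0.45))  # → rl
print(choose_training_stage(0.85, 0.82))  # → done
```

The key design choice mirrors the paper's finding: the gate looks at Pass@K before ever considering RL, because applying RL below the support threshold has no correct answers to reinforce.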

The Result: MedBridgeRL

The team tested this recipe. They took a strong medical AI (OctoMed), gave it a little bit of extra "bridging" training, and then applied RL.

  • The Outcome: Their new model became the top performer across six different medical benchmarks. It didn't just get lucky; it learned to be reliable and accurate because they followed the right order: Learn first, then polish.

Summary in One Sentence

RL doesn't teach medical AIs new things; it just helps them stop making mistakes on things they already know, but only if you teach them the basics first.