Imagine you are trying to teach a brilliant but inexperienced medical student how to diagnose patients. You have two main ways to teach them:
- The "Cramming" Method (SFT): You give the student a stack of answer keys written by top doctors. You say, "Memorize these steps and answers exactly." The student memorizes the text perfectly but might get confused if the question changes slightly or if they have to look at an X-ray while reading the text.
- The "Trial and Error" Method (RLVR): You give the student a pile of medical cases and say, "Try to solve these. If you get the final diagnosis right, you get a gold star. If you get it wrong, you get a red X. You don't need to memorize the steps; just figure out how to get that gold star."
This paper, MedVLThinker, is about discovering that the second method (Trial and Error) is actually much better for teaching AI doctors how to "think" before they answer.
Here is a breakdown of their journey using simple analogies:
1. The Problem: The "Black Box" Doctors
In the world of medical AI, there are powerful models that can look at X-rays and read reports. However, most of them are like black boxes: they give an answer, but we don't know how they got there. Some researchers tried to make them "think" (like a human doctor reasoning through a case), but they kept their recipes secret or used very small, specific datasets. It was like trying to learn to cook by watching a chef who never lets you into the kitchen.
2. The Solution: A Public "Cookbook"
The authors created MedVLThinker, which is essentially a fully open cookbook for building AI doctors that can reason. They didn't just release the final dish (the AI model); they released the ingredients (data), the tools (code), and the step-by-step instructions.
3. The Secret Sauce: Filtering the "Goldilocks" Questions
Imagine you are training a dog. You wouldn't ask it to fetch a ball that is too heavy (impossible) or a ball that is right in its mouth (too easy). You want the "just right" ball.
The researchers did this with medical questions:
- They took thousands of questions.
- They asked a smart AI to answer them 16 times.
- If the AI got it right 16/16 times, the question was too easy (boring).
- If the AI got it wrong 16/16 times, the question was too hard (frustrating).
- They kept only the questions where the AI got it right sometimes but not always. These are the "Goldilocks" questions that force the AI to actually think to improve.
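The filtering loop above is simple enough to sketch in code. This is a minimal illustration, not the authors' released implementation: `questions`, the `model_answer` callable, and the `pass_rate` field are all hypothetical stand-ins for the paper's actual data format and model.

```python
def filter_goldilocks(questions, model_answer, n_samples=16):
    """Keep only questions the model sometimes, but not always, answers correctly.

    `questions`: list of dicts with "prompt" and "answer" keys (assumed format).
    `model_answer`: any callable that returns one sampled answer string.
    """
    kept = []
    for q in questions:
        # Sample the model n_samples times and count how often it is right.
        correct = sum(
            model_answer(q["prompt"]) == q["answer"] for _ in range(n_samples)
        )
        # 0/16 correct = too hard, 16/16 = too easy; keep everything in between.
        if 0 < correct < n_samples:
            kept.append({**q, "pass_rate": correct / n_samples})
    return kept
```

Because questions with a pass rate strictly between 0 and 1 are the only ones where the reward signal can still push the model in either direction, they are exactly the ones worth spending training compute on.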
4. The Big Surprise: Text is Better Than Pictures?
This is the most counter-intuitive part of the paper.
- The Expectation: Since doctors look at X-rays and MRIs, you'd think training an AI on images + text would be the best way to teach it medical reasoning.
- The Reality: The researchers found that training the AI on text-only reasoning (like reading a medical textbook and solving logic puzzles) actually made it better at looking at images later.
The Analogy: Imagine teaching someone to drive a car.
- Image-Text Training: You put them in a real car with a steering wheel and a road, but the instructor keeps shouting confusing instructions. They get overwhelmed.
- Text-Only Training: You have them sit in a chair and mentally simulate driving, solving traffic scenarios, and learning the rules of the road in their head.
- The Result: When you finally put them in the real car (the image), the person who practiced the mental simulation (text-only) drove better than the one who just stared at the road (image-text). The "mental muscle" for reasoning was stronger.
5. The "Thinker" vs. The "Mimicker"
Returning to the opening analogy, the paper compared the two training styles head-to-head:
- SFT (The Mimicker): The AI copies the reasoning steps of a super-smart teacher. It learns to sound smart but doesn't necessarily learn how to think.
- RLVR (The Thinker): The AI tries to solve the problem on its own. It gets a reward only when the final answer is correct. This forces the AI to develop its own internal logic to get the reward.
The Verdict: The "Thinker" (RLVR) consistently beat the "Mimicker" (SFT). The AI learned to reason, not just to parrot.
6. The Result: A Small AI That Acts Like a Giant
The researchers built a model called MedVLThinker-32B.
- It is an open-source model (free for anyone to use).
- It is smaller than the most famous commercial AI (GPT-4o), which is like a "closed-source giant."
- The Shock: MedVLThinker-32B performed just as well as the expensive, closed GPT-4o on medical tests.
Why This Matters
This paper is a game-changer because it proves you don't need a billion-dollar budget or secret data to build a top-tier medical AI. You just need:
- Good data filtering (finding the right difficulty level).
- The right training method (letting the AI think and get rewarded for being right, rather than just copying).
- Openness (sharing the recipe so everyone can improve).
It's like showing the world that you can build a Ferrari engine in your garage if you have the right blueprint, rather than needing to buy one from a secret factory.