Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

This paper introduces ARMed, a novel reinforcement learning framework that mitigates reward collapse through adaptive semantic rewards and chain-of-thought supervision to significantly enhance open-ended medical reasoning in vision-language models.

Yizhou Liu, Dingkang Yang, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Jingwei Wei, Lihua Zhang

Published 2026-03-03

🩺 The Big Picture: Teaching a Robot Doctor to "Think"

Imagine you are training a very smart robot to be a doctor. You want it to look at an X-ray or a microscope slide and explain what it sees and why, just like a human specialist would.

The problem is that most AI models today are like parrots. They are great at memorizing answers to multiple-choice questions (e.g., "Is this a tumor? A, B, or C?"). But when you ask them an open-ended question (e.g., "Describe the abnormalities in this lung and explain the likely cause"), they often just guess or repeat patterns they've seen before, without truly understanding the medical logic.

This paper introduces a new training method called ARMed to fix this. It teaches the robot to think deeply and reason through complex medical problems, not just memorize answers.


🚧 The Problem: The "Flatline" Reward (Reward Collapse)

To train an AI, we use a system called Reinforcement Learning. Think of this like training a dog:

  • If the dog sits, you give it a treat (a reward).
  • If it jumps, you give it a "no" (a low reward).

In medical AI, the "treat" is a score given by a computer program that checks if the AI's answer is correct.

The Trap:
In the past, researchers used simple ways to score answers.

  • The "Word Count" Trap: If the AI says "There is bleeding" and the correct answer is "There is bleeding in the liver," a simple checker might give them the same score because they share the words "bleeding." But in medicine, where the bleeding is matters!
  • The "Semantic Flatline": Newer checkers try to understand meaning. But they often fall into a trap called Reward Collapse. Imagine a teacher grading 100 essays. If the grading rubric is too vague, the teacher might give everyone a score of 95/100, even if one essay is brilliant and another is nonsense.
    • Result: The AI gets confused. "Wait, my bad answer got the same score as my good answer? I'll just keep doing what I was doing." The learning stops because the "treat" isn't special enough to motivate change.
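To make the "flatline" concrete, here is a minimal numeric sketch (an illustration, not the paper's actual scorer): policy-gradient training typically learns from how each answer's reward compares to the group average, so when every answer scores nearly the same, that comparison signal shrinks toward zero.

```python
# Illustration of reward collapse: when a semantic scorer gives nearly
# identical rewards to good and bad answers, the centered "advantage"
# signal used in policy-gradient training is too weak to drive learning.

def advantages(rewards):
    """Center rewards around the group mean (group-relative advantage)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A healthy reward spread: clear signal about which answer is better.
healthy = [0.2, 0.5, 0.9]
# Collapsed rewards: brilliant and nonsense answers score almost the same.
collapsed = [0.94, 0.95, 0.96]

print(advantages(healthy))    # large gaps -> strong learning signal
print(advantages(collapsed))  # tiny gaps  -> learning stalls
```

The gap between the best and worst advantage is what pushes the model to change; in the collapsed case that gap is fifty times smaller, so the "treat" stops teaching anything.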

💡 The Solution: ARMed (The "Smart Coach")

The authors created ARMed (Adaptive Reinforcement for Medical Reasoning) to act like a strict but fair coach who knows exactly how to grade a medical student.

Here is how it works in three simple steps:

1. The "Study Hall" Phase (Supervised Fine-Tuning)

Before the robot starts playing the game, it sits in a study hall. It reads thousands of examples where human doctors wrote out their step-by-step reasoning (Chain-of-Thought).

  • Analogy: Instead of just memorizing the answer key, the robot learns how a doctor thinks: "I see a shadow here, which usually means fluid, but the shape suggests a tumor, so I need to check the edges..."
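Conceptually, each supervised training example pairs the image and question with a doctor-style reasoning trace before the final answer. The field names and case details below are made up for illustration, not the paper's actual schema:

```python
# Illustrative shape of one chain-of-thought fine-tuning example.
# Field names and the clinical content are invented for this sketch.
example = {
    "image": "chest_xray_0421.png",  # the medical image being described
    "question": "Describe the abnormalities and the likely cause.",
    "reasoning": (                   # the step-by-step thinking to imitate
        "There is an opacity in the right lower lobe. "
        "Its borders are ill-defined, which favors consolidation "
        "over a mass, so pneumonia is the most likely cause."
    ),
    "answer": "Right lower lobe consolidation, likely pneumonia.",
}

# During supervised fine-tuning the model is trained to reproduce the
# reasoning trace *and* the answer, not just the final label.
print(example["answer"])
```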

2. The "Adaptive Grading" Phase (The Secret Sauce)

This is the most important part. When the robot tries to answer a question, ARMed doesn't just give a static score. It uses a dynamic, adaptive reward system.

  • The Problem it Solves: If the robot gives a slightly wrong answer, a normal system might say "90/100." If it gives a great answer, it might say "92/100." The difference is too small to matter.
  • The ARMed Fix: ARMed looks at all the answers the robot generated in that moment. It asks: "Which one is truly better?"
    • If the robot gives a vague answer, ARMed says: "That's a 40/100. You missed the point!"
    • If the robot gives a precise, medically accurate answer, ARMed says: "That's a 98/100! You nailed the nuance!"
    • The Metaphor: Imagine a music teacher. A bad teacher says, "Good job" to everyone. A smart teacher (ARMed) listens closely. If you play a note slightly flat, they say, "That was off." If you play it perfectly, they say, "That was beautiful!" This clear distinction tells the student exactly what to fix.
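The adaptive idea above can be sketched in a few lines. This is a hedged approximation, not ARMed's exact formula: rescale each batch of raw semantic scores so that the best answer in the group lands near 1 and the worst near 0, restoring contrast even when the raw scores are bunched together.

```python
# Hedged sketch of adaptive reward rescaling (min-max normalization
# within a group of sampled answers; the paper's actual mechanism may
# differ). Small but real quality gaps become large, trainable gaps.

def adaptive_rescale(raw_scores, eps=1e-8):
    """Stretch a group's rewards across [0, 1]: the best answer
    scores ~1.0 and the worst ~0.0, regardless of how tightly
    the raw semantic scores were clustered."""
    lo, hi = min(raw_scores), max(raw_scores)
    return [(s - lo) / (hi - lo + eps) for s in raw_scores]

# Raw semantic scores for three candidate answers are nearly flat...
raw = [0.94, 0.95, 0.96]
# ...but after adaptive rescaling the ranking is unmistakable.
print(adaptive_rescale(raw))  # roughly [0.0, 0.5, 1.0]
```

This is the music-teacher metaphor in code: the ordering of answers never changes, but the feedback gap between "slightly flat" and "beautiful" becomes loud enough to learn from.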

3. The "Knowledge Injection" Phase

Sometimes, the robot gets too confident in its own bad habits (like guessing the same answer for every lung question). ARMed has a special trick: it forces the robot to look at a "library" of diverse medical cases.

  • Analogy: If the robot keeps saying "It's pneumonia" for every cough, the coach pulls out a book of rare diseases and says, "Look, sometimes it's this rare thing. You need to consider all possibilities." This stops the robot from being lazy.
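One plausible way to implement this trigger (an assumption for illustration, not the paper's exact method): watch for the model's sampled answers collapsing to a single guess, and when that happens, mix reference cases from a diverse case library into the next prompt.

```python
import random

# Hedged sketch of "knowledge injection". The trigger condition, the
# library, and both function names are invented for this illustration.

CASE_LIBRARY = [
    "pneumonia", "tuberculosis", "pulmonary edema",
    "lung abscess", "sarcoidosis",
]

def needs_injection(sampled_answers, threshold=2):
    """Fire when the model keeps giving near-identical answers."""
    return len(set(sampled_answers)) < threshold

def inject_cases(prompt, k=2, rng=random):
    """Append k reference cases from the library to the prompt."""
    hints = rng.sample(CASE_LIBRARY, k)
    return prompt + " Consider also: " + ", ".join(hints) + "."

answers = ["pneumonia", "pneumonia", "pneumonia"]
if needs_injection(answers):
    print(inject_cases("Describe the likely cause of this opacity."))
```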

🏆 The Results: Why It Matters

The researchers tested ARMed on six different medical benchmarks (question sets that play the role of board exams for an AI).

  • Before ARMed: The AI was like a student who memorized the test answers but failed when the questions were phrased differently.
  • With ARMed: The AI became like a residency-trained doctor. It could explain why it made a diagnosis, handle tricky questions, and give answers that were not just "correct" but clinically safe and logical.

🌟 The Takeaway

This paper solves a major headache in AI: How do you teach a machine to understand the subtle, life-or-death differences in medicine?

By creating a "Smart Coach" (ARMed) that gives clear, distinct, and fair feedback (fixing the "Reward Collapse"), the AI learns to reason deeply. It moves from being a "parrot" that repeats words to a "thinker" that understands the story behind the medical image.

In short: ARMed teaches AI to stop guessing and start thinking like a doctor, ensuring that when it speaks, it speaks with the precision required to save lives.