CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

This paper introduces CORE-Seg, a reinforcement-learning-driven framework that pairs a Semantic-Guided Prompt Adapter with a progressive SFT-to-GRPO training strategy to bridge the gap between visual segmentation and cognitive reasoning for complex medical lesions. It achieves state-of-the-art performance on the newly proposed ComLesion-14K Chain-of-Thought benchmark.

Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu

Published 2026-03-09

🏥 The Big Problem: "The Blind Spot" in Medical AI

Imagine you have a robot doctor. For years, this robot has been great at finding obvious things, like spotting a broken bone in an X-ray or finding a healthy liver. It works like a pattern matcher: "If it looks like a liver, I'll draw a box around it."

But when the robot encounters a complex, messy disease (like a weirdly shaped tumor hidden in noisy, blurry images), it gets confused. It tries to guess based on what it thinks a tumor usually looks like, rather than actually thinking about what it sees. It's like a student who memorized the answers to a math test but fails when the teacher changes the numbers slightly.

Current AI models are either:

  1. Too smart but blind: They can talk a lot about medicine but can't point to the exact spot on the image.
  2. Too good at pointing but dumb: They can draw a box around a spot, but they can't explain why it's a tumor or handle tricky, blurry cases.

💡 The Solution: CORE-Seg (The "Detective" Robot)

The researchers built a new AI called CORE-Seg. Think of it not as a robot that just "sees," but as a medical detective.

Instead of just looking at a picture and guessing, this detective follows a strict three-step process:

  1. Observe: "I see a dark, blurry spot here."
  2. Reason: "In a healthy body, this area should be bright. The fact that it's dark and irregular suggests a tumor."
  3. Act: "Okay, I'm going to draw the outline around this specific spot."

This paper introduces a system that forces the AI to think before it acts, just like a human doctor does.


🛠️ How They Built It: The Three Magic Ingredients

To teach this robot to be a detective, the team did three amazing things:

1. The "Hard Mode" Training Manual (ComLesion-14K)

Imagine you are training a pilot. If you only let them fly in perfect weather on a clear runway, they will crash when it rains.

  • What they did: The researchers created a massive new dataset called ComLesion-14K. Instead of easy, clear pictures, they gathered 14,000 cases of messy, difficult, and confusing medical images (blurry, noisy, weird shapes).
  • The Analogy: They didn't just give the AI a textbook; they threw it into a storm simulator. They also added "thought bubbles" (Chain-of-Thought) to every image, showing the AI exactly how a human expert reasoned through the mess.
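To make the "thought bubbles" concrete, here is a toy sketch of what a single Chain-of-Thought training record might look like. The field names and the `<think>` prompt format are illustrative assumptions, not ComLesion-14K's actual schema:

```python
# Hypothetical sketch of one ComLesion-14K-style training record.
# All field names and paths are illustrative assumptions.
sample = {
    "image_path": "scans/case_0042.png",      # a noisy, blurry medical image
    "question": "Segment the lesion in this scan.",
    "chain_of_thought": [                     # expert-style "thought bubbles"
        "Observe: a dark, irregular region in the upper-left quadrant.",
        "Reason: healthy tissue here should appear bright; darkness plus "
        "irregular borders suggests a lesion.",
        "Act: outline the dark irregular region.",
    ],
    "mask_path": "masks/case_0042.png",       # ground-truth segmentation mask
}

def format_prompt(record):
    """Fold the reasoning steps into a supervised fine-tuning target."""
    reasoning = "\n".join(record["chain_of_thought"])
    return f"{record['question']}\n<think>\n{reasoning}\n</think>\n<segment>"

print(format_prompt(sample))
```

Training on records like this is what teaches the model to write out its observe-reason-act steps before it ever draws a mask.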

2. The "Translator" Bridge (Semantic-Guided Prompt Adapter)

The AI has two brains: one that speaks Language (reasoning) and one that sees Images (segmentation). Usually, these two don't talk to each other well.

  • What they did: They built a special "translator" module. When the Language brain thinks, "This looks like a tumor because it's irregular," the Translator instantly converts that thought into a visual signal for the Image brain.
  • The Analogy: Imagine a conductor (the Reasoning) and an orchestra (the Segmentation). Before, the conductor just waved a stick, and the orchestra guessed what to play. Now, the conductor has a magic walkie-talkie that tells the orchestra exactly which notes to hit, ensuring they play the right tune together.

3. The "Coach" with a Smart Scorecard (Reinforcement Learning)

You can't just teach a robot once and hope it gets it right. It needs practice and feedback.

  • What they did: They used a training method called Reinforcement Learning. The AI tries to solve a case, and a "Coach" (a reward system) gives it points.
    • The Trick: Usually, if the AI misses the tumor completely, it gets zero points and stops learning. The researchers invented a Smart Scorecard that gives partial credit even if the AI is close but not perfect.
    • The Analogy: If you are learning to shoot a basketball, and you miss the hoop but hit the backboard, a normal coach says "0 points, try again." This new Coach says, "Good! You hit the backboard. Next time, aim 2 inches higher." This keeps the AI motivated and learning even when it fails.
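The "hit the backboard" idea can be sketched as a reward function: full credit is the overlap (IoU) score, but a complete miss still earns a small reward that shrinks with the distance between the predicted and true lesion centers. The exact decay form below is an illustrative assumption, not the paper's actual reward:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def partial_credit_reward(pred_mask, gt_mask, decay=0.05):
    """Sketch of a 'smart scorecard' (assumed form, not the paper's exact
    reward): overlap earns the IoU score; a near miss earns a small,
    distance-decayed reward so the policy keeps learning even on failure."""
    score = iou(pred_mask, gt_mask)
    if score > 0:
        return score
    if pred_mask.sum() == 0 or gt_mask.sum() == 0:
        return 0.0                       # blank answers earn nothing
    pred_center = np.argwhere(pred_mask).mean(axis=0)
    gt_center = np.argwhere(gt_mask).mean(axis=0)
    dist = np.linalg.norm(pred_center - gt_center)
    return 0.1 * np.exp(-decay * dist)   # "you hit the backboard" credit

# A ground-truth lesion, a near miss, and a partial overlap:
gt = np.zeros((64, 64), dtype=bool); gt[10:20, 10:20] = True
near_miss = np.zeros_like(gt); near_miss[25:35, 10:20] = True
overlap = np.zeros_like(gt); overlap[12:22, 10:20] = True
print(partial_credit_reward(near_miss, gt))  # small but nonzero
print(partial_credit_reward(overlap, gt))    # the IoU score
```

Because the near miss scores above zero, gradient-based policy updates (e.g. GRPO) still get a usable signal on the hardest cases, instead of a flat wall of zeros.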

🏆 The Results: Why It Matters

When they tested this new "Detective Robot" against the best existing AI models:

  • It won by a landslide: It was 15% more accurate than the second-best model. In the world of medical AI, that's like going from a C-grade student to an A+ valedictorian.
  • It rarely gives up: Other models often fail completely (giving a blank answer) when the image is hard. This new model only failed 18% of the time, whereas others failed much more often.
  • It explains itself: Because it reasons first, it can tell you why it found the tumor, which is crucial for doctors to trust the AI.

🚀 The Bottom Line

This paper is about teaching AI to stop guessing and start thinking.

By creating a "hard mode" training set, building a bridge between language and vision, and using a smart coaching system that rewards progress even in failure, the researchers created CORE-Seg. It's a step toward AI that doesn't just see pixels, but understands the story behind the disease, making it a safer and more reliable partner for doctors.