Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Imagine a brilliant student who has read every medical textbook in the world. This student is incredibly smart and can describe pictures in great detail. However, when you show them a picture of a sick stomach taken by an endoscope camera, they often make two big mistakes:

They skip the steps: Instead of looking at the picture like a doctor (checking where it is, what it looks like, and how the tiny blood vessels behave), they just guess the answer immediately.
They get distracted: They might say, "This looks like a tumor because there are bubbles in the background," when actually, the bubbles are just noise, and the real problem is hidden elsewhere.

This paper introduces a new system called CogAlign to fix these mistakes. Think of CogAlign as a rigorous medical residency program for AI.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Smart but Scattered" Student

Current AI models are like that well-read student. They can talk a lot, but they don't follow the strict, logical checklist that real gastroenterologists (stomach doctors) use.

Real Doctors: First, they locate the spot. Second, they look at the shape. Third, they zoom in on the tiny details. Finally, they make a diagnosis.
Old AI: It often jumps straight to the diagnosis or makes things up (hallucinates) because it sees a pattern in the background (like a reflection or a bubble) that has nothing to do with the disease.

2. The Solution: The "CogAlign" Training Program

The authors built a two-step training camp to turn the AI into a disciplined doctor.

Step A: The "Checklist" Lesson (Supervised Fine-Tuning)

Imagine teaching the AI a strict recipe for looking at an endoscopy image.

The Dataset: The researchers created a special library of images where every single picture comes with a "thought process" written out by real experts.
The Rule: The AI isn't allowed to say "It's a polyp" until it has first written down:
1. Location: "This is in the small intestine."
2. Shape: "It looks like a bumpy mushroom."
3. Details: "The blood vessels around it look twisted."
The Result: The AI learns to reason step by step, in a fixed order, just like a human doctor. It can't skip the steps. It has to "show its work" before giving the answer.

Step B: The "What If?" Game (Counterfactual Reinforcement Learning)

This is the cleverest part. The AI still has a bad habit: it sometimes guesses based on the background (like the bubbles or the lighting) instead of the actual disease. To fix this, the researchers play a game of "What If?" with the AI.

The Trick: They take a picture of a sick stomach, but they use a digital "eraser" (a blur) to wipe out the disease, leaving only the background (bubbles, lighting, mucus).
The Test: They show this "erased" picture to the AI and ask, "What is wrong here?"
The Lesson: Since the disease is gone, the AI must say "Nothing is wrong."
- If the AI says, "It's a tumor!" (because it saw the bubbles), it gets a big penalty.
- If the AI says, "It looks normal," it gets a reward.
The Outcome: The AI learns that the bubbles don't matter. It learns that the only thing that matters is the actual lesion. It stops guessing based on distractions and starts focusing on the real evidence.

3. The Results: A New Top-Doctor

After this training, the AI became a much stronger diagnostician.

Accuracy: In the paper's tests, it beat every model it was compared against, including large general-purpose ones like Gemini and GPT.
Complex Cases: It did better than other models at spotting when a patient has two different diseases at the same time, which is something other AIs usually miss.
Robustness: Even when the pictures were messy, blurry, or full of bubbles, the CogAlign AI didn't get confused. It ignored the noise and found the disease.

Summary

In short, CogAlign is like taking a smart but chaotic AI and giving it:

A strict checklist to force it to think like a human doctor.
A magic eraser to teach it to ignore distractions and focus only on the real problem.

The result is an AI that doesn't just guess; it reasons, it checks its work, and it gives reliable diagnoses that doctors can actually trust.

1. Problem Statement

The paper addresses two critical limitations hindering the application of Multimodal Large Language Models (MLLMs) in gastrointestinal (GI) endoscopy:

Clinical Cognition Misalignment: General MLLMs often exhibit scattered reasoning or hallucinations, failing to follow the rigorous, hierarchical cognitive workflow of expert endoscopists (which proceeds from anatomical localization $\to$ morphological evaluation $\to$ micro-detail analysis $\to$ diagnosis).
Lack of Causal Association: Standard supervised fine-tuning (SFT) causes models to rely on spurious background correlations (e.g., image artifacts, modality context, or bubbles) rather than causal pathological features. This leads to brittle performance where models hallucinate diagnoses based on environmental noise rather than actual lesions.

2. Methodology: The CogAlign Framework

The authors propose CogAlign, a two-stage framework designed to enforce strict clinical logic and causal grounding.

Stage 1: Hierarchical Clinical Cognition Dataset & SFT

Dataset Construction: A new dataset was curated containing 24,515 endoscopic images. Unlike standard image-label pairs, this dataset includes hierarchical reasoning chains generated via a semi-automated pipeline (using Gemini 3 Pro as a teacher model) and refined by human experts.
Reasoning Structure: Each sample enforces a three-step reasoning flow before the final label:
1. Anatomical Localization: Identifying the organ segment and imaging conditions.
2. Morphological Evaluation: Assessing macroscopic features (shape, size, color).
3. Micro-detail Analysis: Scrutinizing surface textures and vascular patterns.
Supervised Fine-Tuning (SFT): The model is trained to internalize this structured trajectory, forcing the final diagnosis to be a conditional consequence of the preceding analytical steps.
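
A training target with this structure could be assembled roughly as follows. This is a minimal sketch, not the authors' data format: the section labels in `SECTIONS`, the `ReasoningChain` class, the `to_target` function, and the example text are all hypothetical and chosen only for illustration. The summary above states only that three reasoning steps precede the final label.

```python
from dataclasses import dataclass

# Hypothetical section labels; the paper's actual template may differ.
SECTIONS = (
    "Anatomical Localization",
    "Morphological Evaluation",
    "Micro-detail Analysis",
)


@dataclass
class ReasoningChain:
    localization: str
    morphology: str
    micro_detail: str
    diagnosis: str


def to_target(chain: ReasoningChain) -> str:
    """Serialize the three reasoning steps in fixed order, then the diagnosis."""
    steps = (chain.localization, chain.morphology, chain.micro_detail)
    lines = []
    for index, (name, text) in enumerate(zip(SECTIONS, steps), start=1):
        if not text.strip():
            # A sample without a complete chain cannot teach the ordered workflow.
            raise ValueError(f"missing reasoning step: {name}")
        lines.append(f"{index}. {name}: {text.strip()}")
    # The label always comes last, after all three analytical steps.
    lines.append(f"Diagnosis: {chain.diagnosis.strip()}")
    return "\n".join(lines)


# Made-up example text, for illustration only.
example = ReasoningChain(
    localization="Colon, white-light imaging.",
    morphology="Sessile elevated lesion, color similar to surrounding mucosa.",
    micro_detail="Regular surface pattern, no dilated vessels.",
    diagnosis="polyp",
)
print(to_target(example))
```

Placing the diagnosis last means that, under autoregressive training, the label tokens are always predicted with the three analytical steps already in context, which is one way to read "conditional consequence" above.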

Stage 2: Counterfactual-Driven GRPO for Causal Rectification

To address visual bias, the authors provide a theoretical proof demonstrating that standard SFT converges to "shortcut" solutions relying on low-complexity spurious features ($Z_e$) rather than causal features ($Z_c$). To rectify this, they introduce Counterfactual-Driven Group Relative Policy Optimization (GRPO):

Counterfactual Sample Synthesis: The model generates a bounding box for a lesion, and the boxed region is then smoothed out with a high-intensity Gaussian blur to create a "counterfactual normal" sample ($x_{cf}$). This sample retains the background but removes the lesion.
Reward Mechanism: The model is optimized using a composite reward function:
1. Format Reward ($R_{fmt}$): Enforces the strict three-section output structure.
2. Clinical Cognition Reward ($R_{cog}$): Verifies the presence of specific medical keywords (e.g., "villous," "erosion") in the reasoning chain.
3. Diagnostic Consistency Reward ($R_{diag}$): Ensures the final conclusion matches the ground truth.
Causal Enforcement: If the model predicts a pathology on the counterfactual sample (where the lesion is erased), it receives a severe penalty. This forces the model to ignore background correlations and ground its diagnosis strictly in the lesion features.
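
The counterfactual synthesis and the penalized reward described above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the function names, the grayscale list-of-rows image format, the blur strength `sigma`, the equal weighting of the three reward terms, the keyword cap, and the penalty value `cf_penalty` are not taken from the paper.

```python
import math


def gaussian_kernel(sigma: float) -> list:
    """Normalized 1-D Gaussian weights, truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve one row or column with the kernel, clamping at the borders."""
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(w * values[min(max(i + k - radius, 0), last)]
            for k, w in enumerate(kernel))
        for i in range(len(values))
    ]


def make_counterfactual(image, box, sigma=3.0):
    """Copy a grayscale image and replace the box region with its blurred version.

    box = (x0, y0, x1, y1), with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    kernel = gaussian_kernel(sigma)
    # Separable blur: rows first, then columns.
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    blurred = [list(row) for row in zip(*cols)]
    out = [list(row) for row in image]  # background stays untouched
    for y in range(y0, y1):
        out[y][x0:x1] = blurred[y][x0:x1]
    return out


def composite_reward(format_ok, keyword_hits, predicted, truth,
                     is_counterfactual, cf_penalty=-2.0):
    """Sum of R_fmt, R_cog and R_diag under assumed equal weights."""
    r_fmt = 1.0 if format_ok else 0.0
    r_cog = min(keyword_hits, 3) / 3.0  # assumed cap of three keywords
    if is_counterfactual:
        # Lesion erased: any pathology prediction is penalized.
        r_diag = 1.0 if predicted == "normal" else cf_penalty
    else:
        r_diag = 1.0 if predicted == truth else 0.0
    return r_fmt + r_cog + r_diag


# Toy 9x9 image: dark background with one bright "lesion" pixel.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 100.0
x_cf = make_counterfactual(image, box=(3, 3, 6, 6))
```

In this toy case the bright pixel inside the box is flattened while pixels outside the box keep their original values. On `x_cf`, a response that still predicts a tumor scores lower than one that predicts "normal", which is the signal the causal enforcement step relies on.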

3. Key Contributions

CogAlign Framework: A two-stage training framework bridging general MLLM capabilities with specialized clinical requirements via hierarchical cognitive tuning and counterfactual reinforcement learning.
Hierarchical Dataset: Creation of a large-scale dataset with expert-verified, step-by-step diagnostic reasoning chains, enabling the model to emulate the cognitive flow of senior endoscopists.
Theoretical & Practical Causal Alignment: A theoretical derivation proving SFT's tendency toward spurious shortcuts, coupled with a counterfactual-driven GRPO strategy that mathematically enforces causal grounding.
State-of-the-Art (SoTA) Performance: The approach achieves superior accuracy across multiple benchmarks, particularly in complex multi-label scenarios and noisy environments.

4. Experimental Results

The framework was evaluated on five GI benchmarks (CrohnIPI, GastroVision, HyperKvasir, Kvasir-Capsule, SEE-AI) involving 4,779 test samples.

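Overall Performance: CogAlign (specifically the 8B variant) achieved 67.67% average accuracy, ahead of every compared group: by a wide margin over general-purpose and medical models, and by 1.36 percentage points over the SFT-only baseline: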
- Proprietary General Models (e.g., Gemini 3 Pro: 24.82%, GPT-5 series: <12%).
- Specialized Medical Models (e.g., Hulu-Med-7B: 8.58%).
- Standard SFT baselines (Qwen3-VL-8B SFT: 66.31%).
Multi-Label Diagnosis: CogAlign reached 13.62% accuracy on multi-label SEE-AI samples, which require identifying concurrent pathologies, whereas other models often failed completely (0.00% for Hulu-Med).
Robustness to Noise: In tests with simulated spot interference (bubbles, mucus), CogAlign maintained high accuracy, while SFT-only baselines suffered severe degradation.
Case Studies: Visual analysis showed CogAlign correctly identifying subtle polyps and erosions obscured by noise, whereas baseline models hallucinated "normal" diagnoses or missed lesions entirely.

5. Significance

This work represents a significant shift in medical AI from "black-box" classification to interpretable, causally grounded reasoning. By explicitly modeling the expert cognitive workflow and mathematically penalizing reliance on background artifacts, CogAlign addresses the "hallucination" and "brittleness" issues prevalent in current medical MLLMs. It provides a scalable blueprint for deploying reliable AI in high-stakes clinical environments where diagnostic accuracy and explainability are paramount. The authors have committed to releasing all source code and datasets to foster further research.