GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation

GRAFNet is a biologically inspired deep learning architecture for polyp segmentation in colonoscopy images. By combining multiscale retinal processing with guided cortical attention feedback, it handles the morphological variability of polyps and their visual similarity to surrounding tissue, achieving state-of-the-art segmentation performance.

Abdul Joseph Fofanah, Lian Wen, Alpha Alimamy Kamara, Zhongyi Zhang, David Chen, Albert Patrick Sankoh

Published 2026-02-18

Imagine you are a doctor looking at a video of a colonoscopy (a camera inside the colon). Your job is to find polyps (small growths that could become cancer) and mark them on the screen. This is incredibly hard because:

  • Polyps can be tiny, flat, and blend in perfectly with the surrounding tissue.
  • Normal folds in the colon look suspiciously like polyps.
  • The lighting is tricky, and the camera moves.

Current computer programs (AI) try to do this, but they often act like a camera with a fixed focus. If they zoom out to see the whole picture, they miss the tiny details. If they zoom in to see the details, they lose the context and mistake a normal fold for a polyp. They also work in a straight line: they look at the image once and make a guess, without ever "second-guessing" themselves.

GRAFNet is a new AI system designed to fix this by copying how the human brain actually sees things. Instead of just being a camera, it acts like a team of experts working together.

Here is how it works, using simple analogies:

1. The "Retina" Team (The Multiscale Retinal Module)

Think of the human eye's retina not as a single sensor, but as a team of specialists working in parallel.

  • The Detail Specialist: One part of the team looks at fine textures (like the roughness of a polyp).
  • The Shape Specialist: Another part looks at big shapes and motion (like the overall curve of a fold).
  • The Color Specialist: A third part looks at color contrasts.
  • The "No-Go" Specialist: Just like in your eye, some cells inhibit others to stop the brain from getting confused by too much noise.

In GRAFNet: Instead of forcing the AI to look at everything with one "lens," it splits the image into these different specialist streams. This allows it to see both the tiny texture of a flat polyp and the big shape of the colon at the same time, without getting confused.
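To make the "team of specialists" idea concrete, here is a toy sketch in plain Python. The paper's actual module is a learned neural network; the stream names, the box-filter smoothing, and the 0.1 noise threshold below are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of multiscale parallel streams with inhibition.
# A 1D "image row" is split into a coarse shape stream, a fine detail
# stream, and an inhibited stream that suppresses low-amplitude noise.

def moving_average(signal, window):
    """Smooth a 1D signal with a simple box filter (edge-padded)."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]

def retinal_streams(signal):
    """Split one input into coarse, detail, and inhibited streams."""
    coarse = moving_average(signal, 5)                 # "shape specialist": large context
    detail = [s - c for s, c in zip(signal, coarse)]   # "detail specialist": fine texture
    threshold = 0.1                                    # assumed noise floor
    # "no-go specialist": inhibit weak detail responses so noise cannot dominate
    inhibited = [d if abs(d) > threshold else 0.0 for d in detail]
    return coarse, detail, inhibited

# a flat background with one small bump (a "polyp" in this toy signal)
signal = [0.0, 0.0, 0.05, 0.9, 1.0, 0.95, 0.05, 0.0, 0.0]
coarse, detail, inhibited = retinal_streams(signal)
```

The key point is that all three streams see the same input at the same time, so nothing forces a single trade-off between context and detail.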

2. The "Edge Detective" (The Guided Asymmetric Attention Module)

Imagine you are trying to find a specific shape in a messy room. You don't look at the whole room at once; you look for specific edges and lines.

  • Human brain cells in the visual cortex are tuned to specific angles (horizontal, vertical, diagonal).
  • In GRAFNet: This module acts as a dedicated edge detector. It specifically hunts for the boundaries of polyps. If a normal fold has a smooth curve, the AI ignores it. If there is a jagged, suspicious edge, the AI highlights it. It filters out the "noise" (normal tissue) so the doctor only sees what matters.
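The edge-hunting idea can be sketched as a simple attention gate: compute local edge strength and use it to weight the signal, so smooth regions (gentle folds) are suppressed and sharp boundaries stand out. The real module uses learned, orientation-tuned filters; this finite-difference version is only an assumption-laden illustration.

```python
# Toy sketch of edge-guided attention on a 1D signal:
# attention weight = normalized local gradient magnitude.

def edge_attention(signal):
    """Weight each sample by its local gradient magnitude, normalized to [0, 1]."""
    grads = [abs(signal[i + 1] - signal[i - 1]) / 2.0
             for i in range(1, len(signal) - 1)]
    grads = [grads[0]] + grads + [grads[-1]]     # pad ends to keep length
    peak = max(grads) or 1.0                     # avoid division by zero on flat input
    attention = [g / peak for g in grads]        # 1.0 at the sharpest edge
    attended = [s * a for s, a in zip(signal, attention)]
    return attended, attention

# a flat region, a sharp "polyp boundary", then flat again
signal = [0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.2, 0.2]
attended, attention = edge_attention(signal)
```

Flat stretches get attention near zero (ignored), while the jump around the boundary gets full weight.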

3. The "Manager" (The Guided Cortical Attention Feedback)

This is the most important part. Most AI systems are like a student taking a test: they read the question, write an answer, and hand it in. They never check their work.

  • Human Vision: When you see something ambiguous, your brain sends a signal back from the "thinking" part (cortex) to the "seeing" part (retina) saying, "Wait, that looks like a fold, not a polyp. Look closer." This is called feedback.
  • In GRAFNet: The system works in a loop.
    1. It makes a first guess.
    2. The "Manager" (Cortical Feedback) looks at the big picture and says, "That area looks suspicious, but the context suggests it's just a fold. Let's refine the guess."
    3. The system goes back, adjusts its focus, and checks again.
    4. It repeats this until it is confident.

This "second-guessing" mechanism prevents the AI from making silly mistakes, like thinking a shadow is a polyp.
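The loop above can be sketched as iterative refinement: make a guess, let the "manager" pull it toward contextual evidence, and stop once the guess stabilizes. In the real network the feedback is high-level cortical features re-entering early layers; this scalar toy with an assumed correction rate just shows the control flow.

```python
# Toy sketch of the guess -> feedback -> refine loop.
# The guess is a polyp probability; "evidence" is what the big-picture
# context suggests; each pass nudges the guess toward the evidence.

def cortical_feedback(initial_guess, evidence, rate=0.5, tol=1e-3, max_iters=50):
    """Iteratively pull a guess toward contextual evidence until it converges."""
    guess = initial_guess
    for step in range(1, max_iters + 1):
        refined = guess + rate * (evidence - guess)  # feedback correction
        if abs(refined - guess) < tol:               # confident: stop second-guessing
            return refined, step
        guess = refined
    return guess, max_iters

# first pass says "80% polyp", but context says it is probably a fold (20%)
final, steps = cortical_feedback(0.8, 0.2)
```

After a handful of passes the guess settles near the context-supported answer instead of committing to the first impression.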

Why is this a big deal?

The researchers tested GRAFNet on five different medical datasets (like different hospitals with different cameras and lighting).

  • The Result: It found 3–8% more polyps than the best existing AI, and it was 10–20% better at handling new, unseen data.
  • The "False Alarm" Problem: It made far fewer mistakes where it thought a normal fold was a polyp. This is crucial because false alarms waste doctors' time and cause unnecessary stress for patients.
  • The "Missed" Problem: It was much better at finding those tricky, flat polyps that usually hide in plain sight.

The Bottom Line

GRAFNet is like giving the computer a brain, not just a camera. By mimicking how our eyes and brains work together—using specialists for different details and a manager to check the work—it creates a system that is not only more accurate but also more trustworthy for doctors. It bridges the gap between "mathematically smart" AI and "clinically wise" medical tools.
