BiSe-UNet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation

The paper introduces BiSe-UNet, a lightweight dual-path U-Net architecture that combines an attention-refined context path with a shallow spatial path and a depthwise separable decoder to achieve real-time, high-precision medical image segmentation on resource-constrained edge devices like the Raspberry Pi 5.

M Iffat Hossain, Laura Brattain

Published 2026-03-03

Imagine you are a doctor performing a colonoscopy. You are looking at a live video feed inside a patient's body, searching for tiny, dangerous growths called polyps. To help you, you want a computer program that can instantly highlight these polyps on the screen, drawing a perfect outline around them so you don't miss anything.

This is the problem the paper BiSe-UNet tries to solve. But there's a catch: the computer running this program isn't a giant supercomputer in a lab; it's a tiny, low-power device (like a Raspberry Pi) that might be attached to the medical camera itself. It needs to be fast enough to keep up with the video (30 frames per second) and small enough to fit on the device, without losing accuracy.
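Before looking at the architecture, it helps to put a number on what "real time" demands. The 30 frames-per-second figure comes from the text above; the rest is simple arithmetic:

```python
# Real-time video budget: at 30 frames per second, the entire
# pipeline (capture -> segment -> overlay) must finish each frame
# in roughly a thirtieth of a second.
fps_target = 30
budget_ms = 1000 / fps_target  # milliseconds available per frame
print(f"Latency budget: {budget_ms:.1f} ms per frame")
```

Every design choice in the paper (dual paths, attention, efficient convolutions) exists to squeeze a full segmentation pass under that roughly 33-millisecond ceiling on a Raspberry-Pi-class chip.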

Here is the story of how they built a solution, explained with everyday analogies.

The Problem: The "Heavy" vs. The "Fast"

In the world of AI, there are two types of models for this job:

  1. The Heavyweight Champion (Standard U-Net): This is like a professional construction crew. They build a perfect house (very accurate segmentation), but they take a long time and need a massive truck full of tools (lots of computing power). They are too slow for a live video feed on a small device.
  2. The Speedster (Lightweight Models): These are like a single person with a paintbrush. They are incredibly fast, but because they are rushing, they often miss the corners or paint the lines crookedly. In medicine, a crooked line could mean missing a polyp, which is dangerous.

The authors asked: Can we build a team that is as fast as the speedster but as accurate as the heavyweight?

The Solution: BiSe-UNet (The "Dual-Path" Team)

The authors created a new AI model called BiSe-UNet. Think of it as a two-person team working together to draw the outline of a polyp.

1. The Two Paths (The Eyes of the Team)

Instead of one long, slow brain, this model has two distinct "paths" that look at the image simultaneously:

  • Path A: The "Big Picture" Detective (Context Path)

    • Analogy: Imagine a detective standing on a hill looking at a whole city. They can see the layout of the streets and where the buildings are clustered. They know, "Ah, that shape is likely a house, not a tree."
    • In the model: This path looks at the image from far away (downsampling). It understands the context and the general shape of the polyp but loses the tiny details.
    • The Upgrade: They added an Attention Refinement Module. Think of this as the detective putting on a pair of smart glasses that say, "Hey, look right there! That's the important part!" This helps the model focus on what matters.
  • Path B: The "Fine Detail" Artist (Spatial Path)

    • Analogy: Imagine an artist standing right next to the wall, holding a magnifying glass. They can't see the whole city, but they can see the exact texture of the brick and the tiny crack in the mortar.
    • In the model: This path stays at high resolution. It doesn't shrink the image. It preserves the sharp edges and the exact boundaries of the polyp.
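The two branches can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the paper's exact layers: the pooling factor, channel counts, and the squeeze-and-gate form of the Attention Refinement Module are illustrative assumptions.

```python
import numpy as np

def attention_refine(feat):
    """Attention Refinement Module sketch: per-channel 'smart glasses'.

    Global average pooling summarizes each channel; a sigmoid turns the
    summary into a 0..1 weight that re-scales that channel, emphasizing
    the parts that matter.
    """
    gap = feat.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1) channel summary
    gate = 1.0 / (1.0 + np.exp(-gap))             # sigmoid attention weights
    return feat * gate

def context_path(x, reduce=4):
    """'Big picture' branch: downsample, then refine with attention.

    x has shape (C, H, W). Average-pooling reduce x reduce blocks trades
    detail for a wider view of the scene.
    """
    C, H, W = x.shape
    pooled = x.reshape(C, H // reduce, reduce, W // reduce, reduce).mean(axis=(2, 4))
    return attention_refine(pooled)

def spatial_path(x):
    """'Fine detail' branch: keeps the full H x W resolution (identity here;
    in the real model it is a few shallow, stride-free convolutions)."""
    return x

x = np.random.rand(8, 32, 32)   # toy feature map: 8 channels, 32x32 pixels
ctx = context_path(x)           # (8, 8, 8): coarse, attention-refined context
spa = spatial_path(x)           # (8, 32, 32): sharp, full-resolution detail
```

Note the trade each branch makes: the context path sees a quarter of the resolution but knows "what" it is looking at, while the spatial path keeps every pixel but no wider context.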

2. The Merge (The Handshake)

Usually, these two paths would work separately and then try to combine their notes at the very end, which is messy.

  • BiSe-UNet's Trick: They bring the "Big Picture" Detective and the "Fine Detail" Artist together early in the process.
  • Analogy: It's like the detective points to a spot on the map, and the artist immediately starts sketching the exact outline there. They combine their notes into a single, perfect drawing before they even start the final step. This ensures the outline is both contextually correct and razor-sharp.
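Continuing the NumPy sketch above, the "handshake" amounts to bringing the coarse context features back up to full resolution and stacking them with the spatial features. The nearest-neighbor upsampling and channel concatenation here are illustrative stand-ins for the paper's fusion layer:

```python
import numpy as np

def fuse(ctx, spa):
    """Early-merge sketch: upsample the coarse context map to the spatial
    path's resolution (nearest-neighbor via repeat), then stack the two
    branches along the channel axis so later layers see both at once."""
    _, Hs, Ws = spa.shape
    _, Hc, Wc = ctx.shape
    up = ctx.repeat(Hs // Hc, axis=1).repeat(Ws // Wc, axis=2)  # (C, Hs, Ws)
    return np.concatenate([up, spa], axis=0)                    # (2C, Hs, Ws)

ctx = np.random.rand(8, 8, 8)     # coarse, attention-refined context features
spa = np.random.rand(8, 32, 32)   # full-resolution spatial features
fused = fuse(ctx, spa)            # (16, 32, 32): context and detail together
```

Because the merge happens before decoding rather than at the very end, every decoder layer works on features that are already both contextually informed and pixel-sharp.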

3. The Decoder (The Efficient Builder)

Once the features are combined, the model needs to turn them back into a full-size image.

  • The Problem: Standard building methods are heavy and slow.
  • The Solution: They used Depthwise Separable Convolutions (DSConv).
  • Analogy: Imagine you need to paint a wall.
    • Standard method: You hire a crew that paints the whole wall, then hire another crew to paint the trim, then another for the corners. Lots of people, lots of time.
    • DSConv method: You hire one very efficient painter who knows exactly how to paint the wall and the trim in one smooth, specialized motion. They do 90% of the work with 10% of the effort.
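The savings are easy to verify by counting weights. A standard k×k convolution needs k·k·C_in·C_out weights, while a depthwise separable one splits the job into a k×k filter per input channel (depthwise) plus a 1×1 channel mixer (pointwise). The 256-channel example below is an illustrative size, not a layer from the paper:

```python
def standard_conv_params(c_in, c_out, k=3):
    """Standard k x k convolution: every output channel mixes every input."""
    return k * k * c_in * c_out

def dsconv_params(c_in, c_out, k=3):
    """Depthwise separable: one k x k filter per input channel (depthwise),
    then a 1x1 convolution to mix channels (pointwise)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

std = standard_conv_params(256, 256)   # 589,824 weights
dsc = dsconv_params(256, 256)          # 67,840 weights
print(f"DSConv uses {dsc / std:.1%} of the standard parameters")
```

At this size, the separable version needs about 11.5% of the weights of the standard one, which is where the "same work, a fraction of the effort" framing comes from.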

The Results: Why It Matters

The team tested this new model on the Kvasir-SEG dataset (a collection of 1,000 real colonoscopy images).

  • Accuracy: Its outlines nearly matched those of the giant, slow "Heavyweight" models.
  • Speed: It ran at 30+ frames per second on a Raspberry Pi 5 (a tiny, cheap computer the size of a credit card).
  • Efficiency: It used roughly 90% less computing power than the standard models.

The Bottom Line

BiSe-UNet is like taking a high-end medical camera and making it smart enough to highlight polyps in real-time, without needing a supercomputer. It proves that you don't need a "heavy" brain to do "heavy" work; you just need the right team structure (Dual-Path) and the right tools (Attention + Efficient Building).

This means that in the future, doctors could use affordable, portable devices to get instant, life-saving feedback during procedures, right at the bedside.