A Contrastive Few-Shot RGB-D Traversability Segmentation Framework for Indoor Robotic Navigation

This paper proposes a contrastive few-shot RGB-D segmentation framework that integrates sparse 1D laser depth with a negative contrastive learning branch to significantly improve indoor traversability detection and obstacle avoidance compared to existing methods.

Qiyuan An, Tuan Dang, Fillia Makedon

Published 2026-03-10

Imagine you are teaching a robot to walk through a busy house without bumping into anything. This is the core challenge of indoor robotic navigation. The robot needs to know exactly where it can walk (the "free space") and where it cannot (obstacles like chairs, tables, or walls).

This paper presents a new, smarter way to teach robots this skill, especially when you don't have thousands of labeled examples to show them. Here is the breakdown using simple analogies:

1. The Problem: The "Invisible" Chair Leg

Most robots today rely on cameras (vision) to see the world. But a plain camera image is flat: judging where the floor ends and an obstacle begins from it alone is like guessing distances in a photograph.

  • The Issue: Cameras are great at seeing big things like walls or sofas. But they are terrible at spotting thin, tricky things like the slender legs of a dining chair or a wire on the floor.
  • The Risk: If a robot doesn't see a chair leg, it might trip, fall, or knock the chair over.
  • The Old Solution: Some robots use 3D depth cameras (like a high-tech version of a flashlight that measures distance). But these are expensive, heavy, and rare on everyday robots. Most real-world robots (like vacuum cleaners or delivery bots) only have a simple 1D laser scanner—a single line of laser that sweeps back and forth. It's like having a single ruler instead of a full 3D map.

2. The New Idea: "Learning from Few Examples" (Few-Shot Learning)

Usually, to teach a robot to recognize a floor, you need to show it thousands of photos of floors. This is expensive and slow.

  • The Analogy: Imagine trying to teach a child what a "dog" is. The old way is to show them a photo encyclopedia of 10,000 dogs. The Few-Shot way is to show them just one or five pictures of a dog and say, "This is a dog. Now, find the dog in this new picture."
  • The Challenge: If you only show the robot one picture of a carpet, it might think all carpets are safe to walk on, even if the new room has a slippery tile floor that looks similar. It gets "stuck" on that one example.
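The standard building block behind this few-shot setup is "masked average pooling": take the support image's feature map, keep only the pixels inside the labeled mask, and average them into a single prototype vector for that class. Here is a minimal numpy sketch of that idea; the function name and the toy feature values are illustrative, not from the paper.

```python
import numpy as np

def masked_average_prototype(features, mask):
    """Average the feature vectors of the pixels inside the mask.

    features: (H, W, C) feature map from a (frozen) backbone
    mask:     (H, W) binary mask, 1 where the class (e.g. floor) is
    Returns a single (C,) prototype vector summarizing the class.
    """
    weights = mask[..., None]                      # (H, W, 1)
    total = (features * weights).sum(axis=(0, 1))  # sum of masked features
    count = weights.sum()                          # number of masked pixels
    return total / max(count, 1e-6)                # avoid divide-by-zero

# Toy example: a 2x2 feature map with 3-channel features.
feats = np.array([[[1., 0., 0.], [0., 1., 0.]],
                  [[0., 0., 1.], [1., 1., 1.]]])
mask = np.array([[1, 0],
                 [0, 1]])                          # two "floor" pixels
proto = masked_average_prototype(feats, mask)
print(proto)  # average of [1,0,0] and [1,1,1] -> [1.  0.5 0.5]
```

At query time, each pixel of the new image is compared against this prototype, which is why a single unrepresentative support example (the carpet vs. tile problem above) can mislead the robot.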

3. The Solution: The "Good Cop, Bad Cop" Team

The authors created a framework called NCL (Negative Contrastive Learning). Think of the robot's brain as a detective team with two agents:

  • Agent A (The Positive Prototype / The "Good Cop"): This agent looks at the "Support" image (the example you gave it) and says, "Look! This is a safe floor. Find things that look like this!"
  • Agent B (The Negative Prototype / The "Bad Cop"): This is the paper's big innovation. Instead of just looking for the floor, this agent looks at the obstacles in the example and says, "Look! These are chair legs and walls. Do NOT walk here."

Why is this better?
If you only have the "Good Cop," the robot might get confused between a white wall and a white floor. But if you have the "Bad Cop" shouting, "That looks like a wall! Don't go there!", the robot becomes much smarter. It learns by knowing what not to touch, not just what to touch.
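The "two agents" idea can be sketched as comparing each pixel's feature against both prototypes and only labeling it traversable if it matches the positive (floor) prototype more strongly than the negative (obstacle) one. The cosine-similarity rule and the toy vectors below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_pixel(feature, pos_proto, neg_proto):
    """Traversable only if the pixel looks more like the positive
    (floor) prototype than the negative (obstacle) prototype."""
    return cosine(feature, pos_proto) > cosine(feature, neg_proto)

pos = np.array([1.0, 0.1])     # hypothetical "floor" prototype
neg = np.array([0.1, 1.0])     # hypothetical "obstacle" prototype
floor_like = np.array([0.9, 0.2])
wall_like  = np.array([0.2, 0.8])
print(classify_pixel(floor_like, pos, neg))  # True
print(classify_pixel(wall_like, pos, neg))   # False
```

Without the negative prototype, the decision would rest only on "how floor-like is this pixel?", which is exactly where a white wall and a white floor get confused.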

4. The Magic Glue: The "Two-Stage Attention"

There's a technical hurdle: The robot's laser scanner gives a single line of data (1D), but the camera gives a full picture (2D). They don't naturally line up.

  • The Analogy: Imagine trying to paste a long strip of stickers (the laser data) onto a square poster (the camera image), but the stickers are stretched and crooked.
  • The Fix: The authors built a special "glue" module (the Two-Stage Attention).
    1. Horizontal Alignment: It first stretches the laser line to match the width of the picture.
    2. Vertical Alignment: Then, it intelligently projects that line up and down to fill the whole height of the picture, guessing where the floor and ceiling are based on the laser's distance.
  • The Payoff: This allows the robot to "see" the 3D shape of the room using only a cheap, single-line laser.
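The two stages above can be sketched in a few lines of numpy: resample the 1D scan to the image width (stage 1), then broadcast each column's range down the image with a vertical weighting (stage 2). The paper learns this vertical projection with attention; the fixed "distance-from-horizon" weighting below is a crude stand-in, and all names and numbers are illustrative.

```python
import numpy as np

def laser_to_image_plane(scan, width, height, horizon=0.5):
    """Lift a 1-D laser scan into a 2-D depth-like map.

    scan: (N,) range readings in meters.
    Returns an (height, width) array aligned to the camera image.
    """
    # Stage 1 (horizontal): interpolate N readings onto `width` columns.
    src = np.linspace(0.0, 1.0, len(scan))
    dst = np.linspace(0.0, 1.0, width)
    row = np.interp(dst, src, scan)                  # (width,)

    # Stage 2 (vertical): spread each column's value over the image
    # height, weighted by closeness to an assumed horizon row. In the
    # real model this weighting is a learned attention, not a fixed rule.
    rows = np.arange(height) / max(height - 1, 1)    # 0 at top, 1 at bottom
    weight = 1.0 - np.abs(rows - horizon)            # peaks at the horizon
    return weight[:, None] * row[None, :]            # (height, width)

scan = np.array([2.0, 2.5, 3.0, 2.5, 2.0])           # toy 5-beam scan
depth_map = laser_to_image_plane(scan, width=8, height=6)
print(depth_map.shape)  # (6, 8)
```

The output has the same spatial layout as the camera image, so it can be concatenated with the RGB features and fed to the segmentation head.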

5. The Results: Safer Navigation

The team tested this on a custom dataset of indoor rooms.

  • The Outcome: Their robot could spot thin chair legs that other robots missed.
  • The Score: In tests where the robot only saw 1 or 5 examples of a room, their method was 9% more accurate than the best existing methods.
  • Efficiency: They achieved this without needing a supercomputer. Because they only "taught" the specific glue and the "Good/Bad Cop" agents (leaving the rest of the brain frozen), it was fast and cheap to run.

Summary

This paper is about teaching robots to navigate indoor spaces safely using cheap sensors and very little training data.

  • They use a single laser line instead of expensive 3D cameras.
  • They teach the robot using few examples (1 or 5 images).
  • They use a "Good Cop, Bad Cop" strategy, teaching the robot to recognize both the floor and the obstacles to avoid confusion.
  • The result is a robot that is much less likely to trip over a chair leg, making it safer for hospitals, hotels, and homes.