RDFC-GAN: RGB-Depth Fusion CycleGAN for Indoor Depth Completion

The paper proposes RDFC-GAN, a novel two-branch end-to-end fusion network that combines a Manhattan world-guided encoder-decoder with an RGB-depth fusion CycleGAN to effectively complete large missing regions in indoor depth maps by leveraging RGB imagery and pseudo-depth supervision.

Haowen Wang, Zhengping Che, Yufan Yang, Mingyuan Wang, Zhiyuan Xu, Xiuquan Qiao, Mengshi Qi, Feifei Feng, Jian Tang

Published 2026-02-24

The Big Problem: The "Ghostly" Room

Imagine you are trying to build a 3D model of your living room using a special camera (like a Kinect or a robot's eye). You expect to see the walls, the sofa, and the coffee table.

But instead, the camera gives you a map full of holes.

  • Glass windows? The camera sees right through them, leaving a blank spot.
  • Shiny mirrors or black velvet? The light bounces away or gets absorbed, so the camera thinks there is nothing there.
  • Far corners? The signal gets too weak to measure.

The result is a "depth map" (a picture of how far away things are) that looks like Swiss cheese. This is a nightmare for robots trying to navigate or for augmented reality apps trying to place a virtual chair in your room. They don't know where the floor ends and the wall begins.
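As a toy illustration (not the paper's code), a holey depth map can be modeled as an array where failed pixels read 0, and the first thing any completion method needs is a mask of which pixels are missing:

```python
import numpy as np

# Toy 4x4 depth map in meters; 0.0 marks pixels where the sensor failed
# (e.g. glass, mirrors, or out-of-range corners).
depth = np.array([
    [2.1, 2.0, 0.0, 0.0],   # a glass window in the top-right corner
    [2.2, 2.1, 0.0, 0.0],
    [1.5, 1.4, 1.3, 1.2],
    [1.1, 1.1, 1.0, 0.9],
])

hole_mask = depth == 0.0
missing_fraction = hole_mask.mean()
print(f"{missing_fraction:.0%} of pixels are missing")  # prints "25% of pixels are missing"
```

Real indoor scans can lose far larger, contiguous regions than this, which is exactly what makes simple nearest-pixel interpolation fail.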

The Solution: The "Two-Chef" Kitchen

The authors of this paper built a new AI system called RDFC-GAN to fix these holes. Think of it as a kitchen with two expert chefs working together to cook the perfect meal (the complete depth map).

Chef 1: The "Architect" (The MCN Branch)

  • Who they are: This chef is a stickler for rules and geometry. They know that most houses are built with straight lines, right angles, and flat surfaces (this is called the Manhattan World Assumption—like a city grid).
  • What they do: They look at the raw, holey data and say, "Okay, this wall must be vertical, and this floor must be flat." They use the RGB image (the color photo) to guess the orientation of the walls.
  • The Result: They produce a depth map that is structurally correct and smooth. It knows where the walls should be, but it might look a bit blurry or lack fine details (like the texture of a brick wall).
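The Manhattan World Assumption can be made concrete with a small sketch. This is our simplification, not the paper's MCN: snap a noisy surface-normal estimate to the nearest of three mutually orthogonal "Manhattan" axes (floor/ceiling plus two wall directions):

```python
import numpy as np

# The three dominant directions of a Manhattan-world room:
MANHATTAN_AXES = np.array([
    [1.0, 0.0, 0.0],   # wall direction A
    [0.0, 1.0, 0.0],   # wall direction B
    [0.0, 0.0, 1.0],   # floor / ceiling
])

def snap_to_manhattan(normal):
    """Replace a noisy unit normal with the closest Manhattan axis (up to sign)."""
    normal = normal / np.linalg.norm(normal)
    scores = np.abs(MANHATTAN_AXES @ normal)          # |cosine| with each axis
    best = MANHATTAN_AXES[np.argmax(scores)]
    return best if normal @ best >= 0 else -best

noisy_wall_normal = np.array([0.95, 0.05, -0.1])
print(snap_to_manhattan(noisy_wall_normal))           # prints [1. 0. 0.]
```

The real branch learns these wall/floor orientations from the RGB image rather than hard-snapping them, but the regularizing effect is the same: surfaces are pulled toward clean, axis-aligned geometry.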

Chef 2: The "Artist" (The RDFC-GAN Branch)

  • Who they are: This chef is a creative genius who loves texture and detail. They are trained using a special technique called a CycleGAN (a type of AI that learns to translate one style of image into another).
  • What they do: They look at the color photo and say, "If I see a wooden door here, the depth map should look like wood, not just a flat gray blob." They try to "paint" the missing depth values by mimicking the textures in the color photo.
  • The Result: They produce a depth map that is rich in detail and looks realistic, but sometimes they might get a little carried away and add "noise" or make things look a bit wobbly.
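The CycleGAN idea in miniature: two translators, G (RGB to depth) and F (depth back to RGB), are trained so that F(G(rgb)) lands back on the original photo. This cycle-consistency loss stops G from inventing depth unrelated to the image. The scalar "generators" below are hypothetical stand-ins for the real deep networks:

```python
import numpy as np

def G(rgb):        # hypothetical RGB -> depth translator (stand-in for a CNN)
    return rgb * 0.5

def F(depth):      # hypothetical depth -> RGB translator (stand-in for a CNN)
    return depth * 2.0

rgb = np.array([0.2, 0.8, 0.5])
# L1 cycle-consistency loss: how far F(G(rgb)) drifts from the input.
cycle_loss = np.abs(F(G(rgb)) - rgb).mean()
print(cycle_loss)  # prints 0.0 — this toy pair forms a perfect cycle
```

In training, this loss is minimized alongside the usual adversarial losses, so the depth "translation" stays faithful to the textures in the color photo.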

The "Taste Tester" (The Fusion Head)

Now, you have two dishes: one is structurally perfect but bland, and the other is flavorful but messy. You need a Taste Tester to combine them.

  • The system uses a special module called W-AdaIN (Weighted Adaptive Instance Normalization) to mix the two chefs' outputs.
  • It acts like a smart editor: "In this area, the Architect is right (it's a flat wall), so I'll use their version. In this area, the Artist is right (it's a complex chair), so I'll use their version."
  • The result is a final depth map that is both structurally sound and full of realistic details.
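A simplified take on the fusion step (the exact W-AdaIN formulation is in the paper; this only shows the flavor): normalize one branch's features, re-scale them with the other branch's statistics, then blend per-pixel with a weight map that says which chef to trust where:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Shift/scale `content` to match the mean and std of `style` (AdaIN)."""
    normalized = (content - content.mean()) / (content.std() + eps)
    return normalized * style.std() + style.mean()

structure = np.array([[2.0, 2.0], [1.0, 1.0]])   # Architect: smooth, correct layout
texture   = np.array([[2.2, 1.8], [1.1, 0.7]])   # Artist: detailed but noisy
w         = np.array([[0.9, 0.9], [0.3, 0.3]])   # trust map (hand-picked here; learned in the paper)

# Weighted blend: lean on structure where w is high, on re-styled texture where it is low.
fused = w * structure + (1 - w) * adain(texture, structure)
```

Where `w` is high (the flat wall), the output hugs the Architect's smooth prediction; where it is low (the complex chair), the Artist's detail dominates.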

The Secret Ingredient: "Fake" Training Data

One of the biggest hurdles in training these AI chefs is data: you rarely have paired examples of the same room with realistic holes and with perfect ground-truth depth, and real sensor failures are messy in unpredictable ways.

The authors invented a way to create "Pseudo Depth Maps" (fake holey maps) for training:

  1. The "Highlight" Trick: They look for shiny spots in the color photo and pretend the depth sensor failed there (because shiny things confuse sensors).
  2. The "Dark" Trick: They look for black areas and pretend the sensor failed there (because dark things absorb light).
  3. The "Glass" Trick: They use AI to find windows and mirrors in the photo and erase the depth data there.

By training the chefs on these simulated disasters, the AI learns exactly how to fix the real-world problems it will face later.
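The "Highlight" and "Dark" tricks can be sketched in a few lines. The thresholds and random data below are our assumptions for illustration, not the paper's values: erase ground-truth depth wherever the image is very bright (specular highlights) or very dark (light-absorbing surfaces):

```python
import numpy as np

rng = np.random.default_rng(0)
gray  = rng.uniform(0.0, 1.0, size=(8, 8))     # stand-in grayscale image
depth = rng.uniform(0.5, 4.0, size=(8, 8))     # complete ground-truth depth (meters)

highlight_mask = gray > 0.95    # "Highlight" trick: shiny spots confuse sensors
dark_mask      = gray < 0.05    # "Dark" trick: black surfaces absorb the signal

# Pseudo depth map: pretend the sensor failed at those pixels.
pseudo_depth = depth.copy()
pseudo_depth[highlight_mask | dark_mask] = 0.0
```

The "Glass" trick works the same way, except the mask comes from a separate model that detects windows and mirrors in the photo. The network then trains to recover `depth` from `pseudo_depth` plus the RGB image.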

Why This Matters

Previous methods tried to fix these holes by just "guessing" based on nearby pixels, which often resulted in blurry, smeared images.

RDFC-GAN is special because:

  1. It respects the rules of architecture (straight walls, flat floors).
  2. It respects the art of texture (wood grain, fabric, glass).
  3. It trains on realistic "disasters" rather than random holes.

The Bottom Line

Imagine trying to finish a jigsaw puzzle where half the pieces are missing.

  • Old methods tried to fill the gaps with a blurry marker.
  • RDFC-GAN brings in an Architect to draw the straight lines, an Artist to paint the details, and a Smart Editor to glue them together perfectly.

The result? A robot can finally "see" the room clearly, avoiding glass doors and navigating around furniture without crashing. This makes indoor navigation, robot vacuuming, and augmented reality much safer and more accurate.
