Imagine you are driving a car, and you need to know not just where things are, but where they are going and how fast. Is that pedestrian stepping off the curb? Is that car merging into your lane? In the world of self-driving cars and robots, estimating this 3D motion for every point in a scene is called Scene Flow.
For a long time, computers have tried to solve this puzzle using two different "senses," but both had flaws:
- The Camera (RGB): Like a human eye, it sees beautiful colors and textures. But if it's foggy, dark, or looking at a blank white wall, it gets confused and can't tell how far away things are.
- The LiDAR: This is like a bat using sonar. It shoots out laser beams to measure exact distances in 3D. It works great in the dark, but the data is "sparse" (like a low-resolution dot-matrix printout) and lacks color or texture. It struggles to tell the difference between a flat white wall and a white car.
The Problem: The "One-Sided" Approach
Previous methods tried to solve this using only the camera or only the LiDAR.
- Camera-only methods are like trying to guess the speed of a car by looking at a blurry photo; they get the texture right but often mess up the distance.
- LiDAR-only methods are like trying to navigate a maze using only a few scattered dots; they know the distance but get lost on flat, featureless surfaces.
The Solution: SF3D-RGB (The "Super-Translator")
The authors of this paper built a new system, SF3D-RGB, that acts like a perfect translator between these two senses. Instead of forcing the camera to act like a laser or the laser to act like a camera, it lets each do what it's best at and then combines the results.
Here is how their system works, step-by-step, using a simple analogy:
1. The Two Specialists (Feature Extraction)
Imagine you are hiring two detectives to solve a crime.
- Detective RGB looks at the crime scene photos. They are great at spotting patterns, colors, and textures. They build a detailed "mental map" of what things look like.
- Detective LiDAR looks at the laser scan. They are great at measuring exact distances and 3D shapes. They build a precise "skeleton" of where things are.
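In a typical two-branch design (a sketch of the general idea, not the paper's exact architecture), each sensor gets its own small network: a CNN-style encoder for the image and a PointNet-style shared MLP for the point cloud. The random weights below are stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
W_img = rng.standard_normal((3, 8))   # stand-in for learned CNN weights
W_pts = rng.standard_normal((3, 8))   # stand-in for learned MLP weights

def image_encoder(rgb):
    """(H, W, 3) pixels -> (H, W, 8) appearance features (1x1 conv + ReLU)."""
    return np.maximum(rgb @ W_img, 0)

def point_encoder(xyz):
    """(N, 3) points -> (N, 8) geometric features (shared MLP, PointNet-style)."""
    return np.maximum(xyz @ W_pts, 0)
```

The key point is that the two branches never share weights: each "detective" builds features suited to its own sensor before any fusion happens.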
2. The Handshake (Fusion)
In the past, these detectives might have tried to work in separate rooms and just shouted their conclusions to each other. That's inefficient.
SF3D-RGB brings them into the same room. It takes the "skeleton" from the LiDAR detective and projects the "texture" from the RGB detective onto it.
- Analogy: Imagine taking a wireframe model of a car (LiDAR) and painting it with a high-definition photo (RGB). Now you have a model that knows exactly where the car is and what it looks like. This creates a "super-feature" that is stronger than either one alone.
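Concretely, the "painting" step amounts to projecting each 3D point into the image with the camera intrinsics and sampling the image feature at that pixel. Here is a minimal numpy sketch (nearest-neighbour sampling; the function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_point_image_features(points, point_feats, image_feats, K):
    """Paint each LiDAR point with the image feature at its projected pixel.

    points      : (N, 3) 3D points in the camera frame (z > 0).
    point_feats : (N, Cp) per-point geometric features.
    image_feats : (H, W, Ci) dense image feature map (e.g. from a CNN).
    K           : (3, 3) camera intrinsic matrix.
    Returns (N, Cp + Ci) fused per-point features.
    """
    uvw = points @ K.T                       # perspective projection: K @ p
    u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
    H, W, _ = image_feats.shape
    u = np.clip(u, 0, W - 1)                 # keep points inside the image
    v = np.clip(v, 0, H - 1)
    sampled = image_feats[v, u]              # nearest-neighbour sampling
    return np.concatenate([point_feats, sampled], axis=1)
```

Each point ends up carrying both its geometry (from LiDAR) and its appearance (from the camera), which is the "super-feature" the analogy describes.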
3. The Matchmaker (Graph Matching & Optimal Transport)
Now the system needs to figure out how things moved between two moments in time (Frame A and Frame B).
- The Old Way: Some systems tried to check every single point against every other point. This is like trying to find a specific person in a crowd of a million people by asking everyone, "Are you him?" It's slow and computationally heavy.
- The SF3D-RGB Way: They use a mathematical trick called Optimal Transport (specifically the Sinkhorn algorithm).
- Analogy: Imagine you have a pile of red blocks (Frame A) and a pile of blue blocks (Frame B). You need to move the red blocks to match the blue ones with the least amount of effort. The algorithm acts like a super-efficient logistics manager. It doesn't guess; it calculates the most efficient way to "transport" the points from one frame to the next, creating a "matching matrix" that tells the system exactly which point moved where.
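The Sinkhorn algorithm itself is just a few lines of alternating row/column rescaling. Below is a minimal numpy sketch of entropy-regularised optimal transport with uniform mass on both frames (a generic implementation of the technique, not the paper's exact formulation):

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropy-regularised optimal transport via Sinkhorn iterations.

    cost : (N, M) pairwise matching cost between Frame A and Frame B points.
    Returns the transport plan P: a soft matching matrix whose rows and
    columns carry (approximately) uniform mass.
    """
    K = np.exp(-cost / eps)                           # Gibbs kernel
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])   # uniform mass, Frame A
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])   # uniform mass, Frame B
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                               # rescale rows to mass a
        v = b / (K.T @ u)                             # rescale columns to mass b
    return u[:, None] * K * v[None, :]                # transport plan P
```

Row i of the resulting plan concentrates its mass on the Frame B points that point i most plausibly moved to, so the expected displacement under P gives an initial flow estimate.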
4. The Polish (Refinement)
Even the best matchmaker makes small mistakes. The final step is a "Refinement Module."
- Analogy: Think of this like a spell-checker or a photo editor. The system looks at its initial guess, sees where it was slightly off, and makes tiny adjustments to smooth out the motion. This ensures the final result is crisp and accurate.
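One common way such a refinement can work (a hypothetical sketch of local smoothing, not necessarily the paper's module) is to average each point's predicted flow with that of its nearest neighbours: points on the same rigid object should move together, so this damps isolated bad matches.

```python
import numpy as np

def smooth_flow(points, flow, k=3):
    """Replace each point's flow by the mean over its k nearest neighbours
    (including itself). Nearby points on a rigid object share one motion,
    so isolated outlier predictions get pulled toward the local consensus."""
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per point
    return flow[idx].mean(axis=1)
```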
Why is this a Big Deal?
The paper highlights three major wins for SF3D-RGB:
- Accuracy: By combining the "eyes" (RGB) and the "ruler" (LiDAR), the system estimates motion much more accurately than either sensor alone. It handles tricky situations (like a car driving into a shadow) much better.
- Efficiency: Many other systems that try to do this are like supercomputers—they need massive, expensive graphics cards to run. SF3D-RGB is "lightweight." It's like a smart, compact car that gets great gas mileage. It achieves high accuracy with fewer "parameters" (brain cells) and runs faster on standard hardware.
- Real-World Ready: They tested it on real driving data (from the KITTI dataset), not just fake computer simulations. It proved that this method works on actual roads with real cars and pedestrians.
The Bottom Line
SF3D-RGB is a clever new way to teach computers to "see" motion in 3D. Instead of relying on a single, imperfect sense, it fuses the rich detail of a camera with the precise distance of a laser scanner. It does this efficiently, making it a strong candidate for the next generation of self-driving cars and robots that need to understand the world around them quickly and accurately.