SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction

The paper proposes SelfOccFlow, a self-supervised method for end-to-end 3D occupancy flow prediction that eliminates the need for human annotations or external flow supervision by disentangling static and dynamic scenes and leveraging temporal aggregation with a cosine similarity-based flow cue.

Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring

Published 2026-03-02

Imagine you are driving a car. To drive safely, you need to know two things: what is around you (the shape of the world) and what is moving (where the other cars and pedestrians are going).

For a long time, teaching computers to do this was like trying to teach a child to draw by showing them a finished masterpiece and saying, "Copy this exactly." The computer needed expensive, human-made labels for every single frame of video, telling it exactly where every car was and how fast it was moving. This is slow, costly, and hard to scale.

SelfOccFlow is a new method that teaches the computer to learn this skill all by itself, without a teacher. Here is how it works, broken down into simple concepts:

1. The "Static vs. Dynamic" Split

Imagine you are looking out the window of a moving train. The trees and mountains (static objects) seem to slide by, while a bird flying alongside the train (a dynamic object) moves differently.

Old methods tried to figure out the whole scene at once, which got confusing when things moved. SelfOccFlow is smarter. It splits the world into two separate mental maps:

  • The Static Map: This holds the road, buildings, and trees. Since these don't move, the computer can look at them from different angles over time to build a precise 3D model of the road.
  • The Dynamic Map: This holds the cars, people, and bikes. This map is allowed to change and flow.

By separating them, the computer doesn't get confused when a car drives past a building. It knows the building stays put, and the car moves.
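The split above can be illustrated with a toy sketch. This is not the paper's architecture, just a minimal NumPy illustration of the idea: the scene is the union of a static grid and a dynamic grid, and only the dynamic grid changes between frames.

```python
import numpy as np

# Toy static/dynamic decomposition (illustrative, not the paper's model):
# the full scene is the union of a static grid (road, buildings) and a
# dynamic grid (cars, pedestrians).
static = np.zeros((5, 5))
static[0, 0] = 1.0                                      # a fixed building
dynamic_prev = np.zeros((5, 5)); dynamic_prev[2, 1] = 1.0  # car at t-1
dynamic_curr = np.zeros((5, 5)); dynamic_curr[2, 3] = 1.0  # same car at t

scene_prev = np.maximum(static, dynamic_prev)
scene_curr = np.maximum(static, dynamic_curr)

# Across frames, only the dynamic part changes; the static part is shared,
# so it can be aggregated over time without being smeared by motion.
changed = scene_prev != scene_curr
print(changed[0, 0], changed[2, 1], changed[2, 3])  # -> False True True
```

The building's cell never changes, so the static map can safely accumulate observations over time, while all the change is confined to the dynamic map.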

2. Learning by "Time-Traveling" (Temporal Aggregation)

How does the computer learn what's moving without being told? It uses time.

Imagine you are taking a video of a soccer game. If you look at the ball in one frame, then the next frame, and then the one after, you can guess where the ball is going just by seeing how its position changes.

SelfOccFlow does this with 3D space. It looks at the scene at time t, then t-1 (the past), and t+1 (the future).

  • For the Static Map, it stacks these views on top of each other like a deck of cards to make the 3D shape of the road super clear.
  • For the Dynamic Map, it tries to "warp" or stretch the past and future views to match the current view. If the computer has to stretch the image a lot to make the past car match the current car, it learns: "Ah, that car moved fast!" This is how it learns motion without ever seeing a speedometer.
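The "warping" idea for the Dynamic Map can be sketched in a few lines. This is a simplified 2D nearest-neighbor version under my own naming (`warp_with_flow` is a hypothetical helper, not from the paper; real methods use differentiable sampling over 3D voxels), but it shows the self-supervision signal: if the predicted flow is right, the warped past frame matches the current one and the reconstruction error drops to zero.

```python
import numpy as np

def warp_with_flow(past_occ, flow):
    """Warp a past occupancy grid into the current frame with a per-cell
    flow (toy nearest-neighbor sketch, not the paper's implementation)."""
    H, W = past_occ.shape
    warped = np.zeros_like(past_occ)
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y, x]
            ny, nx = y + int(dy), x + int(dx)
            if 0 <= ny < H and 0 <= nx < W:
                warped[ny, nx] = max(warped[ny, nx], past_occ[y, x])
    return warped

# A "car" occupies cell (2, 1) in the past frame and moves 2 cells right.
past = np.zeros((5, 5)); past[2, 1] = 1.0
flow = np.zeros((5, 5, 2)); flow[2, 1] = [0, 2]
curr = np.zeros((5, 5)); curr[2, 3] = 1.0

# With the correct flow, the warped past matches the current frame, so the
# self-supervised reconstruction error is zero -- no ground-truth speed needed.
print(np.abs(warp_with_flow(past, flow) - curr).sum())  # -> 0.0
```

Training then amounts to adjusting the predicted flow until this mismatch is minimized, which is exactly how the model learns motion "without ever seeing a speedometer."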

3. The "Similarity Detective" (The Secret Sauce)

This is the most creative part. Usually, to teach a computer about motion, you need a "ground truth" (a correct answer key). SelfOccFlow doesn't have that. So, it creates its own clues.

Think of the computer's brain as having a "feature map"—a list of descriptions for every part of the image (e.g., "red car," "gray road").

  • The computer looks at a specific spot in the current frame (say, a red car).
  • It then looks at the previous frame and asks: "Where does this 'red car' description look most similar?"
  • If the "red car" description in the current frame matches the spot two pixels to the left in the previous frame, the computer deduces: "The car must have moved two pixels to the right."

It uses cosine similarity (a fancy math way of saying "how much do these two things look alike?") to generate its own "pseudo-labels" (fake but very good guesses) for motion. It's like solving a puzzle by matching patterns rather than reading the instructions.
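The "similarity detective" step can be sketched concretely. The function below is my own minimal NumPy illustration (the name `cosine_similarity_flow` and the exhaustive all-pairs matching are assumptions for clarity, not the paper's implementation): for each cell in the current feature map, find the most cosine-similar cell in the previous map and read off the displacement as a flow pseudo-label.

```python
import numpy as np

def cosine_similarity_flow(curr_feat, prev_feat):
    """Toy flow pseudo-labels from cosine similarity (illustrative sketch).

    curr_feat, prev_feat: (H, W, C) feature maps from two frames. For each
    current cell, find the most similar previous cell and return the
    displacement (previous -> current) as the pseudo flow label.
    """
    H, W, C = curr_feat.shape
    # Normalize features so dot products become cosine similarities.
    cn = curr_feat / (np.linalg.norm(curr_feat, axis=-1, keepdims=True) + 1e-8)
    pn = prev_feat / (np.linalg.norm(prev_feat, axis=-1, keepdims=True) + 1e-8)
    # Similarity of every current cell against every previous cell.
    sim = cn.reshape(H * W, C) @ pn.reshape(H * W, C).T   # (H*W, H*W)
    best = sim.argmax(axis=1)                              # best-matching cell
    py, px = np.divmod(best, W)                            # previous position
    gy, gx = np.mgrid[0:H, 0:W]                            # current position
    return np.stack([gy - py.reshape(H, W),
                     gx - px.reshape(H, W)], axis=-1)

# A distinctive "red car" feature sits at (2, 1) in the previous frame and
# at (2, 3) in the current frame -> pseudo-label says it moved 2 cells right.
prev = np.zeros((5, 5, 4)); prev[2, 1] = [1.0, 0.0, 0.0, 0.0]
curr = np.zeros((5, 5, 4)); curr[2, 3] = [1.0, 0.0, 0.0, 0.0]
print(cosine_similarity_flow(curr, prev)[2, 3])  # -> [0 2]
```

The resulting displacements are exactly the "fake but very good guesses" described above: no human labeled the motion, the network's own feature matches did.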

4. Why This Matters

  • No Expensive Labels: You don't need armies of humans to label videos. The car learns by watching the world move.
  • Sees Behind Obstacles: Because it uses the "Static Map" to build a solid foundation, it can figure out what's behind a parked car (occluded areas) better than previous methods.
  • Faster and Lighter: The paper shows this new method is much less computationally heavy than its competitors. It's like upgrading from a massive supercomputer to a sleek smartphone while getting better results.

The Bottom Line

SelfOccFlow is like teaching a self-driving car to understand the world by giving it a pair of 3D glasses and a time machine. It separates the moving parts from the stationary parts, uses the passage of time to figure out speed, and uses pattern matching to teach itself the rules of motion. It's a major step toward cars that can truly "see" and understand their dynamic environment without needing a human to hold their hand.
