Imagine you are a tiny drone trying to fly through a massive, complex warehouse or a sprawling outdoor field. To navigate safely, you need to "see" how far away things are.
Most drones use LiDAR (like a bat's echolocation, but with laser light instead of sound) or stereo cameras (like human eyes) to see far away. But these are heavy, bulky, and eat up a lot of battery.
Enter the ToF (Time-of-Flight) camera. It's tiny, light, and cheap—perfect for small drones. However, it has a major flaw: it's myopic. It can only "see" clearly up to about 3 to 6 meters (roughly 10–20 feet). Beyond that, the image just turns into a blurry void. If your drone tries to fly into a large room using only this camera, it will crash because it can't see the walls until it's too late.
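To make the "blind beyond a few meters" problem concrete, here is a toy sketch of what happens to a raw ToF depth image: readings past the sensor's reliable range are noise and get masked out. The 4-meter cutoff and all values below are invented for illustration, not taken from the paper.

```python
import numpy as np

# Illustrative cutoff: real ToF sensors lose reliability somewhere
# around 3-6 m; 4.0 m is an assumed value for this toy example.
MAX_RELIABLE_RANGE_M = 4.0

def mask_tof_depth(depth_m: np.ndarray) -> np.ndarray:
    """Zero out returns beyond the sensor's reliable range."""
    valid = (depth_m > 0.0) & (depth_m <= MAX_RELIABLE_RANGE_M)
    return np.where(valid, depth_m, 0.0)

depth = np.array([[1.2, 3.9, 5.5],
                  [0.0, 2.7, 8.1]])
masked = mask_tof_depth(depth)
# The far-away pixels (5.5 m and 8.1 m) become 0 - the "blurry void"
# the drone cannot plan through.
```

Everything past the cutoff is simply gone, which is why the planner behind the camera has nothing to work with.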
This paper introduces ToFormer, a clever system that acts like prescription glasses for a myopic drone, letting it perceive depth far beyond its camera's native 3–6 meter range.
Here is how they did it, broken down into three simple parts:
1. The Problem: The "Blind Spot" and the "Missing Map"
Existing solutions tried to fix this by teaching computers to guess the missing parts of the image. But there was a catch:
- The Training Data was Fake: Most previous AI models were trained on "fake" missing data where the holes were spread out evenly (like a grid).
- The Real World is Messy: Real ToF cameras don't lose data evenly. They lose it in big, weird chunks depending on the material (shiny walls reflect the signal away, dark corners absorb it).
- The Result: Old AI models were like students who only studied for a multiple-choice test with perfect spacing. When they faced the messy, real-world test, they failed.
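The mismatch between tidy synthetic holes and messy real ones can be shown with a toy sketch. Both "sparsity patterns" below are made up purely to illustrate the contrast, not drawn from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.uniform(1.0, 10.0, size=(8, 8))  # pretend ground-truth depth

# "Fake" training sparsity: pixels dropped on a tidy, regular grid.
grid_sparse = dense.copy()
grid_sparse[::2, :] = 0.0  # every other row missing, evenly spaced

# "Real" ToF sparsity: a whole contiguous chunk vanishes, e.g. where a
# shiny wall reflected the signal away. Region chosen arbitrarily.
real_sparse = dense.copy()
real_sparse[2:7, 3:8] = 0.0  # one big irregular hole

# A model trained only on the first pattern never sees holes shaped
# like the second - the multiple-choice student facing the essay exam.
```

The two arrays lose a similar fraction of pixels, but the structure of the missingness is completely different, and that structure is exactly what the old models never learned.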
2. The Solution: A New "School" and a New "Teacher"
Step A: Building the "Real-World School" (The LASER-ToF Dataset)
To teach the AI properly, the researchers built a special robot platform. They didn't just take pictures; they used a LiDAR (the heavy, long-range sensor) to scan the room while the tiny ToF camera took its short-range pictures.
- The Analogy: Imagine the LiDAR is a master painter who can see the whole landscape. The ToF camera is a child who can only see the ground right in front of them.
- The Trick: They used the master painter's view to create a "perfect map" (Ground Truth) for every single picture the child took. This created the LASER-ToF dataset, the first "textbook" that teaches AI how to handle the messy, real-world blind spots of ToF cameras.
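As a rough intuition for how a long-range sensor can label a short-range camera's pictures, here is a minimal pinhole-projection sketch: 3D points (assumed already transformed into the camera's frame) are projected into a depth image. The intrinsics, resolution, and points are all invented; a real pipeline like LASER-ToF also needs the LiDAR-to-camera extrinsic calibration and time synchronization.

```python
import numpy as np

# Toy pinhole camera: made-up intrinsics and a tiny 8x6 image.
W, H = 8, 6
fx = fy = 4.0
cx, cy = W / 2.0, H / 2.0

def project_to_depth(points_cam: np.ndarray) -> np.ndarray:
    """points_cam: (N, 3) array of [x, y, z] in meters, z pointing forward."""
    depth = np.zeros((H, W))
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < W and 0 <= v < H:
            # z-buffer: keep only the closest surface per pixel
            if depth[v, u] == 0 or z < depth[v, u]:
                depth[v, u] = z
    return depth

pts = np.array([[0.0, 0.0, 5.0],   # straight ahead, 5 m away
                [1.0, 0.0, 4.0]])  # slightly to the right, 4 m away
gt = project_to_depth(pts)
```

Run over every frame, this kind of projection turns the "master painter's" scan into a dense ground-truth map for each of the "child's" short-range pictures.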
Step B: The New "Teacher" (The ToFormer Network)
They built a new AI brain called ToFormer. Instead of just looking at the 2D picture, it does three smart things:
- It looks at the "3D Skeleton": It takes the few 3D dots the ToF camera did catch and treats them like a skeleton. It uses a special "3D Branch" to understand the shape of the world, not just the flat image.
- It connects the dots (JPP): Imagine you have a few puzzle pieces scattered on a table. Old AI tried to guess the picture by looking at the pieces one by one. ToFormer uses a "Joint Propagation Pooling" module to instantly connect those scattered pieces to the surrounding empty space, filling in the gaps logically.
- It listens to the drone's own map (Visual SLAM): while flying, the drone already tracks a handful of sparse 3D landmark points to figure out where it is (Visual SLAM — not actual GPS, which often fails indoors). ToFormer can feed these landmarks in as extra long-range hints to fill in the far-away parts of the image, making the prediction even more accurate.
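The paper's Joint Propagation Pooling and SLAM-hint fusion are learned network modules; as a hand-written intuition for "connecting the dots," here is a naive nearest-valid-point fill over a sparse depth map. This is emphatically not the paper's algorithm, just an illustration of sparse 3D evidence (ToF returns, SLAM landmarks) spreading into the empty pixels around it:

```python
import numpy as np

def naive_propagate(sparse_depth: np.ndarray) -> np.ndarray:
    """Fill empty (zero) pixels with the depth of the nearest valid pixel.

    A crude stand-in for the idea behind learned propagation: each
    unknown pixel borrows its value from the closest measurement.
    """
    h, w = sparse_depth.shape
    vs, us = np.nonzero(sparse_depth)  # coordinates of valid pixels
    if len(vs) == 0:
        return sparse_depth.copy()
    filled = np.empty_like(sparse_depth)
    for v in range(h):
        for u in range(w):
            d2 = (vs - v) ** 2 + (us - u) ** 2  # squared pixel distances
            nearest = np.argmin(d2)
            filled[v, u] = sparse_depth[vs[nearest], us[nearest]]
    return filled

sparse = np.zeros((4, 4))
sparse[0, 0] = 2.0   # a close ToF return, 2 m
sparse[3, 3] = 9.0   # a far SLAM landmark, 9 m
dense = naive_propagate(sparse)
# Pixels near (0, 0) end up ~2 m; pixels near (3, 3) end up ~9 m.
```

ToFormer replaces this blunt nearest-neighbor rule with a learned module that also respects the 3D shape of the scene, but the payoff is the same: a few scattered measurements become a full depth image.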
3. The Result: A Super-Drone
The researchers put this system on a real, small drone (a quadrotor) and tested it.
- The Test: They flew the drone through a long corridor and a huge open field.
- Without ToFormer: The drone could only see 3 meters ahead. It had to fly very slowly, stopping frequently to check if a wall was coming. In a dead-end hallway, it flew straight into the wall because it couldn't see the end of the hall.
- With ToFormer: The drone could "see" 15 meters ahead (5x further!). It flew faster, took smoother paths, and successfully avoided dead ends and obstacles it couldn't physically sense yet.
Why This Matters
This isn't just about better math; it's about practicality.
- Lightweight: The system is so efficient it runs on a small computer (Jetson Orin NX) attached to a tiny drone, not a massive server room.
- Cheaper: You don't need expensive, heavy LiDAR sensors on every robot anymore. A cheap ToF camera + this AI software does the job.
- Versatile: It works in factories, warehouses, and outdoors.
In a nutshell: ToFormer takes a camera that is naturally "short-sighted" and gives it "long-range vision" by teaching it to understand the 3D shape of the world and using smart tricks to fill in the blanks. It turns a toy drone into a professional explorer.