Imagine you are standing on a busy street corner, trying to spot a friend in a crowd. The problem is that the street is filled with static things: trees, lampposts, buildings, and the road itself. These things never move, but they take up all your attention. If you could magically make all those stationary objects disappear from your vision, your friend would pop out instantly, right?
That is exactly what this paper is about, but instead of human eyes, it's about LiDAR sensors (the "eyes" of self-driving cars) sitting on the side of the road.
Here is the story of their solution, broken down into simple concepts:
1. The Problem: The "Static Noise"
Roadside LiDAR sensors shoot out laser beams to create a 3D map of the world. But most of what they see is boring, static background (the ground, walls, trees). The things we actually care about—cars, pedestrians, cyclists—are tiny specks in a sea of static data.
If the computer tries to find a car by looking at everything, it gets overwhelmed. It's like trying to find a specific red marble in a bucket full of sand. You need to scoop out the sand first.
2. The Old Way vs. The New Way
- The Old Way: Many previous methods were like rigid rulebooks. They worked great for spinning laser scanners (like a lighthouse) but broke if you used a different type of sensor (like a solid-state chip). They were also often "black boxes"—you knew they worked, but you didn't know why.
- The New Way (This Paper): The authors created a method that is fully interpretable. This means you can look at the math and say, "Ah, I see exactly how it decided that point was a car." It's like a clear recipe instead of a magic spell. It also works with any type of laser sensor, whether it spins or stays still.
3. The Secret Sauce: The "Statistical Map" (GDG)
The core of their idea is building a Gaussian Distribution Grid (GDG). Let's use an analogy:
Imagine you are a security guard at a museum. You want to know if someone is sneaking in.
- The Training Phase: First, you stand there for a few minutes when the museum is empty. You take a "snapshot" of the empty room. You don't just memorize the picture; you memorize the average height of the floor in every square foot and how much the floor usually "wiggles" (maybe due to vibrations or wind).
- In the paper: They take a few seconds of "background-only" scans. They calculate the average height and the "wiggle room" (standard deviation) for every little square on the ground.
- The Result: You now have a Statistical Map. You know that in square A, the ground is usually 0 meters high. In square B, there's a wall that is usually 2 meters high.
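The training phase above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the grid is a plain dictionary keyed by cell index, `build_gdg` and `cell_size` are names invented here, and the paper's actual grid layout and parameters may differ.

```python
import numpy as np

def build_gdg(background_frames, cell_size=0.5):
    """Build a Gaussian Distribution Grid: per-cell mean height and
    standard deviation ("wiggle room") from background-only scans.
    A sketch; the paper's exact grid layout may differ."""
    samples = {}  # (ix, iy) cell index -> heights seen during training
    for frame in background_frames:          # each frame: (N, 3) array of x, y, z
        for x, y, z in frame:
            key = (int(x // cell_size), int(y // cell_size))
            samples.setdefault(key, []).append(z)
    # Collapse each cell's height samples into (mean, std)
    return {k: (np.mean(v), np.std(v)) for k, v in samples.items()}
```

Cells that never received a background point simply don't appear in the map, which is exactly what the detection phase exploits next.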
4. The Detection Phase: "Is this new?"
Now, a new car drives by. The sensor takes a new picture. The algorithm looks at every single laser point and asks two simple questions:
- Question 1: "Is there anything here at all?"
- If the Statistical Map says "This square is empty," but the new scan has a point there... BINGO! That's a new object (Foreground).
- Question 2: "Does this point fit the pattern?"
- If the map says "The wall here is usually 2 meters high," and the new scan sees a point at 2.05 meters... That's just the wall. It fits the pattern. Ignore it.
- But if the new scan sees a point at 1.5 meters (where the wall should be) or 3 meters (floating in the air)... BINGO! That's a car or a person. It doesn't fit the "wall pattern."
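Those two questions reduce to a short per-point test. Assuming the statistical map is a dictionary of cell index to (mean, std) as in the training sketch, one plausible rule (the `k`-sigma threshold and the small floor on `std` are illustrative choices, not the paper's exact criterion) looks like this:

```python
def classify_point(gdg, point, cell_size=0.5, k=3.0):
    """Return True if `point` is foreground (a new object).
    Question 1: was this cell empty during training?
    Question 2: does the height fit the cell's background pattern?"""
    x, y, z = point
    key = (int(x // cell_size), int(y // cell_size))
    if key not in gdg:
        return True                    # nothing was ever here: new object
    mean, std = gdg[key]
    # Flag points more than k standard deviations from the background
    # height; the small floor keeps perfectly flat cells from dividing by ~0
    return abs(z - mean) > k * max(std, 0.05)
```

A point at 2.05 m against a 2 m wall falls inside the tolerance and is ignored; a point at 1.5 m or 3 m in the same cell is flagged.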
5. The "Noise Cleaner" (ROR)
Sometimes, the sensor gets a little jittery, or a leaf blows across the laser, creating a single, lonely dot that looks like a tiny object. The algorithm has a final step called Radius Outlier Removal.
Think of this as a "popularity contest." If a point is standing all alone in a crowd with no neighbors, the algorithm assumes it's a glitch and kicks it out. If a group of points is huddled together (like a car), they stay.
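The "popularity contest" can be sketched as a brute-force Radius Outlier Removal pass. Real implementations (e.g. in PCL or Open3D) use a KD-tree for speed; the all-pairs distance check below, and the specific `radius` and `min_neighbors` values, are illustrative assumptions.

```python
import numpy as np

def radius_outlier_removal(points, radius=0.5, min_neighbors=2):
    """Keep only points that have at least `min_neighbors` other points
    within `radius`; lonely points are assumed to be sensor glitches.
    Brute-force O(N^2) sketch for clarity, not efficiency."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        # Count points within the radius, minus the point itself
        if np.count_nonzero(dists <= radius) - 1 >= min_neighbors:
            keep.append(i)
    return points[keep]
```

A tight cluster of points (a car) survives; a single dot from a passing leaf does not.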
6. Why This Matters
- It's Flexible: It works with old spinning sensors and new, tiny chip sensors.
- It's Efficient: You don't need a supercomputer. The authors tested it on a tiny, cheap computer (the size of a credit card) and it worked well.
- It's Honest: Because the math is simple and transparent, engineers and regulators can trust it. They can see exactly why the system flagged a pedestrian.
- It Needs Little Data: You only need a few seconds of "empty road" scans to teach the system what the background looks like.
The Bottom Line
This paper presents a smart, transparent, and flexible way to tell a self-driving car's roadside sensor: "Ignore the trees and the road; only look at the moving things." By using simple statistics instead of complex, unexplainable AI, they made the system safer, faster, and easier to understand.