PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds

PointSlice introduces a novel slice-based representation and a Slice Interaction Network that convert 3D point clouds into sets of 2D data slices. The result is a better balance between detection accuracy and efficiency: significantly fewer parameters and lower inference time, with competitive performance on major autonomous driving benchmarks.

Liu Qifeng, Zhao Dawei, Dong Yabo, Xiao Liang, Wang Juan, Min Chen, Li Fuyang, Jiang Weizhong, Lu Dongming, Nie Yiming

Published 2026-03-10

Imagine you are trying to understand a giant, 3D sculpture made of thousands of floating dust motes (this is a point cloud from a car's LiDAR sensor). Your goal is to find cars, pedestrians, and cyclists hidden inside this cloud so an autonomous vehicle can drive safely.

For a long time, researchers had two main ways to look at this sculpture, and both had a major flaw:

  1. The "Voxel" Method (The High-Res 3D Puzzle): They chopped the entire 3D space into tiny, 3D cubes (like a giant 3D Rubik's cube). This is incredibly accurate because it sees every little detail in 3D. But, it's like trying to solve a 3D puzzle while wearing heavy winter gloves. It's slow and computationally expensive.
  2. The "Pillar" Method (The Flat Shadow): They grouped the 3D dust motes into vertical columns and squashed each column flat, collapsing the height dimension. This is super fast because it's easier to process, but it loses the vertical detail. It's like looking at the sculpture's shadow: you can tell something is there, but you might not know whether it's a tall truck or a short pedestrian.
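To make the trade-off concrete, here is a tiny illustrative sketch (not the paper's code; all sizes and values are assumptions) of how the two representations bin the same LiDAR points. Voxels quantise all three axes, pillars only x and y:

```python
import numpy as np

# Hypothetical toy point cloud: N points with (x, y, z) in metres.
rng = np.random.default_rng(0)
points = rng.uniform([0, 0, 0], [40.0, 40.0, 4.0], size=(1000, 3))

cell = 0.4  # assumed cell size in metres

# Voxel method: quantise x, y AND z -> many small 3D cells.
voxel_idx = np.floor(points / cell).astype(int)          # (N, 3) cell coords
n_voxels = len({tuple(v) for v in voxel_idx})

# Pillar method: quantise x and y only -> the z axis is collapsed.
pillar_idx = np.floor(points[:, :2] / cell).astype(int)  # (N, 2) column coords
n_pillars = len({tuple(p) for p in pillar_idx})

# Pillars never outnumber voxels: each occupied voxel projects into
# exactly one pillar, so pillars are cheaper but lose vertical detail.
assert n_pillars <= n_voxels
```

The occupied-cell counts are what drive cost: the voxel grid keeps height information but has far more cells to feed through a 3D network.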

Enter PointSlice: The "Sliced Bread" Solution

The authors of this paper, PointSlice, asked a simple question: "What if we could have the speed of the flat shadow but the accuracy of the 3D puzzle?"

Their answer is PointSlice, which treats the 3D point cloud like a loaf of bread.

The Core Idea: Slicing the Loaf

Instead of looking at the whole 3D loaf at once (slow) or squashing it flat (inaccurate), PointSlice slices the loaf horizontally into many thin, 2D slices.

  • The Analogy: Imagine you have a 3D model of a building. Instead of trying to analyze the whole building in 3D, you take a knife and slice it into 50 horizontal layers (like a layer cake).
  • The Magic: Now, instead of using a slow, complex 3D brain to look at the whole building, you can use a fast, 2D brain (which is what our eyes and standard computer chips are great at) to look at each slice individually. You process all 50 slices very quickly, just like flipping through pages in a book.

The Problem: Losing the "Story"

If you just look at each slice of the cake separately, you lose the connection between them. You might see a slice with a "wheel" and another slice with a "roof," but you don't know they belong to the same car. If you treat them as totally separate 2D images, you lose the 3D shape.

The Solution: The "Slice Interaction Network" (SIN)

This is the secret sauce of PointSlice.

  • The Analogy: Imagine you have 50 people looking at 50 different slices of the cake. They are all working fast, but they aren't talking to each other.
  • The Fix: PointSlice adds a special "communication channel" called the Slice Interaction Network (SIN). Every few steps, the network pauses, gathers all the slices back together, and lets them "talk" to each other.
  • How it works: It briefly reassembles the slices into a 3D shape just enough to say, "Hey, the wheel in slice #5 and the roof in slice #10 are part of the same car!" Then, it slices them back up to keep processing fast.
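The interaction idea can be sketched as follows. This is an illustrative stand-in, not the paper's SIN: here a fixed blur along the slice axis plays the role of the learned cross-slice exchange, just to show that slices can share evidence without giving up their 2D shape.

```python
import numpy as np

# Per-slice 2D feature maps: (slices, channels, H, W), assumed sizes.
rng = np.random.default_rng(2)
feats = rng.normal(size=(8, 16, 50, 50))

def slice_interaction(x, kernel=np.array([0.25, 0.5, 0.25])):
    """Blend each slice with its vertical neighbours -- a hypothetical
    stand-in for the learned exchange in a Slice Interaction Network."""
    # Pad along the slice axis so edge slices have neighbours too.
    padded = np.pad(x, ((1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    return sum(k * padded[i:i + len(x)] for i, k in enumerate(kernel))

mixed = slice_interaction(feats)

# The per-slice shape is unchanged, so fast 2D processing can resume
# immediately after the slices have "talked" to each other.
assert mixed.shape == feats.shape
```

The design point this illustrates: the interaction step only touches the slice axis, so it briefly links the layers into a 3D whole and then hands back the same cheap per-slice 2D tensors.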

Why is this a Big Deal?

The paper shows that PointSlice is the "Goldilocks" of 3D detection:

  1. It's Fast: Because it mostly uses 2D processing (like looking at flat pictures), it runs 13% faster than the most accurate 3D methods currently available.
  2. It's Accurate: Because it uses the "communication channel" (SIN) to stitch the slices back together, it is almost as accurate as the slow, heavy 3D methods.
  3. It's Efficient: It uses about 20% fewer parameters (model weights) than the top competitors. This is huge for self-driving cars, which have limited computing power on board.

Real-World Results

The team tested this on three massive datasets (Waymo, nuScenes, and Argoverse 2).

  • On the Waymo dataset, their model was faster and used less memory than the best 3D model, with almost no drop in accuracy.
  • On nuScenes, it set a new state-of-the-art for accuracy while still being very efficient.

The Bottom Line

PointSlice is like realizing you don't need to build a massive 3D hologram to understand a room. You just need a stack of high-quality cross-section photos (the slices), plus a smart way for an AI to compare them quickly, and it can recover the 3D structure.

It solves the age-old trade-off in self-driving cars: You no longer have to choose between being fast or being smart. PointSlice lets you be both.