Imagine you are trying to teach a robot to understand a room just by looking at a cloud of floating dots (a "point cloud") that represents the furniture, walls, and floor. This is a huge challenge because, unlike a photo, which is a neat grid of pixels, these dots are scattered randomly in 3D space.
For a long time, the best way to teach robots to do this was to use complex "MLP" (Multi-Layer Perceptron) networks. Think of these networks as a team of workers trying to figure out what the dots mean. However, these teams were often slow, expensive to run, and their inner workings were a bit of a black box.
This paper introduces a new, smarter way to organize these workers, called HPENet. Here is how it works, explained with simple analogies:
1. The Two-Stage Strategy: "The Rough Draft and The Polish"
The authors realized that all the best point-cloud models actually do two distinct things, but they often mix them up. They propose a clear two-stage process called ABS-REF (Abstraction and Refinement).
- Stage 1: Abstraction (The "Rough Draft"): Imagine a sculptor looking at a pile of clay. First, they pick out the most important chunks and throw away the rest to get the general shape. In the robot's world, this is where the system picks key points and groups nearby dots together to understand the "big picture" (e.g., "This is a chair").
- Stage 2: Refinement (The "Polish"): Once the rough shape is there, the sculptor goes back in with a fine brush to smooth out the edges and add details. In the robot's world, this stage takes that rough shape and polishes it without changing the number of points, making the details sharper (e.g., "This is specifically a wooden armchair, not a plastic stool").
The Insight: Old models were great at the "Rough Draft" but terrible at the "Polish." Newer models (like Transformers) were great at polishing but slow and heavy. HPENet combines the best of both: it does a quick rough draft and then a very efficient polish.
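The "draft then polish" contract can be sketched in a few lines of NumPy. This is a toy illustration of the ABS-REF idea under assumptions of mine (farthest-point sampling for the key points, max-pooling over neighbors, one shared layer for the polish), not HPENet's actual code:

```python
import numpy as np

def farthest_point_sample(points, m):
    """Greedily pick m well-spread key points (the 'rough draft' skeleton)."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return points[chosen]

def abstraction(points, feats, m, k):
    """Stage 1: shrink to m key points; each one summarizes its k nearest dots."""
    centers = farthest_point_sample(points, m)
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]       # k nearest original dots per center
    pooled = feats[idx].max(axis=1)          # (m, c) one summary per region
    return centers, pooled

def refinement(feats, w):
    """Stage 2: polish every point's features WITHOUT changing the point count."""
    return np.maximum(feats @ w, 0)          # a single shared MLP layer (ReLU)

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3)).astype(np.float32)
fts = rng.normal(size=(1024, 8)).astype(np.float32)

centers, pooled = abstraction(pts, fts, m=256, k=16)    # fewer points, richer features
polished = refinement(pooled, rng.normal(size=(8, 8)))  # same count, sharper features
```

The key contract is visible in the shapes: abstraction shrinks 1,024 dots to 256 summaries, while refinement keeps the count at 256 and only reworks the features.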
2. The Magic Compass: High-Dimensional Positional Encoding (HPE)
Point clouds have a unique problem: the dots don't have labels like "top," "bottom," or "left." They just have X, Y, and Z coordinates.
- The Old Way: Previous models treated these coordinates like a simple address label; they just stuck the raw numbers next to the data. It was like handing the robot a map with street names but no street numbers.
- The New Way (HPE): The authors invented a "High-Dimensional Positional Encoding." Imagine taking that simple 3D address and translating it into a complex, multi-layered language that the computer understands perfectly. It's like giving the robot a 3D compass that doesn't just say "North," but tells it exactly how the shape curves, tilts, and relates to its neighbors. This allows the robot to understand the geometry of the object much better, even if the object is rotated or moved.
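The paper's exact encoding isn't reproduced here, but a common way to lift a raw (x, y, z) address into a richer, higher-dimensional code is with sinusoidal (Fourier) features. This toy sketch shows the flavor of the translation; the function name and frequency count are my own choices:

```python
import numpy as np

def high_dim_encode(xyz, n_freqs=4):
    """Lift raw (x, y, z) coordinates into a multi-layered positional code by
    projecting each axis onto sines and cosines at several frequencies."""
    freqs = 2.0 ** np.arange(n_freqs)        # frequencies 1, 2, 4, 8
    angles = xyz[..., None] * freqs          # (n, 3, n_freqs)
    code = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return code.reshape(xyz.shape[0], -1)    # (n, 3 * 2 * n_freqs)

pts = np.random.default_rng(1).normal(size=(1024, 3))
enc = high_dim_encode(pts)   # each point now carries 24 numbers instead of 3
```

Each point's 3-number address becomes a 24-number description, which gives the downstream layers a much richer sense of where that point sits relative to everything else.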
3. The Efficient Team: Non-Local MLPs
In the old "Rough Draft" stage, the workers (the neural network) would only talk to their immediate neighbors. It was like a game of "telephone" where information gets lost if the chain is too long.
- The Change: HPENet introduces Non-Local MLPs. Imagine instead of whispering to the person next to you, the workers can instantly shout across the room to anyone they need to. This allows the robot to understand the whole shape at once, not just the tiny piece it's standing on.
- The Result: This makes the system much faster (using less computer power) because it doesn't need to do as many repetitive calculations to get the same result.
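One way to picture the "shout across the room" in code: alongside each worker's own local feature, mix in a single global summary of all points, computed once. This is an illustrative sketch of the non-local idea; the function name, pooling choice, and shapes are mine, not the paper's:

```python
import numpy as np

def non_local_mlp(feats, w_local, w_global):
    """Mix every point's own feature with a summary of ALL points,
    so information crosses the whole shape in one step instead of
    trickling neighbor-to-neighbor."""
    global_summary = feats.max(axis=0, keepdims=True)    # one 'shout' heard by all
    mixed = feats @ w_local + global_summary @ w_global  # broadcast to every point
    return np.maximum(mixed, 0)

rng = np.random.default_rng(2)
f = rng.normal(size=(512, 16))
out = non_local_mlp(f, rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
```

Note the efficiency angle: the global summary is computed once and reused for all 512 points, rather than each point repeatedly querying its neighbors.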
4. The Feedback Loop: Backward Fusion Module (BFM)
Usually, in these systems, information flows one way: from the "big picture" down to the "details." But sometimes, the details tell you something important about the big picture.
- The Innovation: The authors added a Backward Fusion Module. Think of this as a feedback loop. If the "polishing" stage realizes a detail is wrong, it can send a message back up to the "rough draft" stage to fix the initial understanding. It's like an editor telling the writer, "Wait, you described the chair as red, but the details show it's blue; let's fix the main description." This ensures the final result is consistent and accurate.
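The feedback loop can be pictured as sending corrected coarse-level information back to every original point. The paper's BFM details aren't given here; this toy version uses nearest-neighbor copying plus a mixing layer, with all names and shapes being my own assumptions:

```python
import numpy as np

def backward_fusion(fine_feats, fine_pts, coarse_feats, coarse_pts, w):
    """Feed corrected 'big picture' features back to every original point:
    each fine point copies the feature of its nearest coarse point, then the
    two views are concatenated and mixed into a consistent result."""
    d = np.linalg.norm(fine_pts[:, None, :] - coarse_pts[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)                 # which coarse point 'speaks' to each dot
    feedback = coarse_feats[nearest]           # (n_fine, c) message from the draft stage
    fused = np.concatenate([fine_feats, feedback], axis=-1)
    return np.maximum(fused @ w, 0)

rng = np.random.default_rng(3)
fine_pts = rng.normal(size=(1024, 3))
fine_f = rng.normal(size=(1024, 8))
coarse_pts = fine_pts[:256]                    # pretend these survived abstraction
coarse_f = rng.normal(size=(256, 8))
out = backward_fusion(fine_f, fine_pts, coarse_f, coarse_pts, rng.normal(size=(16, 8)))
```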
Why Does This Matter?
The authors tested their new system (HPENet) on seven different datasets, from recognizing 3D objects to mapping entire rooms.
- It's Faster: It runs about 2.2 times faster than the previous best models.
- It's Smarter: It is more accurate at identifying objects and their parts.
- It's Efficient: It uses significantly less computer power (FLOPs), meaning it could run on a phone or a robot's onboard computer rather than needing a massive supercomputer.
In a Nutshell:
The authors took the messy, slow process of teaching robots to see 3D worlds and organized it into a clear "Draft then Polish" workflow. They gave the robot a super-precise 3D compass (HPE), let the workers talk across the whole room instead of just to neighbors (Non-Local MLPs), and added a feedback loop to catch mistakes (BFM). The result is a system that sees 3D worlds faster, cheaper, and more accurately than ever before.