BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds

BuildAnyPoint is a novel generative framework that leverages a Loosely Cascaded Diffusion Transformer and autoregressive mesh generation to reconstruct structured 3D building abstractions from diverse and sparse point clouds, achieving superior surface accuracy and distribution uniformity compared to prior methods.

Tongyan Hua, Haoran Gong, Yuan Liu, Di Wang, Ying-Cong Chen, Wufan Zhao

Published 2026-03-02

Imagine you are an architect trying to rebuild a detailed model of a city, but you only have a messy, incomplete pile of sand to work with. Sometimes the sand is spread out evenly (like a high-quality laser scan), sometimes it's clumped in weird spots (like a scan pieced together from drone photos), and sometimes it's so sparse you can barely see the shape of the buildings at all.

This is the problem BuildAnyPoint solves. It's a new AI system that can take these messy piles of 3D "sand" (called point clouds) and turn them into clean, structured, artist-quality 3D building models.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Messy Sand" Dilemma

Previously, if you wanted to turn a 3D scan of a building into a clean model, you needed the scan to be perfect.

  • Old Method A: If the scan was too messy, the AI would guess wrong and make a weird, jagged mess.
  • Old Method B: If the scan was too sparse (like a few dots in the air), the AI would just give up or force the building into a rigid, boxy shape that didn't look real.

It was like trying to bake a cake with a recipe that only works if you have exactly 100% of the ingredients. If you were missing flour or had too much sugar, the cake would fail.

2. The Solution: A Two-Step "Magic Kitchen"

BuildAnyPoint is like a master chef who doesn't just follow a recipe; they imagine what the cake should look like based on the crumbs they have. It uses a two-step process called Loca-DiT (a fancy name for a "Loosely Cascaded Diffusion Transformer").
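The "loosely cascaded" idea can be sketched in a few lines of Python. Everything below is illustrative scaffolding (the function names and the fake densifier are my own placeholders, not the paper's API); the point is just the hand-off from Step 1 to Step 2:

```python
import numpy as np

def recover_dense_cloud(sparse_points):
    # Placeholder for Step 1 (the diffusion "clarity lens"):
    # here we just jitter copies of the input to fake densification.
    reps = np.repeat(sparse_points, 8, axis=0)
    return reps + np.random.default_rng(0).normal(0.0, 0.01, reps.shape)

def generate_mesh(dense_points):
    # Placeholder for Step 2 (the autoregressive "sculptor"):
    # a real model would emit vertices and faces token by token.
    return {"vertices": dense_points, "faces": []}

def build_any_point(sparse_points):
    """Loosely cascaded pipeline: densify first, then mesh."""
    dense = recover_dense_cloud(sparse_points)  # Step 1
    return generate_mesh(dense)                 # Step 2
```

The key design choice is that the two stages only communicate through the dense point cloud, so either stage can be improved or swapped without retraining the other.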

Step 1: The "Clarity Lens" (Recovering the Shape)

First, the AI looks at your messy pile of sand (the input point cloud).

  • The Analogy: Imagine looking at a foggy window. You can see vague shapes, but no details. The AI acts like a magical wiper that clears the fog.
  • What it does: It uses a Diffusion Model (think of it as a "denoising" engine). It takes your sparse, noisy dots and "hallucinates" the missing parts to create a dense, perfect cloud of points. It's like filling in the missing pieces of a puzzle based on the picture on the box. Now, instead of a few scattered dots, you have a solid, smooth cloud of points that perfectly outlines the building.
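Here is a toy sketch of what diffusion-style denoising does to a point cloud. The "denoiser" is a stand-in that just pulls noisy points toward their nearest input point; in the real system this would be a learned network (the paper's Diffusion Transformer), but the start-from-noise, refine-step-by-step loop is the same idea:

```python
import numpy as np

def densify_points(sparse_points, num_dense=1024, steps=50, seed=0):
    """Toy diffusion-style densification: start from pure Gaussian
    noise and iteratively denoise toward the sparse input cloud."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_dense, 3))  # start from pure noise
    for t in range(steps, 0, -1):
        alpha = t / steps  # noise weight shrinks as t decreases
        # Toy "denoiser": snap each noisy point to its nearest sparse
        # input point. A learned model would instead predict plausible
        # surface detail between the inputs.
        dist2 = ((x[:, None, :] - sparse_points[None, :, :]) ** 2).sum(-1)
        nearest = sparse_points[np.argmin(dist2, axis=1)]
        x = alpha * x + (1 - alpha) * nearest  # blend toward the target
    return x
```

Note the limitation of this toy version: it collapses onto the input points instead of hallucinating new surface detail between them, which is exactly what the learned model adds.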

Step 2: The "Sculptor's Hand" (Building the Mesh)

Once the AI has that perfect, dense cloud of points, it moves to the second step.

  • The Analogy: Now that the chef has a perfect bowl of batter (the dense points), they need to pour it into a mold to make the cake.
  • What it does: It uses an Autoregressive Transformer (think of it as a very smart, step-by-step builder). It looks at the dense points and says, "Okay, this part is a wall, this part is a slanted roof, and this is a window." It then builds a mesh (a wireframe skin made of triangles) over those points. Because the points were already cleaned up in Step 1, the mesh comes out smooth, low-poly (efficient), and looks like something a human artist designed.
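To make "step-by-step builder" concrete, here is a hypothetical serialization of a mesh into the kind of integer token stream an autoregressive transformer consumes: quantize each vertex to a coarse grid, then emit each triangle as 9 tokens (x, y, z for each of its 3 corners). The token layout is my illustration, not the paper's actual scheme:

```python
import numpy as np

def mesh_to_tokens(vertices, faces, grid=128):
    """Flatten a triangle mesh into integer tokens for an
    autoregressive model: each face becomes 9 quantized coordinates."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    scale = np.maximum(hi - lo, 1e-9)  # avoid division by zero on flat axes
    q = np.clip(((vertices - lo) / scale * (grid - 1)).round(),
                0, grid - 1).astype(int)
    tokens = []
    # Sort faces by their lowest vertex index so the generation
    # order is canonical (the model predicts one token at a time).
    for f in sorted(faces, key=lambda f: tuple(sorted(f))):
        for v in f:
            tokens.extend(q[v].tolist())
    return tokens
```

Generation is the reverse of this: the transformer predicts the token stream one integer at a time, conditioned on the dense point cloud from Step 1, and the stream is decoded back into vertices and faces.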

3. Why is this a Big Deal?

  • It's Flexible: Whether you feed it a high-quality laser scan, a noisy point cloud reconstructed from drone photos, or a very sparse scan taken from far away, it works the same way. It doesn't care how the data was collected; it just fixes the data first.
  • It's Smart: Unlike older methods that tried to force the building into a rigid box, this AI understands the "vibe" of a building. It knows that roofs are usually slanted and walls are usually straight, even if the input data is missing those details.
  • It's a Bridge: It bridges the gap between "raw data" (messy dots) and "digital twins" (clean 3D models used for navigation, disaster planning, and video games).

Summary Analogy

Imagine you have a torn, dirty photograph of a house.

  • Old AI: Tries to trace the lines directly. If the photo is torn, the lines are broken, and the result looks like a glitchy mess.
  • BuildAnyPoint: First, it uses AI to repair the photo, filling in the tears and cleaning the dirt until you have a pristine, high-resolution image of the house. Then, it uses that perfect image to draw a clean, professional architectural blueprint.

By fixing the "image" (the point cloud) before drawing the "blueprint" (the mesh), BuildAnyPoint can handle almost any kind of input data and produce beautiful, usable 3D buildings.
