UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction

Imagine you are trying to build a perfect, life-sized digital twin of an entire city using only thousands of photos taken from different angles. This is the dream of 3D reconstruction.

For a long time, the best tools for this job were like trying to build a city out of fog. They looked good from a distance, but if you walked up close, the buildings were blurry, the roads were wobbly, and the memory on your computer would explode, causing the whole thing to crash.

Enter UrbanGS. Think of it as a new, super-smart construction crew that can build a massive, hyper-realistic city model without running out of space or time. Here is how they did it, explained through simple analogies:

1. The Problem: The "Foggy City"

Previous methods (like standard 3D Gaussian Splatting) are great for small rooms or single objects. But when you try to scale them up to a whole city, two things go wrong:

The Geometry is Wobbly: The buildings look like they are melting. The computer knows what color the wall is, but it doesn't know exactly where the wall is in 3D space.
The Memory Explosion: To make the city look detailed, the computer scatters millions of tiny blobs (called Gaussians) everywhere, even in empty sky or on far-away mountains. This is like trying to fill a swimming pool with a garden hose that is also watering the whole neighborhood: most of the water goes where it isn't needed. Your computer runs out of memory (OOM) and crashes.

2. The Solution: UrbanGS

The authors of this paper built a framework called UrbanGS that fixes these issues using three main "superpowers."

Superpower #1: The "Double-Check" GPS (Depth-Consistent D-Normal Regularization)

Imagine you are blindfolded and trying to find a wall.

Old Method: Someone tells you, "The wall is to your left." You turn left, but you don't know how far away it is. You might walk right into it or stop too far away.
UrbanGS Method: They use two guides.
1. The Normal Guide: Tells you the direction of the wall (like a compass).
2. The Depth Guide: Tells you exactly how far the wall is (like a laser rangefinder).

UrbanGS combines these two. It forces the computer to not only get the direction right but also the distance right. It's like having a GPS that corrects your steering and your speed simultaneously. This ensures the buildings are straight, the roads are flat, and the geometry is rock-solid.

Superpower #2: The "Smart Gardener" (Spatially Adaptive Gaussian Pruning)

Imagine you are painting a giant mural of a city.

Old Method: You paint every single brick on every building, even the ones on the far horizon that no one will ever see clearly. You also paint the empty sky with the same density as the busy downtown. This wastes a ton of paint (memory) and time.
UrbanGS Method: They hired a "Smart Gardener."
- If a part of the city is complex (like a busy intersection with trees and cars), the gardener plants dense, detailed Gaussians.
- If a part is simple (like a clear blue sky or a distant mountain), the gardener prunes (cuts away) the extra Gaussians.
- The Result: The computer only spends its energy where it matters. It keeps the high-definition details for the things you look at and removes the clutter from the background. This stops the computer from running out of memory.

Superpower #3: The "Puzzle Master" (Partitioning Strategy)

Building a city on one computer is like trying to solve a 10,000-piece puzzle on a tiny coffee table. It's impossible.

Old Method: They tried to force the whole puzzle onto the table, or they cut it into pieces but left gaps where the pieces didn't fit together, creating ugly cracks in the middle of the city.
UrbanGS Method: They cut the city into manageable "neighborhoods" (blocks) and assigned a different computer (GPU) to build each one.
- The Secret Sauce: They made sure the "fences" between the neighborhoods overlap slightly. This way, when the computers stitch the neighborhoods back together, there are no cracks or seams. It's like building a city block by block, but ensuring the sidewalks connect perfectly so you can walk from one block to the next without tripping.

The Result

When you put all these together, UrbanGS can:

Build Faster: It trains in about 2 hours on a benchmark scene, where earlier approaches can take many hours or even days.
Look Better: The buildings are sharp, the roads are smooth, and the details (like trees and windows) are crisp.
Run Smaller: It fits on workstation GPUs (RTX A5000) without crashing, whereas some older methods run out of memory on the same hardware.

In a Nutshell

If previous methods were like trying to build a city with a leaky bucket and a blurry map, UrbanGS is like having a team of architects with laser scanners, smart pruning shears, and a perfect puzzle-solving strategy. They built a digital city that is so accurate and efficient, it feels like you could actually walk through it.

1. Problem Statement

While 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering for bounded scenes, its application to large-scale urban environments faces three critical bottlenecks:

Geometric Inconsistency: Standard 3DGS struggles to accurately model surfaces in complex city scenes. Existing methods that supervise rendered normals often fail to update the position parameters of Gaussians effectively, leading to floating artifacts and misaligned structures.
Memory and Scalability: Urban scenes contain millions of Gaussian primitives. Naive pruning strategies (designed for small objects) either oversimplify local details or fail to reduce memory usage, often causing "Out of Memory" (OOM) errors on standard GPUs (e.g., RTX A5000) when training on city-scale data.
Boundary Artifacts: Existing block-wise partitioning strategies often introduce visible discontinuities at block boundaries and suffer from computational inefficiencies due to processing irrelevant views.

2. Methodology

The authors propose UrbanGS, a unified framework designed to address these challenges through three core components:

A. Depth-Consistent D-Normal Regularization

To solve the issue of incomplete geometric updates, the authors introduce a novel regularization framework that moves beyond simple normal supervision.

The Limitation of Standard Normal Supervision: Supervising rendered normals directly updates rotation but struggles to update Gaussian positions because the normal vector is derived solely from rotation parameters.
The D-Normal Solution: Instead of direct normal supervision, UrbanGS renders a Depth Map and computes D-Normals ( $N^d$ ) by taking the cross-product of the spatial gradients of the depth map.
Dual Supervision Mechanism:
1. D-Normal Constraint: The computed D-Normals are supervised against pseudo-normal priors (from a pretrained model such as DSINE). Because D-Normals are derived from depth, and rendered depth depends on where the Gaussians sit, this constraint forces the position parameters of the Gaussians to update so that they align with the surface.
2. Pseudo Depth Supervision: To ensure the depth maps used for D-Normal calculation are accurate, a separate "Pseudo Depth" (from DepthAnything-v2) supervises the rendered depth map directly.
Adaptive Confidence Weighting: To handle unreliable depth predictions in complex regions, an adaptive confidence weight ( $w_d$ ) is applied. This weight is based on gradient consistency and inverse depth deviation, suppressing supervision in high-error regions to prevent geometric distortion.
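
To make the D-Normal idea concrete, here is a minimal sketch of computing normals from a depth map. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, back-projects each pixel to a 3D point, and takes the cross product of the horizontal and vertical finite differences. The function names, the central-difference scheme, and the sign convention are illustrative assumptions, not the authors' implementation; plain Python lists are used only to keep the example dependency-free.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Lift each pixel (u, v) with depth z to a camera-space point (x, y, z)."""
    return [[((u - cx) * z / fx, (v - cy) * z / fy, z)
             for u, z in enumerate(row)]
            for v, row in enumerate(depth)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length (zero vector stays zero)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else (0.0, 0.0, 0.0)

def depth_to_normals(depth, fx, fy, cx, cy):
    """Normals for interior pixels from central differences of the point map.

    Border pixels are left as zero vectors. With this ordering of the cross
    product, a fronto-parallel plane gets the normal (0, 0, 1); swap the
    arguments of cross() for the opposite (camera-facing) convention.
    """
    pts = backproject(depth, fx, fy, cx, cy)
    h, w = len(depth), len(depth[0])
    normals = [[(0.0, 0.0, 0.0)] * w for _ in range(h)]
    for v in range(1, h - 1):
        for u in range(1, w - 1):
            du = tuple(a - b for a, b in zip(pts[v][u + 1], pts[v][u - 1]))
            dv = tuple(a - b for a, b in zip(pts[v + 1][u], pts[v - 1][u]))
            normals[v][u] = normalize(cross(du, dv))
    return normals
```

Because every normal here is a function of the depth values, any loss that compares these normals with a prior depends on the rendered depth, and therefore on Gaussian positions, which is the property the section describes.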

B. Spatially Adaptive Gaussian Pruning (SAGP)

To manage memory and redundancy in city-scale scenes, UrbanGS introduces a pruning strategy that is aware of local geometric complexity, rather than relying on global thresholds.

Local Voxel Partitioning: The scene is divided into volumetric cells.
Multi-Factor Importance Score: For each Gaussian, an importance score ( $S_i$ ) is calculated as the product of three normalized factors:
1. Ray-Intersection Frequency: How often the Gaussian is hit by training rays (visibility).
2. Opacity: The learnable opacity parameter.
3. Sub-linear Volume Weight: A weight based on the local volume distribution within the voxel cell, which amplifies fine-scale structures while suppressing oversized background Gaussians.
Progressive Pruning: Pruning occurs at specific training intervals (7k, 15k, 25k iterations) to progressively remove redundant primitives while preserving critical geometric details.
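
The scoring and pruning steps above can be sketched as follows. The exact normalization is not given here, so this sketch makes several labeled assumptions: hit counts are normalized by the global maximum, the volume weight is `(volume / largest volume in the cell) ** beta` with a hypothetical `beta = 0.5`, and pruning drops a fixed fraction of the lowest-scoring Gaussians inside each voxel. The dictionary keys (`pos`, `hits`, `opacity`, `volume`) are illustrative names, and the input list is assumed non-empty with positive volumes.

```python
import math
from collections import defaultdict

# Iterations at which pruning is triggered, as listed in the text.
PRUNE_ITERS = (7_000, 15_000, 25_000)

def voxel_key(pos, voxel_size):
    """Integer index of the voxel cell containing a 3D position."""
    return tuple(math.floor(c / voxel_size) for c in pos)

def importance_scores(gaussians, voxel_size, beta=0.5):
    """Score = normalized hits * opacity * sub-linear volume weight.

    The volume weight is normalized per voxel (assumption), so it lies in
    (0, 1] and the exponent beta < 1 lifts small Gaussians relative to a
    linear weight.
    """
    cells = defaultdict(list)
    for i, g in enumerate(gaussians):
        cells[voxel_key(g["pos"], voxel_size)].append(i)
    max_hits = max(g["hits"] for g in gaussians) or 1
    scores = [0.0] * len(gaussians)
    for idxs in cells.values():
        ref_vol = max(gaussians[i]["volume"] for i in idxs)
        for i in idxs:
            g = gaussians[i]
            vol_w = (g["volume"] / ref_vol) ** beta
            scores[i] = (g["hits"] / max_hits) * g["opacity"] * vol_w
    return scores, cells

def keep_flags(gaussians, voxel_size, ratio, beta=0.5):
    """Mark the lowest-scoring `ratio` fraction in each voxel for removal."""
    scores, cells = importance_scores(gaussians, voxel_size, beta)
    keep = [True] * len(gaussians)
    for idxs in cells.values():
        n_drop = int(len(idxs) * ratio)
        for i in sorted(idxs, key=lambda j: scores[j])[:n_drop]:
            keep[i] = False
    return keep
```

In a training loop, one would call `keep_flags` when the iteration counter is in `PRUNE_ITERS` and discard the Gaussians flagged `False`. Ranking inside each voxel, rather than against one global threshold, is what keeps sparse but important regions from being emptied out.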

C. Unified Partitioning and View Assignment

Boundary Preservation: To eliminate fusion artifacts, common Gaussian primitives at the boundaries of sub-blocks are duplicated across adjacent blocks.
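
As a rough illustration of boundary duplication, the sketch below places each Gaussian into every block whose bounds, expanded by a margin, contain it. The regular 2D ground-plane grid, the `block_size` and `margin` parameters, and the function names are assumptions made for this example; the text does not specify how the shared boundary Gaussians are selected.

```python
import math
from collections import defaultdict

def blocks_for_point(x, y, block_size, margin):
    """All grid blocks whose margin-expanded extent contains (x, y)."""
    xs = range(math.floor((x - margin) / block_size),
               math.floor((x + margin) / block_size) + 1)
    ys = range(math.floor((y - margin) / block_size),
               math.floor((y + margin) / block_size) + 1)
    return [(i, j) for i in xs for j in ys]

def partition_with_overlap(positions, block_size, margin):
    """Map block index -> list of Gaussian indices, duplicating near borders."""
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        for key in blocks_for_point(x, y, block_size, margin):
            blocks[key].append(idx)
    return dict(blocks)
```

A Gaussian well inside a block appears once, one near an edge appears in two blocks, and one near a corner appears in four, so neighboring blocks see the same primitives along their shared border.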
Smart View Assignment: Camera views are assigned to blocks based on two criteria:
1. Geometric Proximity: Whether the camera physically occupies the block.
2. Perceptual Contribution: Whether removing the block's Gaussians significantly degrades the rendered image (measured by SSIM).
This ensures that only relevant views are processed for each block, optimizing computational load.
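
The two assignment criteria can be combined in a few lines. In this sketch, `ssim_without_block` is a hypothetical callable that returns the SSIM between a view's full render and its render with the block's Gaussians removed; the axis-aligned block bounds and the `ssim_drop_threshold` value are likewise assumptions, not values from the paper.

```python
def camera_in_block(cam_pos, block_min, block_max):
    """True if the camera position lies inside the axis-aligned block."""
    return all(lo <= c < hi for c, lo, hi in zip(cam_pos, block_min, block_max))

def assign_views(cameras, blocks, ssim_without_block, ssim_drop_threshold=0.05):
    """Assign each camera to every block it occupies or visibly depends on.

    cameras: {camera_name: (x, y, z)}
    blocks:  {block_name: (min_corner, max_corner)}
    ssim_without_block(camera_name, block_name) -> SSIM in [0, 1] between the
        full render and the render with that block's Gaussians removed.
    """
    assignment = {name: [] for name in blocks}
    for cam, pos in cameras.items():
        for name, (lo, hi) in blocks.items():
            if camera_in_block(pos, lo, hi):
                assignment[name].append(cam)  # geometric proximity
            elif 1.0 - ssim_without_block(cam, name) > ssim_drop_threshold:
                assignment[name].append(cam)  # perceptual contribution
    return assignment
```

Checking containment first avoids the two renders needed for the SSIM test whenever a camera already sits inside the block.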

3. Key Contributions

Depth-Consistent D-Normal Regularizer: A theoretical and practical breakthrough that enables the holistic optimization of both rotation and position parameters of 3D Gaussians, solving the "floating artifact" problem in large-scale reconstruction.
Spatially Adaptive Gaussian Pruning (SAGP): The first pruning framework specifically designed for city-scale 3DGS, which dynamically adjusts density based on local complexity, significantly reducing memory usage without sacrificing detail.
Robust Confidence Mechanism: An adaptive weighting strategy that mitigates the impact of noisy depth priors, ensuring stable multi-view geometric alignment.
Seamless Partitioning Scheme: A unified approach to block-wise training that eliminates boundary artifacts and optimizes view assignment, enabling efficient parallel training on multiple GPUs.

4. Experimental Results

The method was evaluated on multiple large-scale datasets (Mill-19, UrbanScene3D, GauU-Scene) and compared against state-of-the-art methods like CityGS-v2, VCR-GauS, and 2DGS.

Geometric Accuracy: UrbanGS achieves state-of-the-art (SOTA) performance in surface reconstruction. On the GauU-Scene dataset, it outperforms CityGS-v2 and CityGS-X in F1-score (e.g., 0.503 vs. 0.492 on Modern Building) and produces cleaner, more detailed meshes.
Rendering Quality: It achieves superior PSNR and SSIM scores in novel view synthesis, with reduced floating artifacts compared to baselines.
Efficiency & Scalability:
- Memory: UrbanGS successfully reconstructs large scenes on 8x NVIDIA RTX A5000 GPUs, whereas competitors like VCR-GauS fail due to OOM errors.
- Training Time: It is significantly faster, completing training on the "Rubble" scene (from Mill-19) in ~2 hours and 10 minutes, versus many hours or days for NeRF-based methods and slower 3DGS variants.
- Model Size: SAGP reduces the number of Gaussians by ~30-40% compared to non-pruned baselines while maintaining quality.
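
For readers unfamiliar with the metric, the F1-score reported for surface reconstruction is conventionally the harmonic mean of precision and recall under a distance threshold $\tau$. The definition below is the standard one; the threshold used in these experiments is not stated in this summary, so $\tau$ is left symbolic.

$$
F_1 = \frac{2 \, P \, R}{P + R}
$$

Here $P$ is the fraction of reconstructed points lying within $\tau$ of the ground-truth surface, and $R$ is the fraction of ground-truth points lying within $\tau$ of the reconstruction. A higher value means the mesh is both accurate and complete.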

5. Significance

UrbanGS represents a major step forward in large-scale 3D scene reconstruction. By theoretically proving and practically demonstrating that depth-normal constraints are necessary for position updates, it resolves a fundamental limitation of current 3DGS methods. The framework provides a systematic solution for high-fidelity, memory-efficient, and geometrically accurate reconstruction of complex urban environments, making it viable for applications in digital twins, autonomous driving simulation, and VR/AR where both visual fidelity and geometric correctness are critical.