CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis

Imagine you have a 360-degree camera that takes a single, all-around photo of a room. Now, imagine you want to step inside that photo and look around from a completely different angle, as if you were actually there. This is called Novel View Synthesis.

For a long time, doing this with just one or a few photos was like trying to build a 3D house out of a single 2D blueprint: you'd end up with holes in the walls and a roof that didn't quite fit.

Enter CylinderSplat, a new AI method that solves this problem. Here is how it works, explained through simple analogies.

1. The Problem: The "Flat Map" vs. The "Round World"

Most 3D computer vision tools are built like flat maps (Cartesian grids). They are great for small rooms or pinhole cameras, but when you try to use a flat map to describe a 360-degree world, things get weird.

The Analogy: Imagine trying to wrap a flat piece of paper perfectly around a basketball. You have to stretch the paper at the top and bottom, and it tears or bunches up at the sides.
The Result: Existing AI methods try to force 360-degree photos into these flat grids, leading to blurry, stretched, or distorted images, especially when looking at the floor or ceiling.

2. The Solution: The "Cylindrical Triplane"

The authors of this paper realized that instead of using a flat map, they should use a shape that matches the world: a cylinder.

The Analogy: Think of a tin can or a soda can. If you wrap a label around a soda can, it fits perfectly without stretching or tearing.
The Innovation: They created a new way to store 3D data called a Cylindrical Triplane. Instead of three flat sheets of paper (one for each pair of the X, Y, and Z axes), they use three sheets shaped to fit a cylinder.
- One sheet wraps around the walls (perfect for straight walls in houses).
- One sheet covers the floor and ceiling (perfect for flat ground).
- One sheet is a vertical slice running from the central axis out to the wall (it records how far away things are at each height).
This matches how most real-world buildings are built (the "Manhattan World" assumption), making the math much easier and the results much sharper.

3. The Two-Brain System (Dual-Branch Architecture)

The AI doesn't just use one trick; it uses two "brains" working together to build the 3D scene.

Brain A: The "Pixel Detective" (Pixel Branch)

What it does: This brain looks at the photos you gave it and finds the things it can clearly see. It's like a detective who only reports on the clues that are right in front of their face.
The Limitation: If you only have one photo, the detective can't see what's behind the sofa or in the corner. The 3D model would have big holes.

Brain B: The "Imaginative Architect" (Volume Branch)

What it does: This brain uses the Cylindrical Triplane to fill in the blanks. It looks at the empty spaces and uses its knowledge of how rooms usually look to "hallucinate" (guess) what should be there.
The Magic: Because it's using the cylindrical shape, it guesses the walls and floors correctly, rather than stretching them like a flat map would.

Together: The "Detective" builds the sharp, clear parts of the image, and the "Architect" fills in the dark, hidden corners. The result is a complete, solid 3D world.

4. Why This Matters

Speed: Old methods took hours to build a 3D scene from scratch. CylinderSplat does it in a fraction of a second (feed-forward), like snapping a photo.
Flexibility: It works whether you give it one photo (like a tourist snapshot) or many photos (like a drone flying through a room).
Realism: It handles the tricky parts of 360-degree photos, like the floor and ceiling, without the weird distortions that plague other AI tools.

Summary

Think of CylinderSplat as a master builder who finally figured out that to build a 3D house from a 360-degree photo, you shouldn't use a flat blueprint. Instead, you should use a cylindrical mold that fits the shape of the world perfectly. By combining a sharp-eyed detective with a creative architect, it can instantly turn a flat, 360-degree picture into a room you can walk through, look around, and explore.

1. Problem Statement

The paper addresses Panoramic Novel View Synthesis (NVS) using 3D Gaussian Splatting (3DGS). While 3DGS has revolutionized real-time rendering for pinhole cameras, existing feed-forward methods that adapt it to 360° panoramic imagery face two primary limitations:

Geometric Distortion & Aliasing: Standard volumetric representations (like Cartesian Triplanes) are ill-suited for 360° scenes. They struggle to capture the inherent geometry of panoramic data, leading to severe distortion, especially in "Manhattan-world" environments (indoor/urban scenes with orthogonal walls and floors).
Occlusion and Sparse Views: Existing feed-forward methods often rely on multi-view cost volumes to refine geometry. These approaches fail in sparse-view or single-view scenarios, particularly when large baselines cause significant occlusions, resulting in holes, artifacts, and inaccurate depth in the reconstructed scene.

2. Methodology: CylinderSplat

The authors propose CylinderSplat, a feed-forward framework designed to handle variable numbers of input views (from single to multiple panoramas) using a dual-branch architecture trained via a three-stage curriculum.

A. Core Innovation: Cylindrical Triplane Representation

Instead of the standard Cartesian or Spherical coordinate systems, the method introduces a Cylindrical Triplane representation.

Motivation: Inspired by the "Manhattan-world" assumption, cylindrical coordinates naturally align with the vertical walls and horizontal floors of man-made environments.
Structure: For each input camera, a local cylindrical volume is defined with dimensions $(R, \Theta, Z)$. The Triplane consists of three orthogonal feature planes: $F_{r\theta}$, $F_{\theta z}$, and $F_{zr}$.
Efficiency: This representation compresses the dense 3D feature grid, reducing storage complexity from $O(\Theta \cdot Z \cdot R)$ to $O(\Theta \cdot Z + Z \cdot R + R \cdot \Theta)$.
Advantage: It minimizes distortion at the poles (unlike spherical) and avoids the stretching artifacts of Cartesian grids when projected onto equirectangular panoramas.
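
To make the structure and the storage argument concrete, here is a minimal sketch of how a 3D point could be mapped to a cell on each of the three planes, and how the cell counts compare. It is an illustration, not the authors' code. The function names, the choice of z as the vertical axis, the volume bounds (`r_max`, `z_min`, `z_max`), and the nearest-cell lookup are all assumptions made for brevity.

```python
import math


def cartesian_to_cylindrical(x, y, z):
    """Map a Cartesian point to (r, theta, z) with theta in [0, 2*pi)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2.0 * math.pi)
    return r, theta, z


def triplane_indices(point, r_max, z_min, z_max, R, Theta, Z):
    """Return the cell index of `point` on each of the three feature planes.

    Nearest-cell lookup keeps the sketch short; a real model would more
    likely interpolate between neighbouring cells.
    """
    r, theta, z = cartesian_to_cylindrical(*point)
    if not (r < r_max and z_min <= z < z_max):
        return None  # the point lies outside the local cylindrical volume
    i_r = min(int(r / r_max * R), R - 1)
    i_t = int(theta / (2.0 * math.pi) * Theta) % Theta  # theta wraps around
    i_z = min(int((z - z_min) / (z_max - z_min) * Z), Z - 1)
    return {"r_theta": (i_r, i_t), "theta_z": (i_t, i_z), "z_r": (i_z, i_r)}


def storage_cells(R, Theta, Z):
    """Number of cells in a dense cylindrical grid versus the three planes."""
    dense = R * Theta * Z
    triplane = Theta * Z + Z * R + R * Theta
    return dense, triplane
```

With a hypothetical resolution of R = 64, Θ = 256, Z = 32, `storage_cells` gives 524,288 cells for the dense grid against 26,624 for the three planes, roughly a 20-fold reduction.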

B. Dual-Branch Architecture

Pixel Branch (Observation):
- Function: Reconstructs well-observed regions using an attention-based mechanism (self-attention within frames, cross-attention across frames).
- Process: It aggregates multi-view context to predict a refined depth map and feature map, unprojecting pixels into a 3D feature point cloud ($P_{feat}$) to generate high-quality Gaussians ($G_{pixel}$).
- Limitation: It fails in occluded or sparsely viewed areas, leaving holes in the reconstruction.
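
The unprojection step in the Pixel Branch can be sketched as follows for an equirectangular panorama. This is a sketch under stated assumptions rather than the paper's implementation: the function names are hypothetical, depth is taken to be distance along the viewing ray, z is taken as the up axis, and the longitude and latitude conventions are the ones written in the docstring.

```python
import math


def unproject_equirect(u, v, depth, width, height):
    """Lift pixel (u, v) with a ray depth to a 3D point in the camera frame.

    Assumed convention: longitude runs from -pi (left) to pi (right),
    latitude runs from pi/2 (top) to -pi/2 (bottom), and z points up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    direction = (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )
    return tuple(depth * d for d in direction)


def unproject_depth_map(depth_map):
    """Turn an H x W nested list of depths into a flat list of 3D points."""
    height = len(depth_map)
    width = len(depth_map[0])
    return [
        unproject_equirect(u, v, depth_map[v][u], width, height)
        for v in range(height)
        for u in range(width)
    ]
```

In the pipeline described above, each resulting point would also carry its pixel's feature vector, which is what makes the output a feature point cloud rather than a plain one.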
Volume Branch (Completion):
- Function: Completes the geometry in occluded regions using the Cylindrical Triplane.
- Process:
  - Initialization: Local Triplanes are initialized with features from the Pixel Branch falling within their volume.
  - Refinement: Uses Cross-Plane Attention to exchange information between the three feature planes and Triplane-to-Image Attention to incorporate visual evidence from source images.
  - Decoding: Samples a dense grid within the cylindrical volume, queries the refined Triplane features, and uses an MLP to predict local Gaussian parameters (position offsets, anisotropic scales, rotation, opacity).
  - Coordinate Transformation: A critical step involves transforming local cylindrical attributes (offsets and scales) into global Cartesian coordinates using a Jacobian matrix to ensure correct rendering in the standard 3DGS rasterizer.
  - RGB Retrieval: To ensure photorealistic colors, the method employs an RGB Retrieval mechanism. It projects Gaussian centers into source views, computes visibility scores based on depth priors, and aggregates colors from the most visible, unoccluded views.
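
The coordinate transformation step can be illustrated with the standard Jacobian of the cylindrical-to-Cartesian map $x = r\cos\theta$, $y = r\sin\theta$, $z = z$. The sketch below is an assumption about how such a transform could look, not the paper's exact formulation. The helper names are hypothetical, z is assumed to be the cylinder axis, and the predicted rotation is ignored so that the scales are treated as aligned with the local $(r, \theta, z)$ directions.

```python
import math


def cylindrical_jacobian(r, theta):
    """Jacobian d(x, y, z) / d(r, theta, z) for x = r cos(theta), y = r sin(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c, -r * s, 0.0],
        [s, r * c, 0.0],
        [0.0, 0.0, 1.0],
    ]


def offset_to_cartesian(r, theta, offset_cyl):
    """First-order map of a small (dr, dtheta, dz) offset to (dx, dy, dz)."""
    jac = cylindrical_jacobian(r, theta)
    return [sum(jac[i][k] * offset_cyl[k] for k in range(3)) for i in range(3)]


def covariance_to_cartesian(r, theta, scales_cyl):
    """Cartesian covariance J diag(s^2) J^T for scales given along (r, theta, z)."""
    jac = cylindrical_jacobian(r, theta)
    var = [s * s for s in scales_cyl]
    return [
        [sum(jac[i][k] * var[k] * jac[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

The factor of $r$ in the second column is the point of the exercise: the same angular extent covers more Cartesian distance far from the axis than close to it, so angular offsets and scales have to be rescaled before a Cartesian rasterizer can use them. The resulting covariance could then be decomposed into the rotation and scale that a 3DGS rasterizer consumes.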

C. Training Curriculum

The model is trained in three stages to ensure stability and performance:

Stage 1: Train the Pixel Branch only (establishes a high-quality baseline for visible regions).
Stage 2: Freeze the Pixel Branch and train the Volume Branch (learns to complete occluded geometry).
Stage 3: Jointly fine-tune both branches to merge details and completeness into a single high-fidelity scene.
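
One way to express this schedule in code is as a small table of which branch is trainable at each stage. The sketch below is purely illustrative: the `Stage` class, the stage names, and the `trainable_parameters` helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_pixel: bool   # update Pixel Branch weights in this stage?
    train_volume: bool  # update Volume Branch weights in this stage?


CURRICULUM = (
    Stage("stage1_pixel_only", train_pixel=True, train_volume=False),
    Stage("stage2_volume_only", train_pixel=False, train_volume=True),
    Stage("stage3_joint", train_pixel=True, train_volume=True),
)


def trainable_parameters(stage, pixel_params, volume_params):
    """Select the parameters an optimizer should update in `stage`."""
    params = []
    if stage.train_pixel:
        params.extend(pixel_params)
    if stage.train_volume:
        params.extend(volume_params)
    return params
```

In a training loop, the list returned for the current stage would be handed to the optimizer, so that the Pixel Branch stays frozen throughout Stage 2.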

3. Key Contributions

Cylindrical Triplane Representation: A novel geometric representation specifically designed for panoramic 3DGS that adheres to the Manhattan-world assumption, significantly reducing distortion compared to Cartesian or Spherical alternatives.
Dual-Branch Feed-Forward Framework: A robust architecture that combines pixel-based reconstruction for observed areas with volume-based completion for occluded areas, enabling flexible handling of single or multiple input views.
Direct Panoramic Rendering: The use of a specialized 3DGS rasterizer that renders full equirectangular images in a single pass, avoiding the inefficiencies of stitching cubemaps.
State-of-the-Art Performance: Demonstrated superior results in both reconstruction quality (WS-PSNR, SSIM) and geometric accuracy (PCC) across synthetic and real-world datasets.
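
For readers unfamiliar with the metrics named above, the sketch below shows how they are commonly computed. WS-PSNR weights each pixel's squared error by the cosine of its latitude, so the heavily oversampled rows near the top and bottom of an equirectangular image count for less. PCC is assumed here to mean the Pearson correlation coefficient between predicted and reference depth; the summary does not expand the abbreviation. The functions work on single-channel images stored as nested lists and are not the paper's evaluation code.

```python
import math


def ws_psnr(img_a, img_b, max_value=1.0):
    """Weighted-to-spherically-uniform PSNR for equirectangular images."""
    height = len(img_a)
    weighted_err = 0.0
    weight_sum = 0.0
    for i, (row_a, row_b) in enumerate(zip(img_a, img_b)):
        # Row weight: cosine of the latitude at the centre of row i.
        w = math.cos((i + 0.5 - height / 2.0) * math.pi / height)
        for a, b in zip(row_a, row_b):
            weighted_err += w * (a - b) ** 2
            weight_sum += w
    wmse = weighted_err / weight_sum
    if wmse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / wmse)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Because Pearson correlation is unchanged by positive scaling and shifting, a PCC close to 1 indicates that the predicted depth has the right structure even if its absolute scale is off.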

4. Experimental Results

The method was evaluated on synthetic datasets (Matterport3D, Replica, Residential) and a real-world dataset (360Loc), as well as a large-scale Kansas dataset.

Quantitative Performance:
- Single-View & Two-View: CylinderSplat outperforms SOTA methods (PanSplat, Splatter360, OmniScene, MVSplat) across all metrics. For example, on Matterport3D (2.0m baseline), it achieved a PCC of 0.851 vs. 0.732 for OmniScene, and WS-PSNR of 23.76 vs. 22.75.
- Wide-Baseline Scenarios: In extreme sparse-view scenarios (e.g., 20m–30m baselines in the Kansas dataset), the performance gap widens significantly, with CylinderSplat outperforming the best competitor by +3.95 dB in WS-PSNR.
- Geometry: The method shows superior depth consistency, particularly on floors and ceilings where other methods produce artifacts.
Ablation Studies:
- Coordinate System: Cylindrical Triplanes significantly outperformed both Cartesian and Spherical variants.
- Training Strategy: The three-stage curriculum yielded better results than end-to-end training or using branches in isolation.
- RGB Retrieval: Essential for recovering high-frequency details that volume features alone miss.
Efficiency:
- The model is lightweight (13.6M parameters) and faster than competitors (0.29s inference time vs. 0.32s–0.54s for others).
- It scales effectively to 3 and 4 input views without architectural changes.

5. Significance

CylinderSplat represents a significant advancement in panoramic 3D reconstruction by bridging the gap between the efficiency of feed-forward 3DGS and the geometric complexities of 360° imagery.

Practical Impact: It enables high-fidelity, real-time novel view synthesis from sparse inputs (even single views), which is crucial for VR/AR applications, autonomous driving, and digital twins.
Theoretical Contribution: It challenges the dominance of Cartesian representations in 3DGS by demonstrating that coordinate systems aligned with scene priors (Manhattan-world) and camera geometry (cylindrical) yield superior results.
Robustness: The framework's ability to handle dynamic scenes (via specific initialization strategies) and extreme baselines makes it a robust solution for real-world deployment where data is often incomplete or noisy.