JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

Imagine you are walking through a giant, complex building with a robot companion. You want to tell the robot, "Find me the dusty old chair in the corner," or "Show me all the pipes running along the ceiling."

For a long time, robots have struggled with this. They are like students who only memorized a specific list of words (like "chair," "table," "door"). If you ask for something not on their list, or if the room is viewed from a weird angle (like a 360-degree fisheye lens), they get confused. They also can't easily connect what they see in a flat photo to the actual 3D world around them.

JOPP-3D is a new "brain" for robots that solves this problem. Think of it as a universal translator that understands both flat photos and 3D space simultaneously, and it speaks the language of human conversation.

Here is how it works, broken down into simple concepts:

1. The "Unfolding" Trick (Tangential Decomposition)

Imagine you have a giant, round balloon covered in a picture of a room (a 360-degree panoramic photo). If you try to flatten that balloon onto a piece of paper, the edges get stretched and distorted, like a map of the world where Greenland looks huge.

Old robots tried to read these stretched maps and got confused. JOPP-3D uses a clever trick: instead of flattening the whole balloon at once, it cuts the balloon into 20 triangular slices (like an icosahedron, a 20-sided die). It flattens each slice individually. Now, instead of a distorted map, the robot sees 20 clear, normal-looking photos of the room. This makes it much easier to recognize objects.

2. The "Ghost Hunter" (3D Instance Extraction)

Once the robot has these clear slices, it needs to figure out where objects actually are in 3D space.

The Problem: In a 3D point cloud (a cloud of dots representing the room), it's hard to tell where one object ends and another begins.
The Solution: JOPP-3D uses a "ghost hunter" tool (based on a technology called SAM). It looks at the 3D dots and says, "Okay, these dots form a chair, and these dots form a wall." It creates invisible 3D "masks" or outlines around every object, even if it has never seen that specific chair before.

3. The "Universal Dictionary" (Open Vocabulary)

This is the magic part. Usually, robots need to be trained on thousands of pictures of "chairs" to know what a chair is. JOPP-3D doesn't need that.

It uses a pre-trained "brain" (like CLIP) that already knows what a "chair," a "dustbin," or a "construction pipe" looks like because it has learned from hundreds of millions of images paired with text descriptions.
You can simply type: "Show me the pipes."
The robot matches the word "pipes" to the visual features of the pipes in the 3D map. It doesn't need to have been taught "pipe" specifically; it just understands the concept.

4. The "Double-Check" System (3D to Panoramic Alignment)

Finally, the robot needs to make sure its 3D understanding matches what it sees in the 360-degree photos.

Imagine looking at a 3D model of a room and a 360-degree photo of the same room. Sometimes, the 3D model has holes (missing data) where the camera couldn't see.
JOPP-3D acts like a bridge. It takes the 3D labels it figured out and "paints" them back onto the 360-degree photo. If the 3D model missed a spot near a doorway, it uses the depth information from the photo to fill in the gap, ensuring the robot has a complete, consistent understanding of the whole scene.

Why is this a big deal?

No More Memorization: You don't need to retrain the robot every time you add a new type of object to a room. Just ask for it by name.
Seamless Vision: It connects the flat world (photos) and the 3D world (point clouds), so the robot understands depth and layout, not just 2D shapes.
Real-World Ready: It works in messy, unstructured places (like construction sites or offices) where you can't always get perfect data.

In short: JOPP-3D is like giving a robot a pair of 3D glasses and a dictionary at the same time. It can look at a room, understand the 3D layout, and instantly find anything you ask for by name, without needing a crash course in every new object it encounters.

1. Problem Statement

Semantic segmentation in complex real-world environments faces two primary challenges:

Data Scarcity: Traditional methods rely heavily on large-scale, manually annotated datasets, which are expensive and impractical to generate for unstructured or dynamic environments.
Limited Generalization: Existing models are typically constrained to fixed class sets (closed-vocabulary) and specific modalities (either 2D images or 3D point clouds). They struggle to generalize to new object categories or handle the joint interpretation of panoramic imagery (360° coverage) and 3D point clouds (geometric fidelity) simultaneously.

The paper addresses the need for a label-free, open-vocabulary framework that can perform semantic segmentation on both panoramic images and 3D point clouds using natural language queries, bridging the gap between 2D and 3D understanding without requiring task-specific training.

2. Methodology

JOPP-3D proposes a unified, three-stage framework that leverages pre-trained foundation models: the vision-language model (VLM) CLIP and the Segment Anything Model (SAM). The pipeline requires no task-specific training (the variant that uses Mask3D proposals is described as weakly-supervised) and consists of the following components:

A. Tangential Decomposition (3.1)

To overcome the geometric distortions inherent in panoramic (equirectangular) images, which hinder standard VLMs, the authors introduce a Tangential Decomposition process:

Process: A panoramic RGB-D image is projected onto the 20 faces of a regular icosahedron.
Output: This generates 20 tangential perspective images, each with a 100° field of view (FoV). This is wider than the FoV used in previous polyhedral approaches (e.g., 73.1°), and the resulting inter-view overlap reduces boundary discontinuities.
3D Reconstruction: Depth maps are corrected and transformed into 3D coordinates for each face. Aggregating these faces across all panoramas in a scene creates a unified, colored 3D point cloud ( $P_{3D}$ ).
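
As an illustration of the decomposition described above, the following minimal Python sketch computes the 20 face-center directions of a regular icosahedron and maps a pixel of a tangent image to equirectangular coordinates through a gnomonic (tangent-plane) projection. The function names, the axis convention (y up, z forward), and the pixel-center convention are assumptions made for this example; they are not taken from the paper.

```python
import math
from itertools import combinations

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def icosahedron_face_centers():
    """Return 20 unit vectors pointing at the face centers of a regular icosahedron."""
    phi = (1 + math.sqrt(5)) / 2
    # 12 vertices: cyclic permutations of (0, +-1, +-phi); edge length is 2.
    verts = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]

    def is_edge(p, q):
        return abs(math.dist(p, q) - 2.0) < 1e-9

    centers = []
    # A face is any vertex triple whose three pairs are all edges.
    for p, q, r in combinations(verts, 3):
        if is_edge(p, q) and is_edge(q, r) and is_edge(p, r):
            mean = tuple((p[i] + q[i] + r[i]) / 3 for i in range(3))
            centers.append(normalize(mean))
    return centers

def tangent_ray(center, fov_deg, size, u, v):
    """Unit viewing direction of pixel (u, v) in a size x size tangent image."""
    t = math.tan(math.radians(fov_deg) / 2)
    # Pick a reference axis that is not parallel to the viewing direction.
    ref = (0.0, 1.0, 0.0) if abs(center[1]) < 0.99 else (1.0, 0.0, 0.0)
    right = normalize(cross(ref, center))
    up = cross(center, right)
    x = (2 * (u + 0.5) / size - 1) * t   # horizontal offset on the tangent plane
    y = (1 - 2 * (v + 0.5) / size) * t   # vertical offset (image rows grow downward)
    return normalize(tuple(center[i] + x * right[i] + y * up[i] for i in range(3)))

def direction_to_equirect(d, width, height):
    """Map a unit direction to (column, row) in an equirectangular panorama."""
    lon = math.atan2(d[0], d[2])
    lat = math.asin(max(-1.0, min(1.0, d[1])))
    col = (lon / (2 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row

centers = icosahedron_face_centers()
ray = tangent_ray(centers[0], 100.0, 5, 2, 2)   # center pixel of a 5 x 5 tangent view
print(len(centers), direction_to_equirect(ray, 2048, 1024))
```

Sampling the panorama (and its depth map) at the coordinates returned for every pixel of every face would yield the 20 perspective views; interpolation and depth correction are omitted here.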

B. 3D Instance Extraction and Semantic Alignment (3.2)

To enable open-vocabulary reasoning, the system extracts object-agnostic 3D instances and aligns them with language embeddings:

Instance Proposal: The system generates 3D instance masks using either:
- Mask3D: A supervised model pre-trained on S3DIS (Weakly-supervised variant).
- SAM3D: An unsupervised approach using 2D SAM proposals and depth maps (Unsupervised variant).
2D Projection & Masking: For each 3D instance, the system projects its points onto the tangential perspective images. It selects the top- $K$ views with the most pixel matches.
Feature Aggregation: Using the 2D instance masks (generated by SAM on the tangential crops), the system crops the corresponding image regions. These crops are passed through the CLIP image encoder.
Embedding: The final 3D semantic embedding for an instance is the normalized average of the CLIP features from the top- $K$ views. This allows the 3D instance to be queried by natural language.
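
The view selection, feature averaging, and language query described above can be sketched in a few lines of Python. This is a toy illustration only: the three-dimensional vectors stand in for real CLIP features, the function names are hypothetical, and normalizing each per-view feature before averaging is an assumption rather than a detail confirmed by the paper.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def instance_embedding(view_features, pixel_counts, k):
    """Average the features of the k views in which the instance covers the most pixels."""
    ranked = sorted(range(len(view_features)),
                    key=lambda i: pixel_counts[i], reverse=True)[:k]
    unit = [l2_normalize(view_features[i]) for i in ranked]
    dim = len(unit[0])
    mean = [sum(f[d] for f in unit) / len(unit) for d in range(dim)]
    return l2_normalize(mean)

def best_label(instance_vec, text_embeddings):
    """Return the query whose (normalized) text embedding has the highest cosine similarity."""
    scores = {
        label: sum(a * b for a, b in zip(instance_vec, l2_normalize(vec)))
        for label, vec in text_embeddings.items()
    }
    return max(scores, key=scores.get)

# Toy stand-ins for image-encoder outputs of one 3D instance seen in three views.
views = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 0.0]]
counts = [500, 300, 20]            # pixels of the instance visible in each view
embedding = instance_embedding(views, counts, k=2)

# Toy stand-ins for text-encoder outputs of two queries.
queries = {"chair": [1.0, 0.0, 0.0], "pipe": [0.0, 1.0, 0.0]}
print(best_label(embedding, queries))
```

Keeping only the top-K views discards the third view, where the instance is barely visible and its feature would mostly describe other content.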

C. 3D to Panoramic Semantic Extraction (3.3)

The system projects the learned 3D semantics back to the panoramic domain to create dense semantic maps:

Back-Projection: 3D points are transformed into the panoramic camera coordinate system using camera poses and depth maps.
Nearest-Neighbor Matching: Each pixel in the panoramic image is assigned the semantic label of its nearest 3D neighbor in the semantic point cloud.
Depth Correspondence Consistency: To handle gaps (e.g., through doorways or corridors) where direct projection fails, the method introduces a depth correspondence strategy. It identifies overlapping depth regions between adjacent panoramic scenes and propagates semantic labels from one view to another, ensuring scene-level consistency.
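
The back-projection and nearest-neighbor steps can be illustrated with a small Python sketch. It lifts each panoramic pixel to a world-space point using its depth and the camera pose, then copies the label of the closest point in a labeled point cloud. The equirectangular convention (y up, z forward), the camera-to-world pose format, and all function names are assumptions for this example, and the depth correspondence step is not covered.

```python
import math

def pixel_to_world(col, row, depth, width, height, rotation, translation):
    """Lift an equirectangular pixel with known depth to a world-space 3D point."""
    lon = (col / width - 0.5) * 2 * math.pi
    lat = (0.5 - row / height) * math.pi
    direction = (math.cos(lat) * math.sin(lon),
                 math.sin(lat),
                 math.cos(lat) * math.cos(lon))
    p_cam = [depth * c for c in direction]
    # rotation is a 3x3 camera-to-world matrix, translation the camera position.
    return tuple(sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
                 for i in range(3))

def nearest_label(point, labeled_points):
    """Brute-force nearest neighbor over a list of ((x, y, z), label) pairs."""
    return min(labeled_points, key=lambda item: math.dist(point, item[0]))[1]

def label_panorama(depth_map, labeled_points, rotation, translation):
    height, width = len(depth_map), len(depth_map[0])
    labels = []
    for row in range(height):
        out_row = []
        for col in range(width):
            depth = depth_map[row][col]
            if depth <= 0:
                out_row.append(None)   # no valid depth: leave the pixel unlabeled
                continue
            point = pixel_to_world(col + 0.5, row + 0.5, depth,
                                   width, height, rotation, translation)
            out_row.append(nearest_label(point, labeled_points))
        labels.append(out_row)
    return labels

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cloud = [((0.0, 0.0, 2.0), "wall"), ((0.0, 0.0, -2.0), "door")]
depth_map = [[2.0, 2.0, 2.0, 2.0],
             [0.0, 2.0, 2.0, 2.0]]
print(label_panorama(depth_map, cloud, identity, (0.0, 0.0, 0.0)))
```

The brute-force search keeps the sketch dependency-free; for real point clouds a spatial index such as a k-d tree would be the usual choice.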

3. Key Contributions

First Joint Open-Vocabulary Framework: JOPP-3D is the first approach to perform open-vocabulary semantic segmentation jointly on 3D point clouds and panoramic images.
Tangential Decomposition Pipeline: A novel method to adapt panoramic inputs for VLMs by decomposing them into wide-field tangential perspectives, mitigating distortion and boundary artifacts without learning deformations.
3D-to-Panoramic Propagation: A depth-correspondence-based method to propagate semantic labels from 3D instances back to panoramic images, ensuring multi-view consistency.
Label-Free/Weakly-Supervised Variants: The framework demonstrates feasibility in both unsupervised (using SAM3D) and weakly-supervised (using Mask3D) settings, eliminating the need for extensive class-specific annotations.

4. Experimental Results

The method was evaluated on two datasets: Stanford-2D-3D-s (indoor scenes) and ToF-360 (Time-of-Flight depth + panoramas).

3D Segmentation (S3DIS):
- JOPP-3D (Weakly-supervised) achieved 80.9% mIoU and 87.0% mAcc, significantly outperforming the previous SOTA (PointTransformerV3 at 73.4% mIoU) and open-vocabulary baselines like OpenMask3D (36.7% mIoU).
- The unsupervised variant (JOPP-3D(u)) achieved 59.4% mIoU, surpassing supervised closed-vocabulary methods in some contexts.
Panoramic Segmentation (Stanford-2D-3D-s):
- JOPP-3D achieved 70.1% closed-vocabulary mIoU and 74.6% open-vocabulary mIoU, setting a new state-of-the-art.
- It outperformed specialized panoramic methods (e.g., PanoSAMic, 360BEV) and open-vocabulary baselines (OPS, OpenMask3D).
Zero-Shot Performance: On the ToF-360 dataset (a challenging zero-shot benchmark), the unsupervised variant showed clear improvements over existing zero-shot methods.
Ablation Studies: Removing any component (SAM masking, tangential decomposition, or depth correspondence) significantly degrades performance. In particular, masking the SAM crops was crucial to prevent semantic contamination from background objects in large instances (e.g., floors and ceilings).
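
For readers unfamiliar with the metrics reported above, the sketch below shows the standard way mean intersection-over-union (mIoU) and mean class accuracy (mAcc) are computed from a confusion matrix. The numbers are made up for illustration and are unrelated to the paper's results; skipping classes that are absent from the ground truth is an assumption, since evaluation protocols differ on this point.

```python
def miou_macc(confusion):
    """confusion[i][j]: number of samples with ground-truth class i predicted as class j."""
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other classes predicted as c
        if tp + fn == 0:
            continue  # class absent from the ground truth: skipped (an assumption)
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Illustrative two-class confusion matrix (not data from the paper).
miou, macc = miou_macc([[8, 2], [1, 9]])
print(round(miou, 4), round(macc, 4))
```

For this toy matrix the per-class IoUs are 8/11 and 9/12 and the per-class accuracies are 0.8 and 0.9, so mIoU is lower than mAcc because IoU also penalizes false positives.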

5. Significance

Unified Perception: JOPP-3D bridges the gap between 2D panoramic understanding and 3D spatial reasoning, providing a cohesive view of the environment.
Scalability: By leveraging pre-trained foundation models (CLIP, SAM) and avoiding task-specific training, the framework is highly scalable to new environments and object categories without retraining.
Practical Utility: The ability to query specific objects (e.g., "dustbin," "clock") in 3D space and panoramic views, even when ground-truth labels are generic (e.g., "clutter"), demonstrates superior utility for robotics and autonomous systems operating in open-world scenarios.
Efficiency: Despite the computational cost of inference, the method is training-free, offering a favorable trade-off compared to methods requiring massive GPU resources for supervised training.

In conclusion, JOPP-3D represents a significant step forward in open-vocabulary scene understanding, enabling flexible, language-driven interaction with both 2D and 3D representations of the world.