Imagine you are trying to teach a super-smart robot (a "Large Vision-Language Model" or LVLM) how to understand a messy living room. The robot has a brain (a Large Language Model) that is great at reading and talking, but it's bad at understanding space.
To help the robot see the room, we give it a 3D scan of the furniture (a "point cloud"). However, the way we currently tell the robot where things are is like trying to describe a 3D room using a flat, 1D list of numbers.
The Problem: The "Flat List" Confusion
Currently, the robot looks at the 3D scan and flattens it into a long line of tokens, like beads on a string. It uses a standard rule called RoPE (Rotary Position Embedding) to remember where each bead is.
Think of RoPE like a conveyor belt in a factory.
- The Issue: On a conveyor belt, the only thing that matters is "how far down the belt you are."
- The Result: If you put a lamp and a chair right next to each other in the room, but they happen to be far apart on the conveyor belt, the robot thinks they are totally unrelated. Conversely, if two objects are far apart in the room but happen to be close on the belt, the robot thinks they are neighbors.
- The Consequence: The robot gets "spatially confused." It might focus all its attention on one tiny corner of the room (a "hotspot") and ignore the rest, or it might think a door is actually a wall because it lost track of the 3D angles.
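The "conveyor belt" behavior above can be seen in a tiny sketch of vanilla RoPE on a single 2D feature pair (real models use many such pairs per head; the helper names here are mine, not from the paper). The attention score depends only on the *difference* of the two 1D sequence positions, so two tokens 2 steps apart score the same whether they sit at positions 10–12 or 100–102, regardless of where they are in the room:

```python
import math

def rope_rotate(vec, pos, freq=1.0):
    """Rotate a 2D feature pair by an angle proportional to its
    1D sequence position -- the core mechanism of standard RoPE."""
    angle = pos * freq
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q = (1.0, 0.0)  # toy query features
k = (1.0, 0.0)  # toy key features

# Same positional offset (2) at two very different places on the "belt":
s1 = dot(rope_rotate(q, 10), rope_rotate(k, 12))
s2 = dot(rope_rotate(q, 100), rope_rotate(k, 102))
print(abs(s1 - s2) < 1e-9)  # True: only the 1D offset matters
```

This is exactly why a lamp and a chair that are adjacent in 3D can look "far apart" to the model if their tokens landed far apart in the flattened sequence.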
The Solution: SoPE (The "Globe" Approach)
The authors of the SoPE paper say: "Stop using a flat conveyor belt! Let's use a globe."

They propose a new way to tag the 3D points called Spherical Coordinate-based Positional Embedding (SoPE).
Here is the analogy:
Instead of giving the robot a flat list of numbers, they give every object in the room a GPS coordinate on a sphere.
- Distance (Radius): How far is the object from the center of the room?
- Up/Down (Polar Angle): Is the object on the ceiling, the floor, or the middle?
- Left/Right (Azimuthal Angle): Is the object to the north, south, east, or west?
By using this "Globe" system, the robot finally understands that a lamp on the table is physically close to the table, even if they are far apart in the data list. It also understands that a door facing North is different from a door facing South.
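The "GPS on a sphere" tagging is just the standard Cartesian-to-spherical conversion (the paper's exact normalization and origin choice may differ; this sketch uses the textbook formulas, with radius from the origin, polar angle measured from "straight up," and azimuth from atan2):

```python
import math

def to_spherical(x, y, z):
    """Convert a 3D point to "globe" coordinates:
    radius (how far from the center), polar angle (up/down),
    azimuthal angle (left/right)."""
    r = math.sqrt(x * x + y * y + z * z)        # distance from the center
    theta = math.acos(z / r) if r > 0 else 0.0  # polar: 0 rad = straight up
    phi = math.atan2(y, x)                      # azimuth: compass direction
    return r, theta, phi

# A point one meter "east" of the center, at the center's height
# (hypothetical coordinates for illustration):
r, theta, phi = to_spherical(1.0, 0.0, 0.0)
print(r, math.degrees(theta), math.degrees(phi))  # 1.0 90.0 0.0
```

Two nearby points in the room get nearby (r, theta, phi) triples no matter where their tokens land in the flattened sequence, which is the property the conveyor-belt encoding was missing.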
The Secret Sauce: The "Radio Tuner" (Multi-Scale Mixing)
Just giving the robot GPS coordinates isn't enough. The room has big things (walls) and tiny things (keys on a table).
The authors added a Multi-Scale Frequency Mixing strategy. Imagine the robot has a radio with multiple tuners:
- Low-Frequency Tuner: Listens to the "big picture" (the layout of the room, the walls, the general flow).
- High-Frequency Tuner: Listens to the "fine details" (the sharp edges of a book, the specific angle of a cup).
SoPE mixes these signals together. This allows the robot to see the whole room and the tiny details simultaneously, without getting overwhelmed.
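The "radio tuner" idea can be sketched by encoding one spherical coordinate at a bank of frequencies, from slow-varying (coarse layout) to fast-varying (fine detail), in the style of sinusoidal positional embeddings. This is a generic multi-frequency encoding, not the paper's exact mixing weights; the function name and `base` constant are illustrative assumptions:

```python
import math

def multiscale_features(coord, n_freqs=4, base=10000.0):
    """Encode one coordinate at several frequencies.
    i=0 gives the highest frequency (sharp detail);
    i=n_freqs-1 gives the lowest (big-picture layout).
    Sketch only -- SoPE's actual mixing strategy may differ."""
    feats = []
    for i in range(n_freqs):
        freq = 1.0 / (base ** (i / n_freqs))  # geometric ladder of scales
        feats.append(math.sin(coord * freq))
        feats.append(math.cos(coord * freq))
    return feats

f = multiscale_features(1.25)
print(len(f))  # 8: a sin/cos pair at each of 4 scales
```

Concatenating (or mixing) the slow and fast channels is what lets one embedding describe both "which side of the room" and "which edge of the book" at the same time.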
Why Does This Matter? (The Real-World Test)
The researchers didn't just stop at theory. They put this new "SoPE brain" into a real robot and tested it in a real house.
- Before SoPE: The robot might try to walk into a wall or fail to pick up a small object because it couldn't "see" the spatial relationship correctly.
- After SoPE: The robot successfully navigated the room, found specific objects (like a book on a shelf), and moved them around. It understood the shape of the room and the direction things were facing.
Summary
SoPE is like upgrading a robot's brain from a 2D map (which flattens everything and loses depth) to a 3D hologram (which understands distance, height, and direction). This allows AI to finally "see" the world the way humans do: in three dimensions, with a clear sense of where everything is and how it's oriented.