RePer-360: Releasing Perspective Priors for 360° Depth Estimation via Self-Modulation

RePer-360 is a distortion-aware self-modulation framework that adapts perspective-trained depth foundation models to 360° panoramic depth estimation. It preserves pretrained priors through a lightweight geometry-aligned guidance module and a Self-Conditioned AdaLN-Zero mechanism, achieving superior performance with only 1% of the training data.

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang

Published 2026-03-09

Imagine you have a brilliant, world-class architect who has spent their entire career designing beautiful, standard houses. They are an expert at understanding perspective, depth, and how walls meet floors in a normal room. This architect is your AI model.

Now, you ask this architect to design a house that wraps all the way around in a perfect circle—a 360-degree panoramic room.

The problem? The architect gets confused. Their brain is wired for flat, straight lines (perspective), but a 360-degree view is curved and distorted. If you just force them to look at the circle, they might get dizzy and make mistakes. If you try to teach them everything from scratch using thousands of new blueprints (panoramic data), it takes forever and costs a fortune.

RePer-360 is the clever solution the researchers invented to fix this. Here is how it works, broken down into simple concepts:

1. The Problem: The "Fishbowl" Effect

Standard cameras see the world like a flat painting. 360-degree cameras see the world like a fishbowl or a globe. When you flatten a globe onto a piece of paper (like a map), the poles get stretched out and the equator gets squished.

  • The Issue: The AI's "brain" (trained on flat photos) sees these stretched areas as weird, broken shapes. It tries to apply flat-house rules to a curved world, leading to depth errors (like thinking a wall is closer than it really is).

2. The Old Ways (And Why They Failed)

Before this paper, researchers tried two main things:

  • The "Patchwork" Method: They chopped the 360-degree image into tiny square pieces (like cutting a pizza into slices), asked the architect to look at each slice individually, and then glued the answers back together.
    • The Flaw: This is slow, clunky, and often leaves ugly seams where the slices don't match up.
  • The "Re-Training" Method: They tried to re-teach the architect from scratch using thousands of panoramic photos.
    • The Flaw: This requires a massive amount of data (like needing 120,000 blueprints) and risks making the architect forget the great skills they already had about standard houses.

3. The RePer-360 Solution: "The Self-Modulation Guide"

Instead of forcing the architect to learn a new job or chopping up the image, RePer-360 acts like a specialized translator or a guide that sits next to the architect.

Here is the magic trick, using an analogy:

The Two Lenses (ERP and CP)

Imagine the architect is looking at the room through two different glasses at the same time:

  1. Glasses A (ERP, equirectangular projection): The standard panoramic view. It sees the whole room at once, but it is distorted (stretched toward the top and bottom).
  2. Glasses B (CP, cubemap projection): Imagine the room is inside a cube. This view looks at the room from six flat sides (Front, Back, Left, Right, Top, Bottom). Each face is a perfect, undistorted square, but the room is broken into six separate pieces.
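To make the "stretched at the top and bottom" part of Glasses A concrete: in an equirectangular (ERP) image, a pixel row at latitude φ covers a horizontal arc proportional to cos(φ), so content is stretched by roughly 1/cos(φ). This is a generic property of the projection, not code from the paper; the function name is just an illustration.

```python
import math

def erp_horizontal_stretch(row: int, height: int) -> float:
    """Horizontal stretch factor for a pixel row in an equirectangular (ERP) image.

    Rows map linearly to latitude in (-90°, +90°); content is stretched
    by roughly 1 / cos(latitude), so it balloons near the poles.
    """
    # Latitude of the row center, in radians (+pi/2 at the top row).
    lat = (0.5 - (row + 0.5) / height) * math.pi
    return 1.0 / max(math.cos(lat), 1e-6)

# The equator row is barely stretched; rows near the poles balloon.
print(erp_horizontal_stretch(256, 512))  # ≈ 1.0 (equator)
print(erp_horizontal_stretch(10, 512))   # large (near the top of the image)
```

This is exactly the non-uniform distortion that breaks a perspective-trained model's "flat-house rules."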

The "Geometry-Aligned Guidance" (The Translator)

The system takes the "six-piece" view (Glasses B) and the "distorted whole" view (Glasses A) and compares them.

  • It notices: "Hey, the top of the room looks stretched in Glasses A, but it looks normal in Glasses B."
  • It creates a map of corrections based on this comparison. It doesn't force the architect to see the six pieces; instead, it whispers to the architect: "Hey, when you see this stretched area, remember it's actually flat. Adjust your thinking slightly."
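The "map of corrections" idea can be sketched in a few lines. This is a toy stand-in, not the paper's module: the real guidance operates on learned deep features, and the function and variable names here are hypothetical. The point is the shape of the computation: compare the distorted view against the distortion-free view (warped back into the same layout) and emit a gentle correction signal.

```python
import numpy as np

def guidance_map(erp_feat: np.ndarray, cube_feat_in_erp: np.ndarray) -> np.ndarray:
    """Toy guidance signal: where the distorted ERP features disagree with the
    distortion-free cubemap features (already warped back into ERP layout),
    emit a correction. A hypothetical stand-in for the paper's learned module."""
    # Residual between the two "lenses"; large values mark distorted regions.
    residual = cube_feat_in_erp - erp_feat
    # Squash so the guidance stays a gentle nudge rather than an overwrite.
    return np.tanh(residual)

# Toy 4x8 feature maps: pretend the ERP view over-stretches the top row.
erp = np.ones((4, 8))
erp[0] *= 3.0            # distorted top-of-image features
cube = np.ones((4, 8))   # undistorted reference, warped into ERP layout
g = guidance_map(erp, cube)
print(g[0, 0] < 0, g[2, 0] == 0)  # correction at the top, none at the equator
```

The squashing step reflects the "whisper, don't shout" design: the guidance nudges the architect's reading of distorted regions rather than replacing it.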

The "Self-Modulation" (The Volume Knob)

This is the most important part. The system doesn't overwrite the architect's brain. Instead, it uses Self-Conditioned AdaLN-Zero.

  • Think of the architect's brain as a radio playing a perfect song (the pre-trained knowledge).
  • The new system doesn't change the song; it just turns up the volume or bass on specific notes depending on the room's shape.
  • It adds tiny "scaling factors" (like a volume knob) to the architect's neurons. If the architect is looking at a distorted ceiling, the system turns the knob to say, "Don't panic, this is just a distortion, not a real depth change."
  • Crucially: It starts with the volume knob set to zero. This means at the very beginning, the architect acts exactly as they did before (safe and stable). As it learns, it slowly turns the knob up only where needed.
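The "volume knob starts at zero" trick is easy to show in code. Below is a minimal, illustrative AdaLN-Zero-style block (not the paper's implementation; the class and weight names are assumptions): a conditioning signal predicts per-channel scale and shift, and a gate `alpha` initialized to zero multiplies the whole modulation, so at step 0 the block is an exact identity and the pretrained features pass through untouched.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZero:
    """Minimal AdaLN-Zero-style modulation (illustrative, not the paper's code)."""
    def __init__(self, dim: int):
        rng = np.random.default_rng(0)
        self.w_scale = rng.normal(0, 0.02, (dim, dim))  # predicts scale from condition
        self.w_shift = rng.normal(0, 0.02, (dim, dim))  # predicts shift from condition
        self.alpha = np.zeros(dim)                      # the zero-initialized "volume knob"

    def __call__(self, x: np.ndarray, cond: np.ndarray) -> np.ndarray:
        scale = cond @ self.w_scale
        shift = cond @ self.w_shift
        modulated = layer_norm(x) * (1 + scale) + shift
        # With alpha == 0, the output is exactly x: pretrained priors preserved.
        return x + self.alpha * modulated

x = np.random.default_rng(1).normal(size=(2, 16))   # pretrained features
cond = np.ones((2, 16))                             # e.g. the geometry guidance signal
block = AdaLNZero(16)
print(np.allclose(block(x, cond), x))  # True at initialization
```

During training, gradient updates move `alpha` away from zero only where the guidance signal says a correction is needed, which is exactly the "slowly turn the knob up" behavior described above.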

4. The "Cube Consistency" Rule (The Safety Net)

To make sure the architect doesn't get confused, the system adds a rule: "If you think the ceiling is close in the 'Front' view, you must think it's the same distance in the 'Top' view."
This is called the E2C Consistency Loss. It forces the AI to agree with itself across different angles, preventing it from hallucinating weird depths just because the image looks stretched.
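The consistency rule can be sketched as a simple penalty. This is a toy version under stated assumptions: the ERP depth map is assumed to already be resampled onto the six cube faces, and an L1 penalty is one plausible, hypothetical choice for the mismatch measure; the paper's exact loss may differ.

```python
import numpy as np

def e2c_consistency_loss(erp_depth_on_faces: np.ndarray,
                         face_depths: np.ndarray) -> float:
    """Toy E2C (equirect-to-cube) consistency penalty: the ERP depth map,
    resampled onto the six cube faces, should match the depths predicted
    in each face view. L1 is one simple, hypothetical choice."""
    return float(np.mean(np.abs(erp_depth_on_faces - face_depths)))

# Six 4x4 faces; a self-consistent prediction incurs zero penalty.
faces = np.full((6, 4, 4), 2.5)
print(e2c_consistency_loss(faces, faces))  # 0.0
# A face that disagrees with the ERP view is penalized.
wrong = faces.copy()
wrong[5] += 1.0
print(e2c_consistency_loss(faces, wrong))  # > 0
```

Minimizing this penalty during training is what forces the model to "agree with itself" across viewpoints instead of hallucinating depth from the stretch.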

The Result: Super Efficient and Accurate

Because RePer-360 is so smart about how it adjusts the architect:

  • It needs almost no data: It learned to do this with only 1% of the data other methods needed (1,000 images instead of 120,000).
  • It keeps the original skills: It didn't forget how to see depth in normal rooms; it just learned how to handle the 360-degree twist.
  • It's faster: No need to chop the image into pieces and glue it back together.

In a nutshell: RePer-360 doesn't try to rebuild the AI's brain. Instead, it gives the AI a pair of smart glasses and a set of volume knobs, allowing it to instantly understand 360-degree worlds while keeping all the knowledge it already had about the flat world.