RePer-360: Releasing Perspective Priors for 360° Depth Estimation via Self-Modulation

RePer-360 is a distortion-aware self-modulation framework that adapts perspective-trained depth foundation models to 360° panoramic depth estimation. It preserves pretrained priors through a lightweight geometry-aligned guidance module and a Self-Conditioned AdaLN-Zero mechanism, achieving superior performance with only 1% of the training data.

Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang

Published 2026-03-09

Imagine you have a brilliant, world-class architect who has spent their entire career designing beautiful, standard houses. They are an expert at understanding perspective, depth, and how walls meet floors in a normal room. This architect is your AI model.

Now, you ask this architect to design a house that wraps all the way around in a perfect circle—a 360-degree panoramic room.

The problem? The architect gets confused. Their brain is wired for flat, straight lines (perspective), but a 360-degree view is curved and distorted. If you just force them to look at the circle, they might get dizzy and make mistakes. If you try to teach them everything from scratch using thousands of new blueprints (panoramic data), it takes forever and costs a fortune.

RePer-360 is the clever solution the researchers invented to fix this. Here is how it works, broken down into simple concepts:

1. The Problem: The "Fishbowl" Effect

Standard cameras see the world like a flat painting. 360-degree cameras see the world like a fishbowl or a globe. When you flatten a globe onto a piece of paper (like a map), the poles get stretched out and the equator gets squished.

  • The Issue: The AI's "brain" (trained on flat photos) sees these stretched areas as weird, broken shapes. It tries to apply flat-house rules to a curved world, leading to depth errors (like thinking a wall is closer than it really is).

2. The Old Ways (And Why They Failed)

Before this paper, researchers tried two main things:

  • The "Patchwork" Method: They chopped the 360-degree image into tiny square pieces (like cutting a pizza into slices), asked the architect to look at each slice individually, and then glued the answers back together.
    • The Flaw: This is slow, clunky, and often leaves ugly seams where the slices don't match up.
  • The "Re-Training" Method: They tried to re-teach the architect from scratch using thousands of panoramic photos.
    • The Flaw: This requires a massive amount of data (like needing 120,000 blueprints) and risks making the architect forget the great skills they already had about standard houses.

3. The RePer-360 Solution: "The Self-Modulation Guide"

Instead of forcing the architect to learn a new job or chopping up the image, RePer-360 acts like a specialized translator or a guide that sits next to the architect.

Here is the magic trick, using an analogy:

The Two Lenses (ERP and CP)

Imagine the architect is looking at the room through two different glasses at the same time:

  1. Glasses A (ERP, equirectangular projection): The standard panoramic view. It sees the whole room at once, but it is distorted (stretched toward the top and bottom).
  2. Glasses B (CP, cubemap projection): Imagine the room is inside a cube. This view looks at the room from six flat sides (Front, Back, Left, Right, Top, Bottom). Each face is a perfect, undistorted square, but the room is broken into six separate pieces.
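To make the "stretched at the top and bottom" part of Glasses A concrete: in an equirectangular (ERP) image, a pixel row at latitude φ covers a horizontal arc proportional to cos(φ), so content is stretched by roughly 1/cos(φ). This is a generic property of the projection, not code from the paper; the function name is just an illustration.

```python
import math

def erp_horizontal_stretch(row: int, height: int) -> float:
    """Horizontal stretch factor for a pixel row in an equirectangular (ERP) image.

    Rows map linearly to latitude in (-90°, +90°); content is stretched
    by roughly 1 / cos(latitude), so it balloons near the poles.
    """
    # Latitude of the row center, in radians (+pi/2 at the top row).
    lat = (0.5 - (row + 0.5) / height) * math.pi
    return 1.0 / max(math.cos(lat), 1e-6)

# The equator row is barely stretched; rows near the poles balloon.
print(erp_horizontal_stretch(256, 512))  # ≈ 1.0 (equator)
print(erp_horizontal_stretch(10, 512))   # large (near the top of the image)
```

This is exactly the non-uniform distortion that breaks a perspective-trained model's "flat-house rules."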

The "Geometry-Aligned Guidance" (The Translator)

The system takes the "six-piece" view (Glasses B) and the "distorted whole" view (Glasses A) and compares them.

  • It notices: "Hey, the top of the room looks stretched in Glasses A, but it looks normal in Glasses B."
  • It creates a map of corrections based on this comparison. It doesn't force the architect to see the six pieces; instead, it whispers to the architect: "Hey, when you see this stretched area, remember it's actually flat. Adjust your thinking slightly."
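The "map of corrections" idea can be sketched in a few lines. This is a toy stand-in, not the paper's module: the real guidance operates on learned deep features, and the function and variable names here are hypothetical. The point is the shape of the computation: compare the distorted view against the distortion-free view (warped back into the same layout) and emit a gentle correction signal.

```python
import numpy as np

def guidance_map(erp_feat: np.ndarray, cube_feat_in_erp: np.ndarray) -> np.ndarray:
    """Toy guidance signal: where the distorted ERP features disagree with the
    distortion-free cubemap features (already warped back into ERP layout),
    emit a correction. A hypothetical stand-in for the paper's learned module."""
    # Residual between the two "lenses"; large values mark distorted regions.
    residual = cube_feat_in_erp - erp_feat
    # Squash so the guidance stays a gentle nudge rather than an overwrite.
    return np.tanh(residual)

# Toy 4x8 feature maps: pretend the ERP view over-stretches the top row.
erp = np.ones((4, 8))
erp[0] *= 3.0            # distorted top-of-image features
cube = np.ones((4, 8))   # undistorted reference, warped into ERP layout
g = guidance_map(erp, cube)
print(g[0, 0] < 0, g[2, 0] == 0)  # correction at the top, none at the equator
```

The squashing step reflects the "whisper, don't shout" design: the guidance nudges the architect's reading of distorted regions rather than replacing it.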

The "Self-Modulation" (The Volume Knob)

This is the most important part. The system doesn't overwrite the architect's brain. Instead, it uses Self-Conditioned AdaLN-Zero.

  • Think of the architect's brain as a radio playing a perfect song (the pre-trained knowledge).
  • The new system doesn't change the song; it just turns up the volume or bass on specific notes depending on the room's shape.
  • It adds tiny "scaling factors" (like a volume knob) to the architect's neurons. If the architect is looking at a distorted ceiling, the system turns the knob to say, "Don't panic, this is just a distortion, not a real depth change."
  • Crucially: It starts with the volume knob set to zero. This means at the very beginning, the architect acts exactly as they did before (safe and stable). As it learns, it slowly turns the knob up only where needed.
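The "volume knob starts at zero" trick is easy to show in code. Below is a minimal, illustrative AdaLN-Zero-style block (not the paper's implementation; the class and weight names are assumptions): a conditioning signal predicts per-channel scale and shift, and a gate `alpha` initialized to zero multiplies the whole modulation, so at step 0 the block is an exact identity and the pretrained features pass through untouched.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZero:
    """Minimal AdaLN-Zero-style modulation (illustrative, not the paper's code)."""
    def __init__(self, dim: int):
        rng = np.random.default_rng(0)
        self.w_scale = rng.normal(0, 0.02, (dim, dim))  # predicts scale from condition
        self.w_shift = rng.normal(0, 0.02, (dim, dim))  # predicts shift from condition
        self.alpha = np.zeros(dim)                      # the zero-initialized "volume knob"

    def __call__(self, x: np.ndarray, cond: np.ndarray) -> np.ndarray:
        scale = cond @ self.w_scale
        shift = cond @ self.w_shift
        modulated = layer_norm(x) * (1 + scale) + shift
        # With alpha == 0, the output is exactly x: pretrained priors preserved.
        return x + self.alpha * modulated

x = np.random.default_rng(1).normal(size=(2, 16))   # pretrained features
cond = np.ones((2, 16))                             # e.g. the geometry guidance signal
block = AdaLNZero(16)
print(np.allclose(block(x, cond), x))  # True at initialization
```

During training, gradient updates move `alpha` away from zero only where the guidance signal says a correction is needed, which is exactly the "slowly turn the knob up" behavior described above.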

4. The "Cube Consistency" Rule (The Safety Net)

To make sure the architect doesn't get confused, the system adds a rule: "If you think the ceiling is close in the 'Front' view, you must think it's the same distance in the 'Top' view."
This is called the E2C Consistency Loss. It forces the AI to agree with itself across different angles, preventing it from hallucinating weird depths just because the image looks stretched.
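The consistency rule can be sketched as a simple penalty. This is a toy version under stated assumptions: the ERP depth map is assumed to already be resampled onto the six cube faces, and an L1 penalty is one plausible, hypothetical choice for the mismatch measure; the paper's exact loss may differ.

```python
import numpy as np

def e2c_consistency_loss(erp_depth_on_faces: np.ndarray,
                         face_depths: np.ndarray) -> float:
    """Toy E2C (equirect-to-cube) consistency penalty: the ERP depth map,
    resampled onto the six cube faces, should match the depths predicted
    in each face view. L1 is one simple, hypothetical choice."""
    return float(np.mean(np.abs(erp_depth_on_faces - face_depths)))

# Six 4x4 faces; a self-consistent prediction incurs zero penalty.
faces = np.full((6, 4, 4), 2.5)
print(e2c_consistency_loss(faces, faces))  # 0.0
# A face that disagrees with the ERP view is penalized.
wrong = faces.copy()
wrong[5] += 1.0
print(e2c_consistency_loss(faces, wrong))  # > 0
```

Minimizing this penalty during training is what forces the model to "agree with itself" across viewpoints instead of hallucinating depth from the stretch.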

The Result: Super Efficient and Accurate

Because RePer-360 is so smart about how it adjusts the architect:

  • It needs almost no data: It learned to do this with only 1% of the data other methods needed (1,000 images instead of 120,000).
  • It keeps the original skills: It didn't forget how to see depth in normal rooms; it just learned how to handle the 360-degree twist.
  • It's faster: No need to chop the image into pieces and glue it back together.

In a nutshell: RePer-360 doesn't try to rebuild the AI's brain. Instead, it gives the AI a pair of smart glasses and a set of volume knobs, allowing it to instantly understand 360-degree worlds while keeping all the knowledge it already had about the flat world.