Imagine you have a brilliant, world-class architect named Deep. Deep has spent years studying millions of blueprints and photos of normal rooms and city streets. Because of this massive training, Deep can look at a standard photo and instantly tell you exactly how far away the walls, cars, and people are. Deep is a "Foundational Monocular Depth Estimator" (FMDE).
But here's the problem: Deep has never seen a photo taken with a fisheye lens.
Fisheye lenses are like those fun, curved mirrors at carnivals. They let you see a huge, wide area (like a whole room or a 360-degree street view), but they warp the image. Straight lines look curved, and objects near the edges look stretched and squished.
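The difference between the two lens types can be sketched with their idealized projection formulas. This is a hedged illustration, assuming an equidistant fisheye model (image radius r = f·θ) versus a pinhole perspective model (r = f·tan θ); the focal length and angles are made-up numbers, not from any specific camera:

```python
import numpy as np

# Assumed models: equidistant fisheye (r = f * theta) vs. pinhole
# perspective (r = f * tan(theta)). f and the angles are illustrative.
f = 1.0
theta = np.deg2rad([10, 40, 70])   # angle away from the optical axis

r_pinhole = f * np.tan(theta)      # where a perspective camera puts the point
r_fisheye = f * theta              # where a fisheye camera puts the point

# Near the center the two nearly agree; toward the edge the fisheye
# squeezes a huge viewing angle into a small image radius -- the
# "stretched and squished" look described above.
for t, rp, rf in zip(np.rad2deg(theta), r_pinhole, r_fisheye):
    print(f"{t:.0f} deg: pinhole r={rp:.2f}, fisheye r={rf:.2f}")
```

At 70° off-axis the pinhole radius is more than double the fisheye radius, which is exactly why a depth model trained on perspective images misreads the edges of a fisheye frame.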
When you show Deep a fisheye photo, Deep gets confused. Deep tries to apply the rules learned from normal photos to this warped image. The result? Deep guesses the distances wrong. A wall might look like it's floating in space, or a car might look like it's miles away when it's right next to you. Deep is suffering from "covariate shift"—basically, the input data looks too different from what it was trained on.
The Old Solutions (And Why They Failed)
Before this paper, people tried two main ways to fix Deep:
- The "Ironing" Method: They tried to take the fisheye photo and mathematically "iron it out" (rectify it) to make it look like a normal photo before showing it to Deep.
  - The Problem: Ironing a wrinkled shirt often leaves creases or stretches the fabric. Similarly, "un-distorting" a fisheye image creates digital artifacts (blurry patches, stretched edges) that confuse Deep even more. And if the camera calibration is even slightly off, the ironing job is ruined.
- The "Retraining" Method: They tried to teach Deep a whole new set of rules specifically for fisheye lenses.
  - The Problem: There are very few fisheye photos available compared to normal ones. You can't teach a genius architect to build with a new material if you only have a handful of bricks. Worse, if you retrain Deep too much, it may forget how to build normal houses (catastrophic forgetting)!
The New Solution: "Calibration Tokens"
The authors of this paper came up with a clever, lightweight trick called Calibration Tokens.
Think of Deep's brain as a massive library of knowledge. When Deep looks at a normal photo, it pulls out the right books to figure out distances. When Deep looks at a fisheye photo, it pulls out the wrong books because the "spine" of the book (the image style) looks different.
Instead of rewriting all the books or ironing the photo, the authors invented a special bookmark called a Calibration Token.
- The Bookmark: This is a tiny, digital "note" that says, "Hey Deep, this photo is warped like a fisheye lens. Please adjust your reading glasses before you start."
- How it Works: They insert this tiny bookmark into the very first layer of Deep's brain (the part that processes the image). This bookmark doesn't change the photo itself; it just changes how Deep interprets the photo's hidden patterns.
- The Magic: The bookmark tells Deep, "Ignore the weird curves for a second; imagine this is a normal room again." Deep then uses its existing, super-smart knowledge to figure out the distances correctly.
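Mechanically, one plausible reading of "inserting a bookmark into the first layer" is prepending a small set of learnable vectors to the transformer's patch-token sequence, much like a [CLS] token. The sketch below shows that idea with a single token; all shapes and names are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Illustrative shapes for a ViT-style encoder: N patch tokens of width D.
# These numbers are assumptions for the sketch, not from the paper.
N, D = 196, 768
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((N, D)).astype(np.float32)

# The calibration token is one learned D-dimensional vector prepended to
# the patch sequence. During adaptation it is the ONLY trainable
# parameter; the pretrained encoder weights stay frozen.
calibration_token = np.zeros((1, D), dtype=np.float32)  # learned in practice

tokens = np.concatenate([calibration_token, patch_tokens], axis=0)
# Self-attention now mixes every patch token with the calibration token,
# so the "this image is warped" hint reaches all downstream features
# without modifying a single pixel or pretrained weight.
```

The key design point: the image and the frozen network are untouched; only the tiny extra token carries the lens information.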
How Did They Train the Bookmark? (The "Magic Mirror" Trick)
You might ask, "How do you teach a bookmark to fix a fisheye lens if you don't have enough fisheye photos to train on?"
The authors used a brilliant self-supervised trick:
- They took a normal photo (which Deep knows perfectly).
- They used a computer to artificially warp it into a fake fisheye photo.
- They showed the fake fisheye photo to Deep (with the bookmark active).
- Deep guessed the depth.
- Then, they took Deep's guess and un-warped it back to the original normal shape.
- They compared this "un-warped guess" to the depth map Deep predicts for the original, normal photo. If the two matched, the bookmark did its job; any mismatch became the training signal that adjusted the bookmark.
It's like training a translator by giving them a sentence in English, asking them to translate it to French and back to English, and checking if the final English sentence makes sense. They never needed a real fisheye photo to learn; they just needed to learn how to "undo" the distortion in their head.
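The loop above can be sketched end to end. Everything here is a toy stand-in (an invertible pixel permutation plays the fisheye warp, a single scalar plays the calibration token, and the frozen model is a dummy), but the self-supervised consistency loss is the real idea:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
perm = rng.permutation(H * W)        # toy, invertible "fisheye" warp
inv_perm = np.argsort(perm)

def warp(x):    # perspective -> fake fisheye
    return x.reshape(-1)[perm].reshape(H, W)

def unwarp(x):  # fake fisheye -> back to perspective
    return x.reshape(-1)[inv_perm].reshape(H, W)

def frozen_depth_model(img, token):
    # Dummy frozen FMDE: depth = image plus a bias set by the token.
    return img + token

image = rng.standard_normal((H, W))
target_depth = frozen_depth_model(image, token=0.0)  # trusted perspective output

# Train the (here: scalar) calibration token by gradient descent on the
# consistency loss || unwarp(model(warp(image), token)) - target ||^2.
token, lr = 5.0, 0.4
for _ in range(50):
    pred = unwarp(frozen_depth_model(warp(image), token))
    grad = 2.0 * np.mean(pred - target_depth)        # d(loss)/d(token)
    token -= lr * grad
# token converges toward 0: predictions on warped inputs now agree with
# the un-warped originals, and no real fisheye photo was ever needed.
```

The frozen model never changes; only the token is nudged until the round trip (warp, predict, un-warp) reproduces what the model says about the undistorted image.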
Why This is a Big Deal
- One Size Fits All: They only had to train one single set of bookmarks (tokens). These same tokens work for indoor rooms, outdoor streets, and even different types of fisheye cameras. You don't need a new model for every camera.
- Lightweight: The bookmarks are tiny. Adding them to Deep's brain adds almost no extra weight or speed cost. It's like adding a sticky note to a book; the book doesn't get heavier.
- No "Ironing" Needed: Because the bookmark adjusts the thinking process, not the image, the original photo stays perfect. No blurry edges, no lost details.
- Backwards Compatible: If you take the bookmark out, Deep goes back to being a normal expert on regular photos. It doesn't break anything.
The Bottom Line
This paper is like giving a super-smart architect a pair of smart glasses (the Calibration Tokens) that instantly correct their vision when they look at a warped, fun-house mirror. Instead of rebuilding the architect or trying to straighten the mirror, they just give the architect the right tools to understand the distortion.
Now, the same AI that can navigate a city using a standard camera can also navigate a car using a wide-angle fisheye camera, without needing to be retrained from scratch. It's a simple, elegant, and highly efficient solution to a very messy problem.