Imagine you are looking at a photograph and trying to guess how far away everything is. This is called depth estimation. It's easy for us humans because our brains are wired for it, but it's incredibly hard for computers.
The main problem computers face is the "Scale Problem."
Think of a toy car and a real car. If you take a picture of a toy car on a table, it might look exactly the same size as a real car parked 100 meters away. Without knowing the context, a computer doesn't know if it's looking at a tiny toy close up or a giant car far away.
Most existing AI models are like students who only studied one specific textbook. If they learned on "Indoor" photos, they get confused when shown an "Outdoor" photo, and vice versa. They struggle to generalize.
Enter "ScaleDepth": The Smart Architect.
The authors of this paper, Ruijie Zhu and his team, built a new AI called ScaleDepth. Instead of trying to guess the exact distance of every pixel in one giant, confusing leap, they broke the problem down into two simpler jobs.
Here is how it works, using a creative analogy:
1. The Two-Step Dance: "The Ruler" and "The Map"
Imagine you are trying to draw a map of a city, but you don't know the scale (is 1 inch on the map equal to 1 mile or 1 foot?).
Step A: The Ruler (Scale Prediction)
First, the AI looks at the whole picture and asks, "What kind of world is this? Is this a tiny kitchen or a vast canyon?" It uses a special module called SASP (Semantic-Aware Scale Prediction).
- How it works: It looks at the "vibe" of the image. Is there a bed? (That's a bedroom, usually small.) Is there a highway? (That's outdoors, usually huge.) It uses a pre-trained "brain" (CLIP) that understands both text and images to guess the size of the world. It essentially picks up a ruler and decides, "Okay, this scene is about X meters deep."
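To make the idea concrete, here is a minimal sketch of semantic scale prediction. Everything here is illustrative, not the paper's actual method: the scene prompts, the scale values in `SCENE_PRIORS`, the function `predict_scale`, and the temperature are all made-up placeholders, and random vectors stand in for a real CLIP-style encoder.

```python
import numpy as np

# Hypothetical scene prompts paired with typical scene depths in meters.
# Neither the prompt list nor the scale values come from the paper.
SCENE_PRIORS = {
    "a photo of a bedroom": 5.0,
    "a photo of a kitchen": 4.0,
    "a photo of a street": 80.0,
    "a photo of a canyon": 300.0,
}

def predict_scale(image_emb: np.ndarray, text_embs: np.ndarray,
                  scales: np.ndarray) -> float:
    """Blend per-scene scale priors by image-text similarity.

    image_emb: (D,) unit-norm image embedding (CLIP-like encoder output)
    text_embs: (K, D) unit-norm text embeddings for the K scene prompts
    scales:    (K,) typical scene depth for each prompt, in meters
    """
    sims = text_embs @ image_emb       # cosine similarity (unit vectors)
    weights = np.exp(sims * 10)        # temperature-sharpened softmax
    weights /= weights.sum()
    return float(weights @ scales)     # similarity-weighted scale

# Toy example: random "embeddings" stand in for a real encoder.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(4, 8))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
image_emb = text_embs[2] + 0.1 * rng.normal(size=8)  # "looks like a street"
image_emb /= np.linalg.norm(image_emb)
scales = np.array(list(SCENE_PRIORS.values()))
print(predict_scale(image_emb, text_embs, scales))  # weighted toward the street prior
```

The point of the sketch is the shape of the computation: similarity between the image and a set of text descriptions decides which "size of world" prior dominates, so the model never needs a hand-set indoor/outdoor switch.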
Step B: The Map (Relative Depth Estimation)
Once the AI knows the size of the world, it doesn't need to guess exact distances anymore. It just needs to figure out the shape: "Is the chair in front of the table? Is the tree behind the house?"
- How it works: This is handled by the ARDE (Adaptive Relative Depth Estimation) module. It produces a "relative map" where everything is normalized to the range 0 to 1. It doesn't care whether an object is 2 meters or 200 meters away; it only cares about the ordering of things.
The Magic Trick: Finally, the AI multiplies the Ruler (Scale) by the Map (Relative Depth).
Scale × Relative Shape = Real Metric Depth.
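The two steps above can be sketched in a few lines. This is a simplification, not the paper's implementation: the real model predicts the relative map with a neural network, while here a min-max normalization of a toy array stands in for it, and the helper names are invented.

```python
import numpy as np

def to_relative(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize a depth map to [0, 1]: ordering only, no units."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min)

def to_metric(relative: np.ndarray, scale: float) -> np.ndarray:
    """The 'magic trick': scale (meters) x relative shape = metric depth."""
    return scale * relative

# Toy 2x2 "depth map" in arbitrary units.
raw = np.array([[1.0, 2.0], [3.0, 5.0]])
rel = to_relative(raw)               # values in [0, 1], ordering preserved
metric = to_metric(rel, scale=4.0)   # same shape, now expressed in meters
print(rel)     # [[0.   0.25] [0.5  1.  ]]
print(metric)  # [[0. 1.] [2. 4.]]
```

Notice that the relative map is identical whether the scene is a dollhouse or a canyon; only the single scale number changes, which is exactly why the decomposition generalizes across scenes.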
2. Why is this better than the old way?
- Old Way (The "One-Size-Fits-All" Hat): Previous models tried to wear one hat that fit both a dollhouse and a skyscraper. It never fit perfectly. They often had to be retrained or have their settings manually adjusted when switching from indoors to outdoors.
- ScaleDepth (The Chameleon): This model is flexible. It can look at a photo of a kitchen, realize "Ah, this is small," and adjust its ruler. Then it looks at a photo of a mountain, realizes "Ah, this is huge," and stretches its ruler. It does this automatically without needing to be retrained or told what the scene is.
3. The "Secret Sauce": Text and Image Friendship
The paper uses a clever trick involving text.
Imagine the AI is looking at a picture of a "living room." Instead of just looking at pixels, it whispers to itself, "This looks like a photo of a living room."
It uses a massive database of text-image connections (called CLIP) to understand the meaning of the scene.
- If the AI sees a "kitchen," it knows kitchens are usually small.
- If it sees a "forest," it knows forests are vast.
By combining what it sees (the structure of the room) with what it knows (the text label of the room), it can predict the scale with incredible accuracy, even for scenes it has never seen before.
4. The Results: A Swiss Army Knife
The researchers tested this on:
- Indoors: Bedrooms, kitchens, offices.
- Outdoors: Streets, mountains, parks.
- Unseen: Things the AI was never trained on (like a specific type of palace or a construction site).
The verdict? ScaleDepth beat the current "champions" of the field. It didn't just predict depth more accurately; it did so with fewer parameters and less compute than competing models. It showed that by splitting the problem into "How big is the world?" and "What does the shape look like?", you can solve the depth estimation puzzle much more effectively.
Summary in a Nutshell
ScaleDepth is like a smart photographer who, before taking a photo, first guesses the size of the room (Scale) and then sketches the layout of the furniture (Relative Depth). By doing these two things separately and then combining them, it can create a perfect 3D model of the world, whether it's a tiny dollhouse or a massive canyon, without needing a manual instruction book for every new scene.