UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

Imagine you are a robot trying to navigate a new city. You have a camera, but it's like looking at the world through a foggy, distorted window. You can see buildings and trees, but you have no idea how far away they are or how big they actually are. Is that car 10 feet away or 100? Is that building a toy model or a skyscraper?

This is the problem UniScale solves. It's a new "brain" for robots that lets them understand the 3D world in real, measurable sizes, even when they don't have perfect information.

Here is how it works, using some everyday analogies:

1. The Problem: The "Flat Map" vs. The "Real World"

Most current AI models that look at photos are like artists sketching a map. They are great at drawing the shape of things (this is a tree, that is a road), but they are terrible at knowing the scale. They might draw a tree that looks like a giant oak, but in their "mind," it could be the size of a potted plant.

For a robot, this is dangerous. If a robot thinks a wall is a toy, it might crash into it. If it thinks a gap is a bridge, it might fall.

2. The Solution: The "Universal Translator"

UniScale is like a universal translator that can speak two languages at once:

Language A: "What does this look like?" (The shape and depth).
Language B: "How big is this in real life?" (The metric scale).

It takes a sequence of photos and instantly builds a 3D model where a car is actually 4.5 meters long, not just "a car-shaped blob."

3. The Secret Sauce: "Prior Injection" (The Cheat Sheet)

Sometimes, the robot does know some things. Maybe it knows its own camera settings (how wide the lens is) or it has a GPS that tells it where it is.

Old AI: Tries to guess everything from scratch, ignoring the cheat sheet.
UniScale: Has a smart librarian. When the robot says, "Hey, I know my camera lens is 50mm," UniScale doesn't just shove that fact into the brain randomly. It has a specific shelf for "Camera Info" and a specific shelf for "Robot Position." It takes the new info and places it exactly where it helps the most.

This is called Semantic-Aware Prior Injection. Think of it like a chef who doesn't just throw all ingredients into a blender. If you give them salt, they put it in the soup. If you give them a spice, they put it in the rub. UniScale knows exactly where to put the extra information to make the final dish (the 3D map) taste perfect.

4. The "Scale Head": The Ruler in the Brain

The most unique part of UniScale is a specific module called the Metric-Scale Head.

Imagine you are building a LEGO castle. Most AI models build a castle that looks perfect from the front, but if you measure it, the bricks are the wrong size.
UniScale has a specialized ruler attached to its brain. While it's building the castle, this ruler constantly checks: "Wait, if this door is 2 meters high, then this whole tower must be 10 meters tall."

It uses clues from the whole picture (the "global context") and the specific camera data to lock in the real-world size. It doesn't just guess; it calculates the "real-world size" as a separate, dedicated task.

5. Why This Matters for Robots

No Need to Start Over: You don't need to teach UniScale everything from scratch. It's like taking a smart student who already knows how to draw (a pre-trained model) and giving them a quick lesson on "how to measure." It learns fast and uses less computer power.
Flexible: Sometimes a robot has perfect GPS data; sometimes it has nothing but a camera. UniScale works in both scenarios. If the GPS data is there, UniScale uses it to sharpen its estimate. If not, it falls back on its "ruler" to make a very good guess.
Safe Navigation: Because it knows the actual size of things, a robot can safely navigate a warehouse, a hospital, or a forest without bumping into things or falling off cliffs.

In a Nutshell

UniScale is a robot's new superpower. It turns a flat, confusing stream of images into a measurable, 3D world where distances and sizes are real. It does this by being smart about how it uses extra information (like camera settings) and by having a dedicated "ruler" in its brain to ensure everything is built to the correct scale. It's the difference between a robot that sees a wall and a robot that knows exactly how far away that wall is.

1. Problem Statement

Robotic perception relies heavily on accurate 3D scene reconstruction for tasks like navigation, mapping, and interaction. While recent learning-based multi-view methods (e.g., VGGT, DUSt3R) have shown impressive performance using raw images, they face three critical limitations in real-world robotic deployment:

Scale Ambiguity: Most unified models produce scale-invariant or affine-invariant outputs, failing to recover the absolute metric scale of a scene, which is essential for physical interaction.
Rigid Architectures: Existing methods often struggle to flexibly incorporate known geometric priors (such as camera intrinsics and poses) when they are available, which is common in robotic systems.
Computational Cost: Training models from scratch to handle metric scale and diverse priors is computationally expensive and often requires large-scale retraining.

The goal is to develop a unified, feed-forward framework that jointly estimates camera parameters, depth, point maps, and metric scale, while allowing for the optional injection of geometric priors without requiring training from scratch.

2. Methodology: UniScale Architecture

UniScale is built upon the VGGT (Visual Geometry Grounded Transformer) backbone, extending it with a modular design to support metric reconstruction and prior injection.

A. Core Architecture

Backbone: Uses a large-scale transformer (DINOv2 based) to extract features. Images are patchified into tokens, concatenated with learnable camera tokens (for intrinsic/extrinsic estimation) and register tokens (for stability).
Aggregator: A global attention module processes cross-frame interactions, while a frame-level attention module handles intra-frame dependencies. This produces aggregated patch tokens and processed camera tokens.
Prediction Heads:
- Camera Head: Predicts intrinsics and extrinsics (rotation/translation).
- Dense Prediction Head: Uses a DPT head to predict scale-invariant depth maps and 3D point clouds.
- Metric-Scale Head: A dedicated module that estimates the absolute scene scale ($S$).
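
The per-frame token layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: a random linear projection (`proj`) stands in for the DINOv2 encoder, the patch size of 14 matches DINOv2, and the counts of one camera token and four register tokens per frame are assumed for illustration. The function names are hypothetical.

```python
import numpy as np

def patchify(image, patch=14):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def build_frame_tokens(image, proj, camera_token, register_tokens, patch=14):
    """Concatenate camera, register, and patch tokens for one frame.

    `proj` is a stand-in (assumption) for the feature extractor; the real
    model uses a DINOv2-based transformer to produce patch features.
    """
    patch_tokens = patchify(image, patch) @ proj        # (P, D)
    return np.concatenate([camera_token, register_tokens, patch_tokens], axis=0)

rng = np.random.default_rng(0)
dim = 8                                          # toy embedding width
image = rng.random((28, 42, 3))                  # 2 x 3 grid of 14-pixel patches
proj = rng.standard_normal((14 * 14 * 3, dim))
camera_token = rng.standard_normal((1, dim))     # assumed: 1 camera token per frame
register_tokens = rng.standard_normal((4, dim))  # assumed: 4 register tokens per frame
tokens = build_frame_tokens(image, proj, camera_token, register_tokens)
print(tokens.shape)
```

The aggregator would then alternate attention within each frame's token sequence and across the sequences of all frames.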

B. Key Innovations

1. Semantic-Aware Prior Injection
Unlike previous methods that uniformly inject priors, UniScale employs a semantic-aware routing mechanism:

Pose Encoder: Encodes camera extrinsics (using a continuous 6D rotation representation to avoid quaternion discontinuities) and injects them specifically into camera tokens and the scale head.
Intrinsics Encoder: Encodes camera intrinsics as origin-free ray maps and injects them into patch tokens.
Benefit: This ensures geometric cues are routed to the most semantically relevant parts of the network, minimizing noise and improving convergence.
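
To make the two encodings concrete, here is a hedged NumPy sketch of the inputs they start from. The 6D rotation mapping follows the standard Gram-Schmidt construction for continuous rotation representations, and the ray map drops the ray origin and keeps only unit directions per pixel. The function names, the half-pixel offset, and the pinhole model without distortion are assumptions for illustration; the learned encoders that turn these quantities into embeddings are not shown.

```python
import numpy as np

def matrix_to_rotation_6d(R):
    """Keep the first two columns of a rotation matrix as a 6D vector."""
    return np.concatenate([R[:, 0], R[:, 1]])

def rotation_6d_to_matrix(d6):
    """Recover a rotation matrix from a 6D vector via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                    # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)    # basis vectors as columns

def ray_map(K, height, width):
    """Per-pixel unit ray directions in the camera frame (no ray origins)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous
    rays = pixels @ np.linalg.inv(K).T                    # back-project with K^-1
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

R = rotation_6d_to_matrix(np.array([1.0, 2.0, 3.0, -1.0, 0.5, 2.0]))
K = np.array([[100.0, 0.0, 1.5],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
rays = ray_map(K, height=3, width=4)
```

Any 6D vector with linearly independent halves maps to a valid rotation, which is what makes the representation continuous and convenient to regress.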

2. Dedicated Metric-Scale Head
To overcome the scale invariance of the backbone, UniScale introduces a specialized head that predicts the global scene scale ($S$).

Input Fusion: It fuses three sources of information:
1. Class Tokens: Capture high-level global context.
2. Camera Tokens: Encode camera geometry.
3. Aggregated Patch Tokens: Capture inter-frame relationships.
Priors Integration: If available, pose and ray embeddings are integrated directly into this head to refine the scale estimate.
Output: The final metric depth and point clouds are obtained by multiplying the scale-invariant predictions by the predicted scale factor $S$.
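
The fusion and rescaling steps can be sketched as below. This is a toy stand-in rather than the actual head: mean pooling, a single linear layer (`w`, `b`), and predicting $\log S$ before exponentiating are all assumptions chosen to keep the example short, and the function names are hypothetical. Only the final multiplication by $S$ is taken directly from the description above.

```python
import numpy as np

def predict_scale(class_tokens, camera_tokens, patch_tokens, w, b):
    """Fuse the three token sources and regress one positive scene scale.

    class_tokens:  (N, D)    one per frame
    camera_tokens: (N, D)    one per frame
    patch_tokens:  (N, P, D) aggregated patch tokens
    w, b: weights of a hypothetical linear layer that outputs log(S).
    """
    fused = np.concatenate([
        class_tokens.mean(axis=0),
        camera_tokens.mean(axis=0),
        patch_tokens.mean(axis=(0, 1)),
    ])                                    # (3 * D,)
    log_scale = float(fused @ w + b)
    return float(np.exp(log_scale))       # exp keeps the scale positive

def to_metric(depth_si, points_si, scale):
    """Convert scale-invariant predictions to metric units."""
    return scale * depth_si, scale * points_si

rng = np.random.default_rng(0)
n_frames, n_patches, dim = 2, 6, 8
scale = predict_scale(
    rng.standard_normal((n_frames, dim)),
    rng.standard_normal((n_frames, dim)),
    rng.standard_normal((n_frames, n_patches, dim)),
    w=np.zeros(3 * dim),                  # zero weights: output depends only on b
    b=np.log(2.0),
)
depth_m, points_m = to_metric(np.ones((4, 4)), np.ones((4, 4, 3)), scale)
```

Because a single scalar multiplies both outputs, the relative geometry predicted by the backbone is left unchanged; only its units are fixed.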

C. Training Strategy

Fine-Tuning: UniScale does not train from scratch. It initializes with pre-trained VGGT and DINOv2 weights, making it resource-efficient.
Probabilistic Prior Injection: During training, priors (intrinsics/poses) are injected with 50% probability and masked otherwise, so the model stays robust to missing inputs.
Loss Functions: A multi-task loss combines a camera loss (Huber), a depth/point-map loss (weighted by aleatoric uncertainty), and a logarithmic scale loss ($\ell_2$ on the log-scale) that copes with scene scales spanning several orders of magnitude.
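
Two of these training ingredients are simple enough to sketch. The snippet below is an illustration under assumptions, not the authors' training code: it masks intrinsics and poses independently (the source only states the 50% probability), and it implements the scale loss as a mean squared difference of logarithms with a small `eps` for numerical safety. Function names are hypothetical.

```python
import numpy as np

def sample_priors(intrinsics, poses, rng, p_inject=0.5):
    """Randomly withhold each prior so the model also sees prior-free inputs.

    Assumption: the two priors are masked independently of each other.
    A withheld prior is returned as None.
    """
    use_intrinsics = rng.random() < p_inject
    use_poses = rng.random() < p_inject
    return (intrinsics if use_intrinsics else None,
            poses if use_poses else None)

def log_scale_loss(pred_scale, gt_scale, eps=1e-8):
    """Squared error between log-scales, averaged over the batch."""
    diff = np.log(pred_scale + eps) - np.log(gt_scale + eps)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
K, T = np.eye(3), np.eye(4)               # placeholder intrinsics and pose
draws = [sample_priors(K, T, rng) for _ in range(10_000)]
frac_intrinsics = np.mean([d[0] is not None for d in draws])

# The same 10x overestimate costs the same for a small room and a city block.
loss_room = log_scale_loss(np.array([10.0]), np.array([1.0]))
loss_city = log_scale_loss(np.array([1000.0]), np.array([100.0]))
```

Working in log space penalizes the ratio between predicted and true scale, so small indoor scenes and large outdoor scenes contribute comparably to the loss.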

3. Key Contributions

Unified Framework: A single feed-forward model that jointly performs camera calibration, depth estimation, point cloud generation, and metric scale recovery.
Modular Metric-Scale Head: A novel component that recovers real-world scale by fine-tuning globally learned features, overcoming the limitations of scale-invariant predecessors like VGGT.
Semantic-Aware Prior Injection: A mechanism that routes specific geometric priors (poses to camera tokens, intrinsics to patch tokens) based on semantic roles, improving robustness and accuracy.
Efficiency & Modularity: The model leverages pre-existing world priors from foundation models, avoiding the need for expensive training from scratch, and can be seamlessly integrated into other robotic perception pipelines.

4. Results and Benchmarks

UniScale was evaluated on multiple benchmarks, including Robust-MVD (KITTI, ScanNet), ETH3D, and ScanNet++.

Metric Depth & Reconstruction:
- On the Robust-MVD benchmark, UniScale achieves state-of-the-art (SOTA) performance in multi-view metric prediction, outperforming MASt3R, MUSt3R, and MapAnything in several configurations (e.g., image-only and image+intrinsic settings).
- It demonstrates superior depth estimation accuracy (lower relative error, rel) compared to specialized methods.
Prior Injection Performance:
- When camera intrinsics and poses are provided, UniScale leverages them effectively, achieving SOTA results in median-aligned evaluations.
- The ablation studies confirm that injecting priors directly into the scale head is critical; removing this injection causes a significant performance drop.
Generalization:
- The model generalizes well to "in-the-wild" datasets (EuRoC, TUM RGBD, Oxford Spires), producing geometrically coherent reconstructions in diverse indoor and outdoor environments.
Ablation Insights:
- 6D vs. Quaternion: The 6D rotation representation outperforms quaternions in multi-view settings with many views ($N \ge 8$) due to better continuity and optimization stability.
- Token Importance: Removing camera tokens or class tokens significantly degrades performance, validating the need for multi-source feature fusion in the scale head.
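
For readers unfamiliar with the metrics mentioned in these results, the sketch below shows the usual definition of the absolute relative depth error and of median alignment. It assumes the common convention that pixels with non-positive ground truth are invalid; the benchmarks' exact masking and reporting conventions may differ, and the function names are hypothetical.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Mean absolute relative depth error over valid pixels."""
    if valid is None:
        valid = gt > 0                    # assumed validity convention
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def median_align(pred, gt, valid=None):
    """Rescale a prediction so its median matches the ground-truth median."""
    if valid is None:
        valid = gt > 0
    return pred * (np.median(gt[valid]) / np.median(pred[valid]))

gt = np.array([1.0, 2.0, 4.0, 0.0])       # last pixel has no ground truth
pred = np.array([0.5, 1.0, 2.0, 0.3])     # correct shape, half the true scale

metric_error = abs_rel(pred, gt)                       # penalizes the wrong scale
aligned_error = abs_rel(median_align(pred, gt), gt)    # ignores the global scale
```

A metric evaluation scores the raw prediction, so a global scale mistake shows up directly in the error, whereas a median-aligned evaluation measures only the quality of the relative geometry.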

5. Significance

UniScale represents a significant step forward for robotic perception by bridging the gap between powerful foundation models and practical deployment requirements:

Metric Awareness: It solves the "scale ambiguity" problem, enabling robots to interact with the physical world using real-world measurements rather than relative geometry.
Flexibility: It operates effectively whether geometric priors are available (common in calibrated robots) or absent (unstructured environments), adapting its behavior dynamically.
Resource Efficiency: By fine-tuning pre-trained models rather than training from scratch, it offers a practical solution for robotic teams with limited computational resources.
Modularity: Its design allows it to serve as a drop-in upgrade for existing normalized reconstruction systems, transforming them into metric-aware tools.

In summary, UniScale provides a robust, unified, and efficient solution for 3D reconstruction that is specifically tailored to the needs of real-world robotic applications.