Any Resolution Any Geometry: From Multi-View To Multi-Patch

The paper proposes the Ultra Resolution Geometry Transformer (URGT), a unified multi-patch framework that leverages cross-patch attention and a GridMix sampling strategy to achieve state-of-the-art, high-resolution monocular depth and normal estimation with enhanced global consistency and generalization.

Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka

Published 2026-03-04

Imagine you are trying to draw an incredibly detailed, high-resolution map of a city from a single, blurry photograph. You need to know not just where the buildings are (depth), but also which way the roofs and walls are facing (surface normals).

The problem is that most AI models today are like amateur cartographers. They are great at seeing the big picture (the general shape of the city), but when they try to zoom in to draw the tiny details (like a single brick or a thin wire fence), they get confused. They either blur the details to keep the whole map consistent, or they draw the details so sharply that the map falls apart and looks like a patchwork quilt with mismatched edges.

This paper introduces a new AI model called URGT (Ultra Resolution Geometry Transformer) that solves this problem. Here is how it works, using simple analogies:

1. The "Jigsaw Puzzle" Strategy (Multi-Patch)

Instead of trying to look at the whole 8K (super high-definition) image at once—which would overwhelm the computer's memory—the new method cuts the image into many smaller pieces, like a jigsaw puzzle.

  • Old Way: Previous methods would solve each puzzle piece in isolation. Piece A would be perfect, and Piece B would be perfect, but when you put them together, the edges wouldn't match. It looked like a messy collage.
  • The New Way (URGT): This model cuts the image into pieces, but then puts all the pieces on a giant table and lets the "detectives" (the AI) talk to each other. It uses a special Cross-Patch Attention mechanism. This is like having a team of experts where the person working on the "roof" piece can instantly ask the person working on the "wall" piece, "Hey, does this angle make sense?" This ensures that the whole picture stays consistent, even while fixing tiny details.
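To make the "detectives talking to each other" concrete, here is a minimal sketch of the cross-patch idea: token sequences from all patches are flattened into one global sequence before attention, so a token in one patch can attend to tokens in every other patch. This is an illustrative toy (identity projections, single head, no learned weights), not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_patch_attention(patch_tokens):
    # patch_tokens: (num_patches, tokens_per_patch, dim)
    # Flatten all patches into ONE global token sequence, so every token
    # can attend to tokens in other patches, not just its own patch.
    p, t, d = patch_tokens.shape
    tokens = patch_tokens.reshape(p * t, d)
    q = k = v = tokens  # identity projections for this sketch
    attn = softmax(q @ k.T / np.sqrt(d))
    out = attn @ v
    return out.reshape(p, t, d)

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 16, 8))  # 4 patches, 16 tokens each, dim 8
out = cross_patch_attention(patches)
print(out.shape)  # → (4, 16, 8)
```

Per-patch attention would compute a separate (16, 16) attention map inside each patch; the joint (64, 64) map is what lets edge information flow between neighboring pieces.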

2. The "Rough Draft" Assistant (Coarse Priors)

Before the AI starts its detailed work, it doesn't start from scratch. It asks a "Rough Draft" assistant (existing, simpler AI models like Depth Anything V2) to give it a low-quality, blurry sketch of the depth and angles.

  • The Analogy: Imagine you are an artist. Instead of staring at a blank canvas, you are given a faint pencil sketch. Your job isn't to draw the whole thing from memory; your job is to refine that sketch. You take the rough lines and turn them into sharp, crisp, high-definition art. URGT takes these rough sketches and polishes them into 8K perfection.
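A rough sketch of this refine-the-draft pattern, under the assumption that the coarse model's output is simply upsampled and the refiner predicts a residual correction on top of it (the hypothetical `refine_fn` stands in for the actual network):

```python
import numpy as np

def upsample_nearest(x, factor):
    # Blow the low-res prior up to full resolution (nearest-neighbor).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def refine_with_prior(image_hr, coarse_depth, refine_fn):
    # coarse_depth: low-res "pencil sketch" from an off-the-shelf model
    # (e.g. Depth Anything V2). The refiner only has to predict a
    # residual correction, not the depth map from scratch.
    factor = image_hr.shape[0] // coarse_depth.shape[0]
    prior = upsample_nearest(coarse_depth, factor)
    residual = refine_fn(image_hr, prior)
    return prior + residual

rng = np.random.default_rng(1)
image = rng.standard_normal((8, 8))    # toy high-res image
coarse = rng.standard_normal((4, 4))   # toy low-res depth prior
refined = refine_with_prior(image, coarse,
                            lambda img, prior: 0.1 * (img - prior))
print(refined.shape)  # → (8, 8)
```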

3. The "Mix-and-Match" Training (GridMix)

One of the biggest challenges in AI is that high-resolution data is rare. To teach the model how to handle any size image, the researchers invented a training trick called GridMix.

  • The Analogy: Imagine you are training a chef to cook a banquet. Instead of always arranging the ingredients in the same 4x4 grid, you change the grid layout every day: sometimes one giant ingredient, sometimes a 2x2 grid, sometimes a 4x4 grid.
  • By randomly shuffling how the image is sliced up during training, the AI learns to be flexible. It doesn't get stuck on one specific pattern. This allows it to handle any resolution, from a small phone photo to a massive 8K billboard, without needing to be retrained for each size.
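The sampling trick above can be sketched in a few lines: pick a random grid size each training step and cut the image into that many tiles. The grid options and box format here are illustrative assumptions, not the paper's exact configuration.

```python
import random

def gridmix_split(h, w, grid_options=(1, 2, 4)):
    # Each training step, randomly pick a grid size so the model sees
    # 1x1 (whole image), 2x2, or 4x4 patch layouts over time.
    g = random.choice(grid_options)
    ph, pw = h // g, w // g
    boxes = [(i * ph, j * pw, ph, pw)   # (top, left, height, width)
             for i in range(g) for j in range(g)]
    return g, boxes

random.seed(0)
g, boxes = gridmix_split(512, 512)
print(g, len(boxes))  # grid size g, and g*g tiles covering the image
```

Because the patch count changes from step to step, the model never overfits to one slicing pattern, which is what lets it scale from phone photos to 8K images at test time.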

4. The "Global GPS" (Global Positional Encoding)

When the AI looks at a puzzle piece, it needs to know exactly where that piece belongs in the whole image.

  • The Problem: If you just look at a piece of a wall, you don't know if it's the top of the wall or the bottom.
  • The Solution: URGT gives every single pixel a Global GPS coordinate. Even though the AI is looking at a tiny 100x100 pixel square, it knows, "Ah, this square is at the top-left corner of the entire 8K image." This prevents the AI from getting lost and ensures that the "left side" of the image connects perfectly to the "right side."
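One simple way to realize such a "Global GPS" (a sketch of the idea, not the paper's exact encoding) is to give each pixel in a crop its coordinates normalized by the full image size, so the same crop content at the top-left and bottom-right gets different position signals:

```python
import numpy as np

def global_coords(top, left, ph, pw, H, W):
    # Each pixel in a (ph, pw) crop gets its (y, x) position normalized
    # by the FULL image size (H, W), so the network knows where the
    # patch sits in the whole picture, not just within itself.
    ys = (top + np.arange(ph)) / (H - 1)
    xs = (left + np.arange(pw)) / (W - 1)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1)  # shape (ph, pw, 2), values in [0, 1]

tl = global_coords(0, 0, 4, 4, 8, 8)   # top-left patch of an 8x8 image
br = global_coords(4, 4, 4, 4, 8, 8)   # bottom-right patch
print(tl[0, 0], br[-1, -1])  # → [0. 0.] [1. 1.]
```

Two identical-looking wall crops now carry different coordinates, which is exactly the cue that resolves the "top of the wall or bottom of the wall" ambiguity.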

Why Does This Matter?

The results are impressive. When tested on a massive 8K image (like a high-end movie poster), this new method:

  • Recovers Thin Structures: It can draw thin power lines, fence wires, and hair strands that other models miss or blur out.
  • Keeps Edges Sharp: The boundaries between objects are crisp, not fuzzy.
  • Works Everywhere: It works on real-world photos (like a street scene) even though it was mostly trained on computer-generated images.

In summary: URGT is like a master architect who takes a rough sketch, breaks the building down into manageable rooms, has the architects in each room talk to their neighbors to ensure the walls align perfectly, and then stitches it all back together into a flawless, high-definition 3D map. It's the first method that can handle "Any Resolution" and "Any Geometry" without losing its mind.