ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Imagine you have a very smart, well-trained artist named Depth Anything V2 (DA-V2). This artist is incredible at looking at a flat photograph and guessing how far away every object is. They can tell you that a car is far away and a person is close.

However, this artist has a few quirks:

They get confused by new styles: If they were trained mostly on pictures of dogs, and you show them a tiger, they might accidentally draw the tiger looking a bit like a dog.
They miss the fine details: Sometimes their depth map is a bit "blurry" or noisy, like a sketch that needs more shading to look real.

Re-Depth Anything is like a smart editor who steps in after the artist finishes their sketch, just before you show it to the world. Instead of asking the artist to start over (which takes too long and might ruin their style), the editor uses a special trick called "Re-lighting" to fix the mistakes instantly.

Here is how the magic happens, broken down into simple steps:

1. The "What-If" Game (Re-lighting)

Imagine the artist has drawn a 3D shape based on the photo. The editor takes this shape and says, "Let's pretend the sun is shining from the left, then from the right, then from above."

The editor shines virtual light on the 3D shape. If the shape is correct, the shadows will look natural. If the shape is wrong (like that tiger looking like a dog), the shadows will look weird and fake.

2. The "Art Critic" (The Diffusion Model)

Now, the editor brings in a super-smart Art Critic (this is the "2D diffusion model" mentioned in the paper). This critic has seen millions of photos of real tigers, cars, and faces.

The editor shows the critic the "re-lit" image (the photo with the new shadows).

The Critic says: "Whoa, that shadow on the nose looks fake! Real tigers don't cast shadows like that. The shape is wrong."
The Editor listens: The editor doesn't just guess; they use the Critic's feedback to nudge the artist's drawing.

3. The "Tweak, Don't Rewrite" Strategy

Here is the clever part. Usually, to fix a mistake, you might fire the artist and hire a new one, or make the artist redraw the whole picture from scratch. That's slow and risky.

Instead, Re-Depth Anything only tweaks the artist's internal notes (the "embeddings" and "decoder weights").

Think of it like an actor who memorized a script. If they stumble on a line, you don't replace them; you just whisper the correct line to them for this specific scene.
The artist keeps their general knowledge (they still know what a car is), but they adjust their specific guess for this photo to make the shadows look perfect.

4. The "Group Vote" (Ensembling)

Because the "Art Critic" is a bit random (it might say "fix the nose" one time and "fix the ear" the next), the editor runs this process 10 times with slightly different random lights. Then, they take the average of all 10 results. This ensures the final picture is stable and super sharp, removing any weird glitches.

Why is this a big deal?

No New Training: You don't need to teach the artist anything new. You just use the tools you already have.
Fixes "Out-of-Distribution" Errors: If the artist has never seen a specific type of object before, this method uses the "Art Critic's" general knowledge of the world to fix the shape.
Better Details: It turns a blurry, "dog-like" tiger into a sharp, realistic tiger with the right nose shape and fur texture, just by fixing how the light hits it.

In short: Re-Depth Anything is a test-time editor that uses a virtual light show and a super-smart art critic to polish a depth map, making it look more realistic and accurate without needing to retrain the original AI model. It's like taking a good sketch and adding the perfect lighting to make it look like a photograph.

Here is a detailed technical summary of the paper "Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting."

1. Problem Statement

Monocular Depth Estimation (MDE) remains a challenging task, particularly for out-of-distribution (OOD) images that differ significantly from the training data of foundation models like Depth Anything V2 (DA-V2).

The Limitation: While foundation models achieve state-of-the-art performance on standard benchmarks, they often struggle with real-world "in-the-wild" images, producing inaccurate geometries, biased shapes (e.g., mistaking a tiger for a dog due to training bias), or noisy flat surfaces.
The Gap: Existing Test-Time Adaptation (TTA) methods for single images often rely on scarce self-supervision cues (like temporal consistency in video) or require external priors (3D meshes). Purely self-supervised geometric reconstruction (e.g., Shape-from-Shading) is ill-posed and fails under complex real-world lighting and texture conditions.
Goal: To refine the depth predictions of a pre-trained feed-forward model (like DA-V2) for a specific input image without labeled data, correcting biases and enhancing fine details.

2. Methodology: Re-Depth Anything

The authors propose a test-time optimization framework that bridges the domain gap by fusing foundation models with 2D diffusion priors. Instead of optimizing the depth map directly or fine-tuning the entire network, the method uses a novel re-lighting strategy.

Core Workflow

Input & Initial Prediction: A single RGB image is fed into a pre-trained DA-V2 model to generate an initial disparity map ( $\hat{D}_{init}$ ).
Differentiable Re-lighting (Augmentation):
- The method converts the disparity map into a 3D mesh and computes surface normals.
- It synthesizes a re-illuminated image ( $\hat{I}$ ) by applying a Blinn-Phong shading model to the geometry.
- Crucially, the lighting conditions (direction, diffuse/specular intensity, exponent) are randomly sampled at each optimization step.
- The input image itself is used as a proxy for albedo (diffuse color), allowing the system to augment the image with new shading cues rather than attempting a full, physically accurate inverse rendering.
Self-Supervised Optimization via SDS:
- A 2D Diffusion Model (Stable Diffusion) acts as a prior to evaluate the "plausibility" of the re-lit image.
- The system uses Score Distillation Sampling (SDS) loss. The diffusion model scores how realistic the augmented shading looks given a text prompt (generated by BLIP-2 from the input image).
- This gradient is backpropagated to refine the depth geometry.
Targeted Optimization Strategy:
- To prevent optimization collapse and overfitting, the method does not optimize the depth tensor directly or fine-tune the entire DA-V2 network.
- Instead, it jointly optimizes:
  1. The intermediate feature embeddings ( $W$ ) fed into the DPT decoder (from the frozen ViT encoder).
  2. The weights of the DPT decoder ( $\theta$ ).
- This preserves the strong geometric knowledge encoded in the frozen encoder while allowing the decoder to adapt to the specific image content.
Ensembling: Due to the stochastic nature of SDS, the optimization is run multiple times (e.g., $N=10$ ) with different random seeds, and the final disparity map is obtained by averaging the results.

3. Key Contributions

Re-Depth Anything Framework: A novel test-time optimization method that adapts pre-trained depth models to real-world images using 2D diffusion priors on re-synthesized depth predictions, requiring no labeled data.
Single-Image Re-lighting: A differentiable rendering approach that links predicted depth to the input image via randomized shading. This replaces classical photometric reconstruction, avoiding the need to perfectly reconstruct appearance while still leveraging shading cues for geometry refinement.
Targeted Optimization Scheme: A strategy that optimizes only intermediate embeddings and decoder weights. This is shown to be critical for avoiding overfitting to image textures and preserving the global geometric structure learned during pre-training.
State-of-the-Art Performance: The method is validated on DA-V2 and extended to Depth Anything 3 (DA3), achieving new state-of-the-art results on multiple benchmarks.

4. Experimental Results

The method was evaluated on three standard benchmarks: CO3Dv2 (close-up objects), KITTI (autonomous driving), and ETH3D (indoor/outdoor scenes).

Quantitative Improvements:
- DA-V2 Baseline: Achieved consistent relative error reductions across all metrics. Notable gains include an 11.4% reduction in AbsRel on KITTI and 8.4% on ETH3D.
- DA3 Integration: When applied to the newer Depth Anything 3 (DA3), the method further improved performance, setting new state-of-the-art results. For example, on CO3D, it reduced Normal MSE by 14.65%.
Qualitative Improvements:
- Bias Correction: Successfully corrected shape biases (e.g., transforming a "dog-like" tiger prediction into a realistic tiger shape).
- Detail Enhancement: Added fine-grained details (e.g., threads on a ball, railing textures) and removed noise from flat surfaces.
- Robustness: Outperformed classical Shape-from-Shading (SfS) methods, which often fail due to strict albedo assumptions, and surpassed other TTA methods.

5. Significance and Impact

Bridging the Domain Gap: The paper demonstrates that foundation models can be effectively adapted to specific, unseen images at test time without retraining, significantly improving their robustness in real-world scenarios.
Generative Priors for Geometry: It establishes a new paradigm for using 2D generative models (diffusion) not just for image synthesis, but as a powerful supervisory signal for 3D geometric reasoning.
Efficiency: By freezing the encoder and only optimizing embeddings/decoder weights, the method is computationally efficient compared to full network fine-tuning or NeRF-based reconstruction.
Generalizability: The approach is model-agnostic, working effectively on different backbone architectures (DA-V2 and DA3), suggesting a general solution for refining monocular depth estimation.

In summary, Re-Depth Anything introduces a powerful, self-supervised mechanism to "polish" depth predictions by asking a generative AI model, "Does this 3D shape look realistic under random lighting?" and using that feedback to refine the geometry, effectively correcting errors and hallucinations in foundation models.

ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

1. The "What-If" Game (Re-lighting)

2. The "Art Critic" (The Diffusion Model)

3. The "Tweak, Don't Rewrite" Strategy

4. The "Group Vote" (Ensembling)

Why is this a big deal?

1. Problem Statement

2. Methodology: Re-Depth Anything

Core Workflow

3. Key Contributions

4. Experimental Results

5. Significance and Impact

More like this

Equitable Multi-Task Learning for AI-RANs

SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning

The Temporal Markov Transition Field

SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients

Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models