Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Re-Depth Anything is a test-time, self-supervised framework that refines monocular depth estimates by combining a depth foundation model with large-scale 2D diffusion priors. It performs label-free refinement via generative re-lighting and Score Distillation Sampling, reaching state-of-the-art results without optimizing the depth tensor directly.

Ananta R. Bhattarai, Helge Rhodin

Published Tue, 10 Ma

Imagine you have a very smart, well-trained artist named Depth Anything V2 (DA-V2). This artist is incredible at looking at a flat photograph and guessing how far away every object is. They can tell you that a car is far away and a person is close.

However, this artist has a few quirks:

  1. They get confused by new styles: If they were trained mostly on pictures of dogs, and you show them a tiger, they might accidentally draw the tiger looking a bit like a dog.
  2. They miss the fine details: Sometimes their depth map is a bit "blurry" or noisy, like a sketch that needs more shading to look real.

Re-Depth Anything is like a smart editor who steps in after the artist finishes their sketch, just before you show it to the world. Instead of asking the artist to start over (which takes too long and might ruin their style), the editor uses a special trick called "Re-lighting" to fix the mistakes instantly.

Here is how the magic happens, broken down into simple steps:

1. The "What-If" Game (Re-lighting)

Imagine the artist has drawn a 3D shape based on the photo. The editor takes this shape and says, "Let's pretend the sun is shining from the left, then from the right, then from above."

The editor shines virtual light on the 3D shape. If the shape is correct, the shadows will look natural. If the shape is wrong (like that tiger looking like a dog), the shadows will look weird and fake.
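The "virtual light" step can be sketched in a few lines. This is a minimal, illustrative version of re-lighting (not the paper's actual renderer): estimate surface normals from the depth map with finite differences, then apply Lambertian shading under a few hypothetical light directions.

```python
# Minimal re-lighting sketch (assumption: a simple Lambertian shading model,
# not the paper's exact renderer).
import numpy as np

def depth_to_normals(depth):
    """Approximate per-pixel surface normals from a depth map of shape (H, W)."""
    dz_dy, dz_dx = np.gradient(depth)  # finite-difference slopes along y and x
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

def relight(depth, light_dir):
    """Lambertian shading: brightness = max(0, normal . light)."""
    light = np.asarray(light_dir, dtype=float)
    light /= np.linalg.norm(light)
    return np.clip(depth_to_normals(depth) @ light, 0.0, 1.0)

# Shade a toy sloped plane under lights from the left, the right, and above.
depth = np.tile(np.linspace(0, 1, 8), (8, 1))  # plane sloping away to the right
for light in [(-1, 0, 1), (1, 0, 1), (0, 0, 1)]:
    shaded = relight(depth, light)
```

If the depth map captures the true geometry, the shading looks plausible under every light; a wrong shape produces shadows the critic can flag.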

2. The "Art Critic" (The Diffusion Model)

Now, the editor brings in a super-smart Art Critic (this is the "2D diffusion model" mentioned in the paper). This critic has seen millions of photos of real tigers, cars, and faces.

The editor shows the critic the "re-lit" image (the photo with the new shadows).

  • The Critic says: "Whoa, that shadow on the nose looks fake! Real tigers don't cast shadows like that. The shape is wrong."
  • The Editor listens: The editor doesn't just guess; they use the Critic's feedback to nudge the artist's drawing.
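The critic's feedback loop is Score Distillation Sampling: add noise to the re-lit image, ask the diffusion model to predict that noise, and use the gap between prediction and true noise as a gradient. The toy below replaces the real pretrained diffusion model with a stand-in `toy_denoiser` whose "prior" is a flat gray image, so the loop runs end to end; the structure of the update is the point, not the prior.

```python
# Toy sketch of one SDS step. The real critic is a pretrained 2D diffusion
# model; `toy_denoiser` is a stand-in whose learned prior is "mid-gray images".
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy, t):
    """Stand-in for the diffusion model's noise prediction eps_hat(x_t, t)."""
    target = np.full_like(noisy, 0.5)  # toy prior: everything should be mid-gray
    return noisy - target              # "predicted noise" pulling toward the prior

def sds_step(image, lr=0.1, t=0.5):
    """Nudge `image` with the SDS gradient: predicted noise minus true noise."""
    eps = rng.standard_normal(image.shape)
    noisy = image + t * eps                  # simplified forward (noising) process
    grad = toy_denoiser(noisy, t) - t * eps  # critic feedback, true noise removed
    return image - lr * grad

image = rng.random((8, 8))
for _ in range(50):
    image = sds_step(image)
# After many steps the image has drifted toward the critic's prior.
```

In Re-Depth Anything this gradient does not edit the image directly; it flows back through the re-lighting step into the depth model's parameters, as described next.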

3. The "Tweak, Don't Rewrite" Strategy

Here is the clever part. Usually, to fix a mistake, you might fire the artist and hire a new one, or make the artist redraw the whole picture from scratch. That's slow and risky.

Instead, Re-Depth Anything only tweaks the artist's internal notes (the "embeddings" and "decoder weights").

  • Think of it like an actor who memorized a script. If they stumble on a line, you don't replace them; you just whisper the correct line to them for this specific scene.
  • The artist keeps their general knowledge (they still know what a car is), but they adjust their specific guess for this photo to make the shadows look perfect.
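The "tweak, don't rewrite" idea boils down to parameter freezing: at test time, gradients update only a small subset of weights. The sketch below uses an illustrative two-layer stand-in (`TinyDepthNet` is not the paper's architecture) where the encoder stays frozen and only the decoder weights are refined against a per-image loss target.

```python
# Sketch of test-time refinement with a frozen encoder. TinyDepthNet and the
# loss target are illustrative stand-ins, not the paper's actual model.
import numpy as np

rng = np.random.default_rng(1)

class TinyDepthNet:
    def __init__(self):
        self.encoder_w = rng.standard_normal((4, 4))  # general knowledge: frozen
        self.decoder_w = rng.standard_normal((4, 1))  # per-image guess: refined

    def forward(self, x):
        return (x @ self.encoder_w) @ self.decoder_w

def refine_decoder(net, x, target, lr=0.01, steps=200):
    """Gradient descent on decoder_w only; encoder_w is never touched."""
    features = x @ net.encoder_w                           # frozen "embeddings"
    for _ in range(steps):
        pred = features @ net.decoder_w
        grad = 2 * features.T @ (pred - target) / len(x)   # MSE chain rule
        net.decoder_w -= lr * grad

x = rng.standard_normal((8, 4))
target = rng.standard_normal((8, 1))  # stand-in for the re-lighting feedback
net = TinyDepthNet()
frozen_encoder = net.encoder_w.copy()
before = ((net.forward(x) - target) ** 2).mean()
refine_decoder(net, x, target)
after = ((net.forward(x) - target) ** 2).mean()
```

Because only the decoder moves, the fit to this one image improves while the encoder, the model's general knowledge, is bit-for-bit unchanged.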

4. The "Group Vote" (Ensembling)

Because the "Art Critic" is a bit random (it might say "fix the nose" one time and "fix the ear" the next), the editor runs this process 10 times with slightly different random lights. Then, they take the average of all 10 results. This ensures the final picture is stable and super sharp, removing any weird glitches.
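The group vote is plain ensembling: repeat the stochastic refinement with different random seeds and average the results, which cancels much of the per-run noise. Below, `noisy_refine` is a stand-in for one randomized refinement pass.

```python
# Sketch of ensembling over randomized runs. `noisy_refine` stands in for a
# full refinement pass with random lights/noise.
import numpy as np

def noisy_refine(depth, rng, noise=0.1):
    """Stand-in: one refinement run, perturbed by its random choices."""
    return depth + noise * rng.standard_normal(depth.shape)

true_depth = np.linspace(0, 1, 100)
runs = [noisy_refine(true_depth, np.random.default_rng(seed)) for seed in range(10)]
ensembled = np.mean(runs, axis=0)

single_err = np.abs(runs[0] - true_depth).mean()
ensemble_err = np.abs(ensembled - true_depth).mean()
# Averaging 10 independent runs shrinks the random error by roughly sqrt(10).
```

The same averaging logic is why the final depth map comes out stable and sharp rather than inheriting any single run's glitches.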

Why is this a big deal?

  • No New Training: You don't need to teach the artist anything new. You just use the tools you already have.
  • Fixes "Out-of-Distribution" Errors: If the artist has never seen a specific type of object before, this method uses the "Art Critic's" general knowledge of the world to fix the shape.
  • Better Details: It turns a blurry, "dog-like" tiger into a sharp, realistic tiger with the right nose shape and fur texture, just by fixing how the light hits it.

In short: Re-Depth Anything is a test-time editor that uses a virtual light show and a super-smart art critic to polish a depth map, making it look more realistic and accurate without needing to retrain the original AI model. It's like taking a good sketch and adding the perfect lighting to make it look like a photograph.