MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

This paper introduces MRD, a method that uses physically based differentiable rendering to find 3D scene parameters that produce identical model activations (metamers), thereby enabling the probing and analysis of vision models' implicit understanding of physical scene properties like shape and material.

Benjamin Beilharz, Thomas S. A. Wallis

Published 2026-02-24

The Big Idea: The "Magic Mirror" Test

Imagine you have a robot that is incredibly good at recognizing pictures. You show it a photo of a golden dragon, and it says, "That's a dragon!" But here's the problem: we don't know why it thinks that. Does it recognize the dragon by its shape, or is it just latching onto the shiny gold texture?

This paper introduces a new tool called MRD (Metamers Rendered Differently). Think of MRD as a magic mirror that lets us peek inside the robot's brain to see what it actually "sees."

The Core Concept: "Model Metamers"

In human vision, a "metamer" is a pair of physically different stimuli that look identical. The yellow on your screen is actually a mix of red and green light, yet it looks exactly the same to your eye as the pure yellow light of a sodium lamp. The physics is different, but the percepts match: they are "metamers."

The researchers wanted to find Model Metamers. These are 3D scenes that look completely different to us (humans) but look identical to the AI.
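The idea can be sketched with a toy linear "model": any input direction the model is blind to (its null space) lets two very different inputs produce identical activations. Everything here (the matrix `A`, the vectors) is an illustrative assumption, not the paper's actual network.

```python
import numpy as np

# Toy "model": a linear feature extractor f(x) = A @ x.
# Because A maps 3 inputs to 2 features, it has a null space:
# directions the model is completely blind to.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

x_ref = np.array([2.0, 3.0, 1.0])      # the "real dragon"
null_dir = np.array([1.0, 1.0, -1.0])  # A @ null_dir == 0
x_metamer = x_ref + 5.0 * null_dir     # a very different input

feats_ref = A @ x_ref
feats_met = A @ x_metamer

# The inputs differ, but the model's activations are identical:
print(np.allclose(feats_ref, feats_met))   # True
print(np.allclose(x_ref, x_metamer))       # False
```

Real networks are nonlinear, but the same principle applies: whatever the model's features discard, MRD can vary freely without the model noticing.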

The Analogy: The Master Chef and the Fake Ingredients
Imagine a Master Chef (the AI) who can identify a dish just by tasting it.

  • The Goal: We want to know if the Chef cares about the ingredients (shape) or just the spices (texture/material).
  • The Trick: We use a "magic kitchen" (MRD) to cook a fake dish. We take a bowl of jelly (different shape) and coat it in the exact same spices as the steak.
  • The Test: If the Chef says, "This is a steak!" when looking at the jelly, then the Chef doesn't actually understand what a steak is; they just recognize the spices.

How MRD Works: The "Reverse Engineer"

Usually, when we train AI, we show it pictures and it learns to guess what's in them. MRD does the opposite. It starts with the AI's "guess" (its internal brain activity) and tries to build a 3D world that would cause that exact guess.

  1. Start with a Blank Canvas: The computer starts with a random 3D shape (like a sphere) and random materials (like plastic).
  2. The "Magic" Camera: It renders the scene with a camera that simulates real physics (how light bounces off surfaces) and, crucially, does so differentiably: tiny changes to the scene can be traced through to tiny changes in the picture.
  3. The Feedback Loop:
    • The camera takes a picture of the 3D scene.
    • That picture is fed to the AI, and its internal activations (its "brain activity") are recorded.
    • The computer compares those activations to the activations the AI produced for the original dragon image.
    • The Adjustment: Using the gradients from the differentiable renderer, the computer slightly changes the 3D shape or material in whichever direction makes the activations match better.
  4. The Result: Eventually, the computer creates a 3D object that may look weird to us (maybe a spiky blob) but produces the same internal activations as the original dragon: a model metamer.
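The feedback loop above can be sketched as gradient descent on a feature-matching loss. The "renderer" and feature map below are toy stand-ins (an identity renderer and a fixed linear map), not the paper's physically based renderer or a real vision network.

```python
import numpy as np

# Toy stand-in for the frozen vision model's activations.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

def render(theta):
    # Stand-in for a differentiable renderer: scene params -> "image".
    return theta

def features(img):
    # Stand-in for the model's internal activations.
    return A @ img

target = features(render(np.array([2.0, 3.0, 1.0])))  # the "dragon"

theta = np.zeros(3)  # start from a "blank" scene
lr = 0.1
for _ in range(200):
    err = features(render(theta)) - target
    grad = 2.0 * A.T @ err   # gradient of the squared feature loss
    theta -= lr * grad       # nudge the scene toward a better match

loss = np.sum((features(render(theta)) - target) ** 2)
print(round(loss, 6))  # ~0: activations now match the target
```

Notably, the optimized scene ends up different from the original parameters (2, 3, 1) even though its activations match perfectly: the loop has found a model metamer, which is exactly the behavior MRD exploits.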

What They Discovered: The "Texture Trap"

The researchers tested this on many different AI models to see if they understood Shape (geometry) or Material (texture/shine).

1. The Material Test (The "Shiny" Test)

  • Result: The AI was very good at this.
  • Analogy: If you ask the AI to recreate a "brushed metal" texture, it can build a perfect fake metal surface that looks exactly like the real thing to the AI, even if the shape is slightly off.
  • Why? AI models are great at recognizing patterns of light and color (like a shiny car).

2. The Shape Test (The "Dragon" Test)

  • Result: The AI struggled here.
  • Analogy: When asked to recreate the shape of a dragon, the AI often built a weird, spiky blob that looked nothing like a dragon to a human. But, the AI insisted, "No, that IS a dragon!"
  • The "Spiky Blob" Problem: This suggests that many AIs don't really understand the 3D structure of a dragon. They just associate "dragon-ness" with certain textures or patterns. If you give them a spiky blob with the right "dragon texture," they are fooled.

3. The "Shape-Biased" AI
The researchers also tested an AI that was specifically trained to care more about shapes (ResNet-SIN, a ResNet trained on Stylized-ImageNet, a dataset where textures are deliberately scrambled so that shape is the only reliable cue).

  • Result: This AI was much better at the shape test. It didn't get fooled by the spiky blobs as easily. It actually tried to build something that looked like a dragon.
  • Takeaway: We can teach AI to understand 3D shapes better by changing how we train them.
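The texture-trap versus shape-bias contrast can be illustrated with two toy "models" that weight the same scene parameters differently. The parameterization (a scene reduced to one shape number and one texture number) and both weight vectors are hypothetical illustrations, not the paper's networks.

```python
import numpy as np

# Hedged toy: a scene is just (shape, texture). Each "model" is a
# single weighted sum of those two numbers.
texture_biased = np.array([0.01, 1.0])  # barely reads shape
shape_biased   = np.array([1.0, 0.1])   # mostly reads shape

def activation(weights, params):
    return weights @ params

dragon = np.array([5.0, 2.0])     # real shape, real texture
blob   = np.array([-40.0, 2.45])  # wildly wrong shape, tweaked texture

# The spiky blob matches the dragon for the texture-biased model...
print(abs(activation(texture_biased, dragon)
          - activation(texture_biased, blob)) < 1e-6)  # True
# ...but the shape-biased model easily tells them apart.
print(abs(activation(shape_biased, dragon)
          - activation(shape_biased, blob)) > 1.0)     # True
```

In this caricature, a tiny texture tweak fully compensates for a huge shape error in the texture-biased model, which is why its metamers can be spiky blobs; the shape-biased model leaves no such loophole.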

Why This Matters

This paper is like a lie detector test for AI.

  • For Computer Vision: It helps us build better AI that doesn't get tricked by bad lighting or weird textures.
  • For Human Vision: It helps us understand how our own brains work. If an AI and a human both get fooled by the same "magic trick," maybe our brains work in similar ways.

The Bottom Line

The authors built a tool that lets us "reverse engineer" what an AI sees. They found that while AI is amazing at recognizing textures and materials, it often fails to understand the true 3D shape of objects. It's like a chef who can identify a dish by its smell but can't tell you what the ingredients actually look like.

By using this tool, we can fix these blind spots and build smarter, more robust AI that understands the world more like we do.
