Revisiting Shape from Polarization in the Era of Vision Foundation Models

This paper demonstrates that a lightweight polarization-based model, trained on a small dataset, can significantly outperform both state-of-the-art Shape from Polarization methods and large-scale RGB-only Vision Foundation Models at single-shot surface normal estimation. The key is closing the domain gap with a high-quality dataset of 3D-scanned objects, DINOv3 priors, and sensor-aware augmentation.

Chenhao Li, Taishi Ono, Takeshi Uemori, Yusuke Moriuchi

Published 2026-03-06

Imagine you are trying to figure out the shape of a clay sculpture just by looking at a single photograph of it. This is a classic puzzle for computers called "Shape from a Single Image."

For a long time, computers have been getting really good at this by studying millions of photos. These "super-smart" computer brains (called Vision Foundation Models) are like students who have read every book in the library. They are incredibly accurate, but they are also expensive, slow, and hungry. They need massive amounts of data and huge computer power to learn.

Then, there's an older, more physics-based approach called Shape from Polarization (SfP). This method uses a special camera that sees how light "bounces" off surfaces (polarization). It's like having a pair of special glasses that reveal the texture and angle of a surface just by how the light hits it. Theoretically, this should be a superpower. But in reality, these older methods have been struggling, often performing worse than the big, hungry computer brains.
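To make the "special glasses" idea concrete: the physics behind SfP says that the degree of linear polarization (DoLP) of light diffusely reflected off a dielectric surface depends on the surface's tilt (zenith angle) through the Fresnel equations. The sketch below is a minimal illustration of that standard relationship for an assumed refractive index of 1.5; the function name and parameters are ours, not the paper's.

```python
import numpy as np

def diffuse_dolp(zenith, n=1.5):
    """Degree of linear polarization of diffusely reflected light as a
    function of surface zenith angle, for a dielectric with refractive
    index n (classic Fresnel-based SfP model)."""
    s = np.sin(zenith)
    c = np.cos(zenith)
    num = (n - 1.0 / n) ** 2 * s ** 2
    den = (2 + 2 * n ** 2
           - (n + 1.0 / n) ** 2 * s ** 2
           + 4 * c * np.sqrt(n ** 2 - s ** 2))
    return num / den

# DoLP grows with zenith angle: a surface facing the camera head-on
# (zenith 0) is unpolarized, while a steeply tilted surface polarizes
# the light strongly -- that is the shape cue SfP exploits.
angles = np.deg2rad([0.0, 30.0, 60.0, 85.0])
print(np.round(diffuse_dolp(angles), 3))
```

Because this curve is monotonic over the usual range, measuring DoLP at a pixel constrains how tilted the surface is there, which is why polarization should, in theory, be a "superpower" for shape recovery.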

The Big Question:
Why is the "special glasses" method failing? Are the glasses broken, or is the student just not studying the right way?

The Paper's Big Discovery

The authors of this paper say: "The glasses aren't broken; the training was just fake."

They found that previous attempts to teach computers to use these special glasses failed because of two main problems:

  1. The "Plastic Toy" Problem (Fake Data):
    Imagine trying to teach a chef to cook a real steak by only showing them plastic toy steaks. The plastic toys look okay, but they don't have the right texture or heat. Previous datasets used simple, computer-generated 3D shapes with random, mismatched textures. The computer learned to recognize the plastic, not the real world.

    • The Fix: The authors built a new "kitchen" using 1,954 real-world 3D scanned objects (like actual statues and toys) and created 40,000 high-quality training scenes. This is like feeding the chef real, high-quality ingredients instead of plastic toys.
  2. The "Perfect World" Problem (Ignoring Noise):
    In the computer simulations, the camera was perfect. But in the real world, cameras get grainy, blurry, and noisy. The special "polarization" signal is very sensitive to this noise. Previous methods trained on "perfect" data, so when they saw a "noisy" real-world photo, they got confused.

    • The Fix: The authors taught their model to expect imperfections. They artificially added blur, grain, and noise to the training images before processing the polarization data. It's like training a pilot in a simulator that includes storms and turbulence, so they don't panic when they fly in real bad weather.
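The key detail in this fix is the ordering: the noise is injected into the raw polarizer-angle images before the polarization cues are computed, so the cues themselves are degraded exactly as they would be on a real sensor. Here is a minimal numpy sketch of that idea; the noise model (Poisson shot noise plus Gaussian read noise) is a common sensor approximation, and the specific parameter values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(i0, i45, i90, i135, read_noise=0.01, shot_scale=1000.0):
    """Sensor-aware augmentation sketch: corrupt the four polarizer-angle
    images BEFORE polarization cues are derived, so the model trains on
    cues that carry realistic sensor degradation."""
    def noisy(img):
        shot = rng.poisson(img * shot_scale) / shot_scale   # photon shot noise
        return shot + rng.normal(0.0, read_noise, img.shape)  # read noise
    return [noisy(x) for x in (i0, i45, i90, i135)]

def dolp(i0, i45, i90, i135):
    """Degree of linear polarization from four polarizer angles,
    via the linear Stokes parameters s0, s1, s2."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    return np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, 1e-6)

flat = np.full((8, 8), 0.5)  # a perfectly unpolarized patch
print(dolp(flat, flat, flat, flat).max())            # 0.0 on clean data
print(dolp(*augment(flat, flat, flat, flat)).max())  # > 0: noise leaks into the cue
```

Note how a patch that is truly unpolarized shows a nonzero DoLP after augmentation: sensor noise masquerades as polarization signal. Training on such examples is what keeps the model from being fooled by real cameras.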

The Secret Sauce: DINOv3

To make the model even smarter without needing a massive brain, they gave it a "cheat sheet" from a pre-trained AI called DINOv3. Think of DINOv3 as a student who has already memorized the general shapes of the world. By letting this student help the new model, the new model learns much faster and needs far less data to become an expert.

The Amazing Results

The results are like a magic trick:

  • Speed & Size: Their new model is 8 times smaller and 33 times faster to train than the giant "Vision Foundation Models."
  • Performance: Despite being smaller and trained on much less data, it beats the giants. It reconstructs shapes more accurately than the massive models that require millions of images.
  • Efficiency: They proved that using the "special glasses" (polarization) allows you to get top-tier results with a tiny fraction of the resources.

The Catch (Limitations)

It's not perfect yet.

  • Scene vs. Object: The model is great at looking at a single object (like a dinosaur figurine), but it gets confused if you show it a whole room with walls and furniture. It's like a sculptor who is amazing at making a single statue but doesn't know how to design a whole house.
  • The "Fuzzy" Problem: If an object is very fuzzy or white (like a baseball), it doesn't reflect polarized light well. In these cases, the "special glasses" get noisy, and the model reverts to guessing like a normal camera.
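One natural way to handle this failure mode is to treat low-DoLP pixels as untrustworthy and fall back to appearance cues there. The sketch below is our own illustration of that idea, not the paper's mechanism; the function name and threshold value are hypothetical.

```python
import numpy as np

def polarization_confidence(dolp, threshold=0.02):
    """Hypothetical reliability mask: where the degree of linear
    polarization (DoLP) sits below a noise floor -- e.g. on matte
    white or fuzzy surfaces -- the polarization cue is dominated by
    sensor noise, and a model should lean on RGB appearance instead.
    The threshold here is illustrative, not taken from the paper."""
    return dolp >= threshold

# Top row: strongly polarized pixels (trust the "glasses").
# Bottom row: near-zero DoLP (fall back to normal-camera guessing).
dolp_map = np.array([[0.30, 0.15],
                     [0.01, 0.005]])
print(polarization_confidence(dolp_map))
```

This mirrors the behavior the authors describe: on shiny, tilted surfaces the polarization signal dominates, while on fuzzy white objects the model effectively reverts to what an RGB-only network would do.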

The Bottom Line

This paper is a wake-up call. It tells us that in the era of massive AI, we don't always need to build bigger, hungrier models. Sometimes, the answer is to combine physics (the laws of light) with smart data training. By using the right "glasses" and teaching the AI with realistic, noisy data, we can build small, fast, and incredibly accurate tools that don't need a supercomputer to run.

In short: Don't just feed the AI more data; feed it better data and give it the right tools to see the world.