Imagine you are looking at a photograph of a shiny, complex object, like a ceramic teapot or a metal sculpture. You want to know exactly how its surface curves, where the bumps are, and how deep the crevices go, just by looking at that single flat picture. This is what computer scientists call Monocular Normal Estimation.
The "Normal Map" is like a secret code that tells a computer the direction every tiny point on the object's surface is facing. If you have this code, you can make the object look 3D, change the lighting, or even print it in real life.
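To make the "secret code" concrete, here is a toy sketch of what a normal map lets you do: relight a surface. The numbers below are made up for illustration, and the shading rule is the simplest possible one (a Lambertian model, where brightness is the dot product of the surface direction and the light direction).

```python
import numpy as np

# A normal map stores, at every pixel, the 3D direction that point on the
# surface is facing. Toy 2x2 example (values are illustrative only):
normals = np.array([
    [[0.0, 0.0, 1.0], [0.7, 0.0, 0.714]],   # facing the camera / tilted right
    [[0.0, 0.7, 0.714], [0.0, 0.0, 1.0]],   # tilted up / facing the camera
])
light = np.array([0.0, 0.0, 1.0])  # a light shining straight at the object

# Lambertian shading: brightness = max(0, normal . light).
# Pixels facing the light come out brightest; tilted pixels are dimmer.
shading = np.clip(normals @ light, 0.0, None)
print(shading)
```

Change the `light` vector and the same normal map produces a differently lit image, which is exactly why normal maps are useful for relighting.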
However, existing computer programs have a big problem: they produce normal maps that look plausible, but the underlying geometry is often wrong. It's like a painter who can mix exactly the right shade of gray for a shadow but paints the shadow in the wrong place. The result looks okay from a distance, but if you try to build the object in 3D, it falls apart. The paper calls this "3D Misalignment."
The Big Idea: Stop Guessing the Shape, Start Guessing the Light
The authors of this paper, which introduces a method called RoSE, decided to change the rules of the game.
The Old Way (The "Direct Guess"):
Imagine trying to guess the shape of a mountain by looking at a single photo of it. You have to guess the steepness of every slope just by looking at the colors. It's hard because the differences in shape are often hidden in very subtle color changes.
The New Way (The "Shading Sequence"):
Instead of asking the computer to guess the shape directly, RoSE asks a different question: "If we shined a flashlight on this object from 9 different angles in a row, what would the shadows look like?"
Think of it like this:
- The Old Way: Trying to describe a person's face by guessing the shape of their nose, eyes, and mouth all at once from a blurry photo.
- The New Way: Asking, "If I shine a light on this person's face from the left, then the top, then the right, how does their shadow move?"
By watching how the shadows dance across the object as the light moves, the computer can figure out the exact shape much more easily. This is called a Shading Sequence.
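The "shadows dancing" idea can be simulated in a few lines. Below is a sketch (not the paper's code) that renders one surface point under a ring of 9 lights sweeping around it, again assuming a simple Lambertian model; the light angles and the surface direction are illustrative values.

```python
import numpy as np

# One surface point's orientation (illustrative), normalized to unit length.
normal = np.array([0.3, 0.2, 0.93])
normal /= np.linalg.norm(normal)

# A ring of 9 lights sweeping around the object, all angled slightly forward.
angles = np.linspace(0, 2 * np.pi, 9, endpoint=False)
lights = np.stack([np.cos(angles), np.sin(angles),
                   np.full_like(angles, 1.0)], axis=1)
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

# As the light moves, the brightness of the same point rises and falls.
# That changing pattern is the "shading sequence" that encodes the shape.
sequence = np.clip(lights @ normal, 0.0, None)
print(sequence.round(2))  # 9 brightness values, one per light position
```

A point tilted a different way would produce a different rise-and-fall pattern, which is why the sequence pins down the surface direction far more tightly than a single image can.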
How RoSE Works: The "Video Time-Traveler"
To solve this, the authors used a very clever trick. They realized that a sequence of shadows (light moving from angle A to angle B to angle C) looks a lot like a video.
- The Magic Tool: They took a powerful AI model designed to generate videos (like the ones that turn text into movie clips) and repurposed it.
- The Input: You give the AI a single photo of an object.
- The Task: The AI acts like a time-traveling director. It imagines: "Okay, I'm going to move a giant ring of lights around this object. What does the object look like as the light sweeps over it?"
- The Output: The AI generates a short "video" (a sequence of images) showing the object under these different lights.
- The Math: Once the AI has generated this sequence of shadings, no more guessing is needed. A simple, classical calculation (in essence, the decades-old technique of photometric stereo) turns the brightness changes at each pixel into an exact surface direction.
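That final step can be sketched concretely. Under a Lambertian assumption, each light direction gives one linear equation in the unknown normal, so 9 lights give an overdetermined system solved by least squares. The setup below is synthetic (we pick a "true" normal, render it, then recover it) and is not the paper's implementation:

```python
import numpy as np

# Synthetic setup: choose a "true" surface normal to recover.
rng = np.random.default_rng(0)
true_normal = np.array([0.2, -0.3, 0.93])
true_normal /= np.linalg.norm(true_normal)

# 9 known light directions, nudged to point generally at the surface.
lights = rng.normal(size=(9, 3))
lights[:, 2] = np.abs(lights[:, 2]) + 1.0
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

# Lambertian rendering without shadowing: one brightness per light.
intensities = lights @ true_normal

# Least squares: find n minimizing ||L n - I||^2, then normalize.
n, *_ = np.linalg.lstsq(lights, intensities, rcond=None)
n /= np.linalg.norm(n)
print(np.allclose(n, true_normal))  # prints True
```

With noiseless synthetic data the recovery is exact; with an AI-generated shading sequence the same solve simply finds the normal that best explains the predicted brightnesses.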
Why This is a Game-Changer
The authors built a massive training library called MultiShade. Imagine a digital warehouse filled with 90,000 different 3D objects (from teddy bears to ancient statues), each covered in different materials (shiny metal, rough wood, soft fabric) and lit in thousands of different ways. They trained the AI in this warehouse so it could handle almost anything you throw at it.
The Results:
- Sharper Details: Unlike other methods, which tend to make objects look like smooth, melted plastic, RoSE keeps fine details such as the wrinkles in fabric or the sharp edge of a knife.
- Better 3D: Because the AI focused on the physics of light and shadow first, the resulting 3D shapes actually fit together correctly. No more "floating" parts or weird bumps.
- Generalization: It works on things it has never seen before, from real-world photos of squirrels and teacups to complex 3D models.
The Analogy Summary
- Old Methods: Trying to guess the layout of a maze by looking at a single, static map. You might get the walls right, but the path is confusing.
- RoSE: Instead of looking at the map, you send a drone through the maze with a flashlight, recording how the light hits the walls as it moves. By watching the light dance, you can reconstruct the maze perfectly.
The Catch (Limitations)
While RoSE is amazing, it's not magic yet:
- It's a bit slow: Because it's using a heavy-duty video generator, it takes a few seconds to process an image, which might be too slow for real-time video games right now.
- Transparency: It struggles with glass or see-through objects because light passes through them instead of casting clear shadows.
- Darkness: If an object is in almost total darkness, the AI can't guess the shadows well.
Conclusion
RoSE is a new way of thinking about 3D vision. Instead of forcing computers to guess shapes directly, it teaches them to understand how light interacts with surfaces. By turning a single photo into a "shadow movie," it unlocks a level of 3D detail and accuracy that previous methods couldn't reach. It's a significant step forward for virtual reality, robotics, and digital art.