Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

This paper introduces Light-Geometry Interaction (LGI) maps, a novel representation derived from monocular depth that encodes light-aware occlusion, enabling a unified, physics-consistent pipeline for joint shadow generation and relighting. A bridge-matching generative model, trained on a newly curated large-scale benchmark, uses these maps to suppress common artifacts such as floating shadows.

Shan Wang, Peixia Li, Chenchen Xu, Ziang Cheng, Jiayu Yang, Hongdong Li, Pulak Purkait

Published 2026-03-03

Imagine you are a digital artist trying to insert a new object—say, a shiny red apple—into an existing photo of a kitchen table.

If you just paste the apple in, it looks fake. It's floating in mid-air, it has no shadow, and the light hitting it doesn't match the light in the room. To make it look real, you need to solve two problems at once:

  1. Relighting: You need to paint the apple so it looks like it's actually sitting under the kitchen lamp (shiny on top, dark on the bottom).
  2. Shadow Casting: You need to draw a shadow on the table that matches the apple's shape and the lamp's angle.

The Problem with Old Methods
Previous AI tools tried to do this by guessing. They looked at the picture and said, "I think the light is coming from the left, so I'll draw a shadow there." But without understanding the physics of the scene, they often made mistakes. They would draw shadows that floated in the air, cast shadows in the wrong direction, or make the apple look like it was glowing from the inside. It was like trying to paint a realistic sunset without knowing where the sun actually is.

Other methods tried to build a full 3D model of the room first (like a video game engine), but that takes forever and is too heavy for everyday editing.

The New Solution: "Light-Geometry Interaction" (LGI)
This paper introduces a clever new trick called Light-Geometry Interaction (LGI) maps.

Think of the AI as a chef.

  • Old AI: Just looks at the ingredients (the photo) and guesses the recipe.
  • This New AI: Has a special "tasting spoon" (the LGI map) that tells it exactly how the light hits the ingredients based on their shape.

Here is how the "LGI Map" works, using a simple analogy:

Imagine you are in a dark room with a single flashlight. You hold up a ball.

  1. The Depth Map: The AI first uses a standard tool to guess how far away every part of the table and the ball is. It's like having a rough 3D sketch.
  2. The Ray Cast: Now, imagine the AI shoots invisible laser beams from every point on the ball toward the flashlight.
  3. The "Occlusion" Check: The AI asks, "Does this laser beam hit the table before it hits the light?"
    • If the beam hits the table first, that part of the ball is in shadow.
    • If the beam hits the light directly, that part is lit.

The LGI Map is a special blueprint that records the results of these laser checks. It doesn't just say "shadow here." It says, "The light is blocked by the table at this specific angle and this specific distance."
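The three steps above can be sketched in code. Below is a minimal, hypothetical screen-space version: march from a pixel toward the light and test whether the stored depth blocks the ray. The function name `lgi_occlusion`, the fixed-step march, the depth bias, and the assumption that depth shares units with pixel steps are all my own simplifications for illustration, not the paper's actual implementation.

```python
import numpy as np

def lgi_occlusion(depth, px, py, light_dir, steps=64, step_size=1.0, bias=0.05):
    """Screen-space occlusion check: march from pixel (px, py) toward the
    light and test whether stored scene depth blocks the ray.
    Returns (occluded, hit_distance) -- distance is the 'specific distance'
    the LGI map records, not just a binary shadow flag."""
    h, w = depth.shape
    x, y, z = float(px), float(py), float(depth[py, px])
    dx, dy, dz = light_dir  # direction toward the light in (x, y, depth) units
    for i in range(1, steps + 1):
        sx = x + dx * step_size * i
        sy = y + dy * step_size * i
        sz = z + dz * step_size * i
        ix, iy = int(round(sx)), int(round(sy))
        if not (0 <= ix < w and 0 <= iy < h):
            break  # ray left the image: assume the light is visible
        # If the surface at this pixel is closer to the camera than the ray
        # point, the ray has passed behind geometry -> the light is blocked.
        if depth[iy, ix] < sz - bias:
            return True, step_size * i
    return False, np.inf
```

For example, with a near "wall" column in an otherwise far depth map, a point behind the wall reports `(True, distance_to_wall)` while a point past it reports `(False, inf)`.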

Why is this a big deal?
By feeding this blueprint into the AI, the system stops guessing. It's like giving the artist a ruler and a protractor instead of letting them draw freehand.

  • No more floating shadows: The shadow knows exactly where the table is because the LGI map calculated the distance.
  • Perfect lighting: The apple knows exactly how bright it should be because the map calculated the angle of the light hitting it.
  • Complex interactions: It even handles tricky stuff, like a glass vase casting a shadow through a table, or a shiny metal ball reflecting the shadow of a chair onto itself.
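The "perfect lighting" point follows from Lambert's cosine law: a surface's brightness scales with the cosine of the angle between its normal and the light direction. A rough sketch of that idea, assuming normals estimated from depth gradients under a simple unit-depth camera model (the helper name and setup are my own illustration, not taken from the paper):

```python
import numpy as np

def lambert_shading(depth, light_dir):
    """Estimate surface normals from depth gradients and apply
    Lambert's cosine law: brightness = max(0, n . l)."""
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    # A surface tilting away in x or y tilts its normal the opposite way.
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    return np.clip(normals @ l, 0.0, 1.0)  # per-pixel cos(angle), clamped
```

A flat surface lit head-on gets full brightness; tilt the light 45 degrees and brightness drops to cos(45°) ≈ 0.71, which is exactly the "angle of the light" information the LGI pipeline can exploit instead of guessing.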

The "ShadRel" Dataset
To teach the AI this new skill, the authors built a massive training library called ShadRel. Imagine a giant virtual studio with 800,000 different objects (glass, metal, wood, leather) and millions of different lighting setups. They used this to train the AI to master the art of shadows and light.

The Result
The paper shows that this method creates images that are incredibly realistic. Whether you are adding a person to a beach scene or a product to a store shelf, the shadows and lighting look like they belong there naturally. It bridges the gap between "magic AI generation" and "physics-based reality," making digital editing feel as natural as placing a real object on a real table.

In a nutshell:
They gave the AI a "physics cheat sheet" (the LGI map) so it can finally understand how light and shadows actually work, resulting in digital edits that look indistinguishable from reality.