LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference

The paper proposes LoLep, a novel single-view view synthesis method that regresses locally-learned planes via a disparity sampler and self-attention mechanisms to achieve state-of-the-art results with improved occlusion inference and geometric supervision.

Cong Wang, Yu-Ping Wang, Dinesh Manocha

Published 2026-02-20

Imagine you are looking at a single photograph of a busy street. You want to step "inside" the photo and walk around, seeing what's behind the parked cars or peeking around the corners of buildings. This is called Single-View View Synthesis.

The problem? A flat photo has no depth. It's like trying to guess the shape of a 3D object by only looking at its shadow. Most computer programs try to guess the depth by stacking invisible "sheets" (planes) in the air to rebuild the scene. If they guess wrong, the new view looks blurry, ghostly, or broken.

Enter LoLep (Locally-Learned Planes and Self-Attention Occlusion Inference). Think of LoLep as a master sculptor who doesn't just stack sheets randomly, but carefully carves them to fit the scene perfectly, using only that one photo.

Here is how LoLep works, broken down into simple analogies:

1. The Problem with "Random Sheets"

Previous methods (like MINE) placed these invisible sheets at fixed or randomly sampled depths, scattering them through the scene and hoping enough of them landed in useful spots.

  • The Analogy: Imagine trying to build a 3D model of a house by throwing 100 sheets of paper into the air and hoping they land in the right spots to form walls and a roof. Most will land in the wrong place, and you'll need thousands of sheets just to get a decent shape. This wastes a lot of computer power (memory).
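Those "sheets" are what the literature calls a Multiplane Image (MPI): each plane carries a color and an opacity, and a new view is rendered by blending the planes back-to-front with the classic "over" operator. Here is a minimal numpy sketch of that compositing step (the shapes and the far-to-near ordering are assumptions for illustration, not the paper's exact pipeline):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front "over" compositing of MPI planes.

    colors: (D, H, W, 3) RGB per plane, ordered far-to-near
    alphas: (D, H, W, 1) opacity per plane
    """
    out = np.zeros(colors.shape[1:], dtype=np.float64)
    for rgb, a in zip(colors, alphas):  # far plane first
        out = rgb * a + out * (1.0 - a)
    return out

# Two planes: an opaque red background, a half-transparent blue sheet.
colors = np.zeros((2, 4, 4, 3))
colors[0, ..., 0] = 1.0   # far plane: red
colors[1, ..., 2] = 1.0   # near plane: blue
alphas = np.ones((2, 4, 4, 1))
alphas[1] *= 0.5          # near plane is 50% transparent
img = composite_mpi(colors, alphas)
# Each pixel blends to (0.5, 0.0, 0.5)
```

If a plane sits at the wrong depth, its pixels get blended into the wrong place when the camera moves, which is exactly where the blur and ghosting come from.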

2. The Solution: "Smart Local Search" (Locally-Learned Planes)

LoLep changes the game. Instead of throwing sheets randomly, it divides the space into specific "bins" (like drawers in a cabinet).

  • The Analogy: Imagine you have a cabinet with 16 drawers. Instead of throwing papers everywhere, LoLep says, "Okay, I know there is a wall somewhere in Drawer 3, and a tree in Drawer 7." It then asks the computer to find the exact spot within that specific drawer.
  • The Magic: This is called Locally-Learned Planes. By restricting the search to small, specific areas, the computer finds the perfect spot for each sheet much faster and with fewer sheets. This means LoLep can build a better 3D scene using fewer resources than its competitors.
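The "drawer" idea can be sketched in a few lines: partition the disparity range into bins, then let the network predict one offset per bin, so every plane is guaranteed to land inside its own drawer. In this toy version the offsets are given directly rather than regressed by a network, and the uniform-in-disparity bin edges are an assumption for illustration:

```python
import numpy as np

def locally_learned_disparities(offsets, d_min=0.0, d_max=1.0):
    """Place exactly one plane inside each disparity bin.

    offsets: (N,) values in [0, 1] — in LoLep these would come from a
    learned regressor (e.g. a sigmoid output); here they are hand-set.
    """
    n = len(offsets)
    edges = np.linspace(d_min, d_max, n + 1)   # bin boundaries
    widths = np.diff(edges)                    # size of each "drawer"
    return edges[:-1] + offsets * widths       # one disparity per bin

# 4 bins over [0, 1]; a centered offset puts each plane mid-drawer.
disps = locally_learned_disparities(np.array([0.5, 0.5, 0.5, 0.5]))
# -> [0.125, 0.375, 0.625, 0.875]
```

Because each plane can only move within its own bin, the planes stay sorted by depth and cover the whole scene, which is why far fewer of them are needed than with random placement.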

3. The "Blind Spot" Problem (Occlusion)

When you move the virtual camera away from the original viewpoint, some things that were hidden (like the side of a car) suddenly appear, and things that were visible (like the front of the car) slip out of view. This is called occlusion.

  • The Problem: Old methods often get confused here. They might try to "paint" the back of the car using the texture of the front, creating a weird "ghost" or a twisted pole.
  • The LoLep Fix: LoLep uses a special Self-Attention Mechanism.
    • The Analogy: Imagine a detective looking at a crime scene. Instead of looking at one clue in isolation, the detective looks at the whole room to see how clues relate to each other. "Ah, this shadow here means that object is blocking that wall over there."
    • The Block-Sampling Trick: Usually, this "detective work" (Self-Attention) is too heavy for computers to do on large images (it requires too much memory). LoLep invented a Block-Sampling technique.
    • The Analogy: Instead of the detective reading every single word in a 500-page book to find a connection, they read a few key paragraphs from different chapters. They get the same understanding of the story but finish the job 10 times faster. This allows LoLep to handle huge, high-quality images without crashing the computer's memory.
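The "key paragraphs" trick can be illustrated with a stripped-down attention layer: instead of letting every pixel attend to every other pixel (an H·W × H·W weight matrix), the keys and values come from a small sampled subset. The sampling rule below (one random token per block) is a simplified stand-in for the paper's block-sampling scheme, not its exact algorithm:

```python
import numpy as np

def block_sampled_attention(x, block=4, rng=None):
    """Self-attention whose keys/values are a sampled subset of tokens.

    x: (N, C) flattened feature map, N divisible by `block`.
    Samples one representative token from each block of `block` tokens,
    shrinking the attention matrix from N x N to N x (N / block).
    """
    rng = rng or np.random.default_rng(0)
    n, c = x.shape
    idx = np.arange(0, n, block) + rng.integers(0, block, size=n // block)
    k = v = x[idx]                        # (N/block, C) sampled keys/values
    scores = x @ k.T / np.sqrt(c)         # far fewer columns than full attention
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # softmax over sampled positions
    return w @ v                          # attended features, same shape as x

feats = np.random.default_rng(1).normal(size=(64, 8))
out = block_sampled_attention(feats, block=4)
# attention weights: 64 x 16 instead of 64 x 64
```

Every pixel still gets a global summary of the scene, but the memory cost of the weight matrix drops by the block factor, which is what makes attention affordable on large images.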

4. The "Teacher" (Occlusion-Aware Loss)

Since LoLep only has one photo to start with, it doesn't have a "correct answer" (depth map) to check against. How does it know if it's doing a good job?

  • The Analogy: Imagine you are trying to draw a map of a city from memory. You don't have a real map to check. So, you draw your map, then you try to "project" your drawing back onto the original photo. If your drawing says "there is a tree here," but the original photo shows a building, you know you made a mistake.
  • LoLep uses a Reprojection Loss to do exactly this. It checks if its 3D guess makes sense when projected back onto the 2D photo. If it sees a "ghost" (a mismatch), it learns to fix the geometry. It specifically ignores the parts of the image that are hidden (occluded) so it doesn't get confused by missing information.
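The core of such an occlusion-aware photometric loss is short: compare the reprojected image to the original, but multiply the error by a visibility mask so occluded pixels contribute nothing. The mask here is supplied by hand for illustration; in LoLep it would be derived from the predicted geometry:

```python
import numpy as np

def occlusion_aware_loss(pred, target, visible_mask):
    """Mean L1 photometric error over visible pixels only.

    pred: reprojected image, target: original image,
    visible_mask: 1 where a pixel is seen in both views, 0 if occluded.
    """
    err = np.abs(pred - target) * visible_mask
    return err.sum() / np.maximum(visible_mask.sum(), 1)

target = np.ones((4, 4))
pred = np.ones((4, 4))
pred[0, 0] = 0.0                 # a "ghost": one mismatched pixel
mask = np.ones((4, 4))
full = occlusion_aware_loss(pred, target, mask)    # mismatch penalized
mask[0, 0] = 0.0                 # mark that pixel as occluded
masked = occlusion_aware_loss(pred, target, mask)  # mismatch ignored
```

Without the mask, the network would be punished for pixels it could never have seen, and it would learn to smear textures into occluded regions; with the mask, only genuine geometric mistakes are penalized.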

Why is this a Big Deal?

  • Better Quality: LoLep creates sharper, more realistic new views. It doesn't leave you with blurry ghosts or twisted poles.
  • Efficiency: It achieves better results using fewer planes (sheets) than previous methods.
    • Analogy: LoLep can build a perfect castle using 16 bricks, while the old methods needed 64 bricks to build a shaky one.
  • No Extra Tools Needed: Many other methods need a separate "depth camera" or a pre-trained depth detector to work. LoLep figures it all out from just the one RGB photo, making it more versatile.

In Summary

LoLep is like a smart, efficient architect. Instead of randomly guessing where to put walls (planes), it searches specific, logical spots. It uses a "detective" system to figure out what's hidden behind objects, and it does all this without needing a massive computer or extra depth sensors. The result? You can take a single photo and walk around in it with a level of realism that was previously impossible.
