Imagine you are an artist trying to paint a 3D scene, like a living room, but you've only been given a few photos of it from different angles. Your goal is to imagine and paint what the room looks like from a completely new angle you've never seen before. This is called Novel View Synthesis (NVS).
For a long time, artists (AI models) tried to build a perfect 3D blueprint of the room first, then paint from it. But this was slow and rigid. Recently, a new style of "artist" emerged: Transformers. These are AI models that look at the photos and guess the new view directly, without building a rigid 3D blueprint. They are amazing, but they are also incredibly hungry for computer power.
This paper is about teaching these AI artists how to work smarter, not harder.
Here is the breakdown of their discoveries, using some everyday analogies:
1. The Problem: The "Over-Engineered" Chef
The current top-performing AI model (called LVSM) works like a chef who, every time they need to serve a new dish (a new view), re-cooks the entire meal from scratch, even if they just made a similar dish five minutes ago.
- How it works: To show you a view from the left, the model reads all the input photos. To show you a view from the right, it reads all the input photos again, processing them from the very beginning.
- The result: It's accurate, but it's incredibly wasteful. If you want to generate 100 views, it does the heavy lifting 100 times.
2. The Solution: The "Master Chef" (SVSM)
The authors propose a new model called SVSM (Scalable View Synthesis Model). Think of this as a Master Chef who prepares a Master Broth (a scene representation) once, and then uses that single pot to serve any number of dishes instantly.
- The Encoder (The Prep): The model looks at all the input photos once and creates a "summary" or "latent representation" of the scene.
- The Decoder (The Service): When you ask for a new view, the model doesn't re-read the photos. It just dips a ladle into that Master Broth and serves your specific view.
- The Benefit: If you want 100 views, the model only does the heavy lifting once. The rest is fast and cheap.
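The savings from "prepare once, serve many" are easy to see with a toy cost model. The numbers below are hypothetical, chosen only to illustrate the shape of the comparison, not measured from the paper:

```python
# Toy cost model (all numbers hypothetical) contrasting the two designs.
ENCODE_COST = 100  # units of work to read and process all input photos
DECODE_COST = 1    # units of work to render one new view

def decoder_only_cost(num_views: int) -> int:
    """LVSM-style: re-reads all input photos for every requested view."""
    return num_views * (ENCODE_COST + DECODE_COST)

def encoder_decoder_cost(num_views: int) -> int:
    """SVSM-style: encode the scene once, then decode each view cheaply."""
    return ENCODE_COST + num_views * DECODE_COST

print(decoder_only_cost(100))    # 10100 -- heavy lifting repeated 100 times
print(encoder_decoder_cost(100)) # 200   -- heavy lifting done once
```

The gap widens with every additional view, which is exactly why the encoder-decoder split pays off for real-time use.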
3. The Secret Sauce: The "Effective Batch Size"
You might think, "If the Master Chef is so efficient, why didn't anyone use this before?"
The authors found that previous attempts failed because they didn't know how to train the model correctly. They discovered a concept called Effective Batch Size.
- The Analogy: Imagine you are training a student.
- Method A: Show the student 100 different rooms, but only ask them to draw one angle of each.
- Method B: Show the student just 10 rooms, but ask them to draw 10 different angles of each room.
- The Discovery: The authors found that Method B teaches the student just as well, and because the "Master Chef" (SVSM) can draw extra angles of a scene it has already encoded almost for free, it saves massive amounts of time and energy.

- The Rule: It's not about how many rooms you see, but the total number of drawings you make. If you keep the total number of drawings constant, the learning outcome is the same, but the "Master Chef" does it much faster.
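Methods A and B can be written as a small sum. In the sketch below, "effective batch size" is scenes per batch times target views per scene; the cost constants reuse the hypothetical units from the chef analogy and are not the paper's measurements:

```python
# Effective batch size = scenes per batch x target views per scene.
# With an encoder-decoder model, encoding is paid once per scene,
# so fewer scenes with more views each is much cheaper per step.
ENCODE_COST, DECODE_COST = 100, 1  # hypothetical work units

def training_step_cost(scenes: int, views_per_scene: int) -> int:
    """Work for one training step of an encoder-decoder model."""
    return scenes * (ENCODE_COST + views_per_scene * DECODE_COST)

method_a = (100, 1)   # 100 rooms, 1 drawing each
method_b = (10, 10)   # 10 rooms, 10 drawings each
for scenes, views in (method_a, method_b):
    effective_batch = scenes * views  # identical for both: 100
    print(f"effective batch {effective_batch}: "
          f"cost {training_step_cost(scenes, views)}")
```

Both methods produce 100 "drawings" per step, but Method B costs 1,100 units against Method A's 10,100 in this toy model, which is the intuition behind training with fewer scenes and more target views each.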
4. The "GPS" for 3D (PRoPE)
When the authors tried to scale this up to complex scenes with many input photos (like a panoramic view), the "Master Chef" got confused. It lost track of where the cameras were pointing.
- The Fix: They added a special "GPS tag" to the data called PRoPE, a positional encoding built from the relative poses between cameras rather than their absolute positions.
- The Analogy: It's like giving the chef a map that says, "Photo A is to the left of Photo B." Without this map, the chef gets lost when looking at many photos at once. With the map, the model scales beautifully, handling complex scenes without getting dizzy.
5. The Results: Faster, Cheaper, Better
By combining the "Master Chef" approach with the "Effective Batch Size" training rule and the "GPS" tags, the authors achieved something incredible:
- 3x Efficiency: Their new model matches (or beats) the quality of the previous state-of-the-art models while using roughly a third of the compute.
- Real-Time Speed: Because it doesn't re-process the scene for every new view, it can generate new angles much faster, making it viable for real-time applications like VR or video games.
- New Record: They set a new state of the art for image quality on standard benchmarks, beating models that rely on explicit 3D geometry.
The Big Picture
This paper is a blueprint for the future of AI vision. It tells us that we don't need to build bigger, heavier, and more expensive models to get better results. Instead, we need to change how we train them and how we structure them.
By switching from a "re-do everything" approach to a "prepare once, serve many" approach, and by understanding that the total volume of practice matters more than the number of unique examples, we can build AI that sees the world clearly without burning a hole in our electricity bill.