P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

Imagine you are trying to send a massive, high-definition movie to a friend, but their internet connection is shaky. Sometimes it's fast, sometimes it's slow. You want them to be able to start watching immediately, even if the picture is a bit blurry, and then have the quality get sharper and sharper as more data arrives, without having to restart the video.

This is the problem of Scalable Coding. For decades, engineers have tried to solve this, but existing methods are either too heavy (like a giant, rigid file) or too complex (like a black box that you can't easily edit).

Enter P-GSVC, a new technology from researchers at the National University of Singapore. Think of it as a smart, layered painting system that uses "Gaussian Splats" (which are basically tiny, fuzzy, 2D ovals of color) to build images and videos.

Here is how P-GSVC works, explained through simple analogies:

1. The Old Way: The "Pile of Bricks" Problem

Imagine you are building a house out of bricks.

The Naive Approach: You build the whole house perfectly first, then you try to take away the "least important" bricks to make a smaller version for a poor internet connection.
- The Result: If you take away the wrong bricks, you don't just get a smaller house; you get a house with holes in the roof and missing walls. The image looks broken.
The Sequential Approach: You build the foundation (Layer 1) and freeze it. Then you build the second floor (Layer 2) on top of it.
- The Result: The foundation was built without knowing the second floor existed. When you add the second floor, the first floor doesn't quite fit right. The whole structure is wobbly, and the final house isn't as good as it could be.

2. The P-GSVC Solution: The "Team of Painters"

P-GSVC changes the game. Instead of building layers one by one or taking things away, it uses a Joint Training Strategy.

Imagine a team of three painters working on a giant mural:

Painter A (The Base Layer): They are told to paint the broad, blurry shapes of the scene (the sky, the mountains, the general outline of a person).
Painter B (The Enhancement Layer 1): They are told to add details like the color of the sky and the shape of the trees.
Painter C (The Enhancement Layer 2): They are told to add the tiny details, like leaves on the trees and the person's facial features.

The Magic Trick: In the old way, Painter A would finish, lock their door, and Painter B would come in later. In P-GSVC, all three painters work together at the same time.

They constantly check each other's work.
Painter A knows, "Oh, Painter B is going to add tree details, so I shouldn't paint the sky too dark, or it will clash."
Painter B knows, "Painter A is laying the groundwork, so I need to make sure my details fit perfectly on top of those shapes."

Because they are trained together (simultaneously), every layer is perfectly compatible with the others.

3. How It Works in Real Life

When you stream a video using P-GSVC:

Low Bandwidth: The system sends only Painter A's work. You see a low-resolution, slightly blurry version of the video. But crucially, there are no holes. The whole scene is there, just fuzzy.
Medium Bandwidth: The system sends Painter A + Painter B. The video gets sharper. The colors pop, and the shapes become clear.
High Bandwidth: The system sends Painter A + B + C. You get the full, crystal-clear, high-definition masterpiece.

4. Why Is This a Big Deal?

The researchers found that if you train the layers separately (the old way), the "painters" get in each other's way. The training error (the "loss") swings up and down sharply, and the final picture gets stuck in a "local minimum": a "good enough" state from which it can't reach "great."

By training them together, P-GSVC:

Fixes the Holes: You never see a broken image, even at low quality.
Improves Quality: It makes the final result measurably better than the old methods (up to 2.6 dB higher PSNR for images and up to 1.9 dB for videos; PSNR is a standard measure of how close a reconstruction is to the original).
Works for Both Photos and Videos: It treats a photo like a video with only one frame, so it's a universal solution.

The Bottom Line

P-GSVC is like upgrading from a rigid, one-size-fits-all file format to a smart, adaptive LEGO set. Whether your internet is a trickle or a firehose, you get a complete, coherent picture that gets better and better as more pieces arrive, without ever breaking the structure. It bridges the gap between old-school video compression and the future of AI-driven media.

Here is a detailed technical summary of the paper "P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video."

1. Problem Statement

The paper addresses the challenge of creating scalable, progressive representations for images and videos using 2D Gaussian Splatting (2DGS). While 3D Gaussian Splatting (3DGS) has shown success in 3D reconstruction, extending 2DGS to support scalable coding (where a base layer provides a coarse reconstruction and enhancement layers progressively improve quality/resolution) presents significant hurdles:

Interdependency of Splats: In standard 2DGS, splats are jointly trained to overfit the highest-fidelity input. They are highly interdependent; removing a subset of splats (even those with low individual contribution scores) to create a "coarse" layer often results in severe artifacts, such as holes and broken structures, because the remaining splats cannot reconstruct the scene without the removed ones.
Optimization Conflicts in Layer-wise Training: A naive approach to scalability is sequential layer-wise training (training a base layer, freezing it, then training enhancement layers on top). The authors demonstrate that this leads to cross-layer optimization conflicts. Because the objectives of the base layer (coarse structure) and enhancement layers (fine details) differ, training them sequentially causes unstable convergence, sharp gradient fluctuations, and entrapment in suboptimal local minima.
Limitations of Existing Methods: Current learning-based scalable codecs often rely on implicit neural representations (hard to edit/control) or sequential training strategies that fail to maintain scene integrity at lower bitrates.
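The interdependency problem in the first point can be illustrated with a toy experiment. The sketch below is not the paper's code: it assumes isotropic Gaussians, additive accumulation, and a hypothetical contribution score (the splat's peak weight). It prunes the weaker half of a jointly placed set of splats and measures how many pixels lose more than half of their rendered value.

```python
import numpy as np

def render(centers, weights, sigma=3.0, size=32):
    """Additively accumulate isotropic 2D Gaussians on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    img = np.zeros((size, size))
    for (cx, cy), w in zip(centers, weights):
        img += w * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return img

# A 4x4 lattice of splats: strong (1.0) and weak (0.5) ones in a checkerboard.
grid = range(4, 32, 8)  # 4, 12, 20, 28
centers = np.array([(x, y) for x in grid for y in grid], dtype=float)
idx = np.arange(len(centers))
weights = np.where((idx // 4 + idx % 4) % 2 == 0, 1.0, 0.5)

# Naive pruning: keep only the half with the highest (hypothetical) score.
keep = np.argsort(weights)[len(weights) // 2:]

full = render(centers, weights)
pruned = render(centers[keep], weights[keep])

# A pixel counts as a "hole" if pruning removed more than half of its value.
hole_fraction = float((pruned < 0.5 * full).mean())
print(f"fraction of pixels that became holes: {hole_fraction:.2f}")
```

Each weak splat is the main support for its own neighborhood, so removing it leaves a gap that the remaining splats were never optimized to cover.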

2. Methodology: P-GSVC

The authors propose P-GSVC (Progressive Gaussian Splat Video Coding), a framework that organizes 2D Gaussian splats into a Base Layer ($L_0$) and multiple Enhancement Layers ($\Delta L_1, \Delta L_2, \dots$).

A. Layered Representation

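Structure: The input video (or image) is represented by nested sets of Gaussians. The splat set at level $\ell$ is the union of the base layer and the first $\ell$ enhancement layers, $L_\ell = L_0 \cup \Delta L_1 \cup \dots \cup \Delta L_\ell$. Writing $\hat{F}_0$ for the rendering of the base layer and $\Delta \hat{F}_i$ for the contribution of enhancement layer $i$, the reconstruction at level $\ell$ is:
$\hat{F}_\ell = \hat{F}_0 + \sum_{i=1}^{\ell} \Delta \hat{F}_i$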
Rendering: Each level is rendered independently using a differentiable rasterizer. The base layer provides a complete but coarse scene, while enhancement layers add high-frequency details without modifying the base splats.
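A minimal sketch of this layered reconstruction follows. It assumes isotropic Gaussians rendered by plain additive accumulation, which is a simplification of a real differentiable rasterizer. The helper names `render_layer` and `reconstruct` and the `(cx, cy, sigma, amplitude)` row layout are hypothetical.

```python
import numpy as np

def render_layer(splats, size=16):
    """Render one layer; each row of `splats` is (cx, cy, sigma, amplitude)."""
    ys, xs = np.mgrid[0:size, 0:size]
    img = np.zeros((size, size))
    for cx, cy, sigma, amp in splats:
        img += amp * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return img

def reconstruct(layers, level):
    """F_hat_level = F_hat_0 + sum over i = 1..level of Delta F_hat_i."""
    return sum(render_layer(layer) for layer in layers[: level + 1])

layers = [
    np.array([[8.0, 8.0, 5.0, 1.0]]),                          # base: one broad splat
    np.array([[4.0, 4.0, 2.0, 0.3], [12.0, 12.0, 2.0, 0.3]]),  # enhancement layer 1
    np.array([[8.0, 3.0, 1.0, 0.2]]),                          # enhancement layer 2
]

f0, f1, f2 = (reconstruct(layers, lvl) for lvl in range(3))
```

Under additive accumulation, rendering the union of the splat sets gives the same image as summing the per-layer renderings, and the base splats are never modified when a layer is added.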

B. Joint Training Strategy (Core Innovation)

To solve the optimization conflicts and instability of sequential training, P-GSVC introduces a Joint Training Strategy:

Simultaneous Optimization: Instead of freezing lower layers, the model optimizes all layers simultaneously.
Cyclic Level Selection: In each training iteration, the model computes a joint loss that supervises two fidelity levels simultaneously:
1. The full reconstruction (all layers).
2. An intermediate reconstruction (base + a specific subset of enhancement layers).
Cyclic Switching: The target intermediate level is selected in a cyclic order (e.g., $L_1, L_2, L_1, L_2, \dots$) rather than randomly. This ensures that the gradient field remains stable during transitions between optimization objectives, preventing the model from overfitting to a specific layer or suffering from gradient spikes when switching targets.
Loss Function: The total loss is the sum of L2 losses across the selected levels:
$\mathcal{L}_t = \mathcal{L}_2(\hat{f}^L_t, f^L_t) + \mathcal{L}_2(\hat{f}^{\ell_k}_t, f^{\ell_k}_t)$
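where $\hat{f}^{\ell}_t$ and $f^{\ell}_t$ denote the reconstruction and the target of frame $t$ at level $\ell$, $L$ is the top (full) level, and $\ell_k$ is the cyclically selected intermediate level.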
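The joint, cyclic schedule can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the authors' training code: it uses a 1D signal, fixed Gaussian bases whose amplitudes are the only trainable parameters, the same target at every level, plain gradient descent, and a cycle over all levels below the top one. All function names are hypothetical.

```python
import numpy as np

def gaussian_basis(n, sigma, x):
    """n fixed 1D Gaussians with evenly spaced centers; only amplitudes are trained."""
    centers = np.linspace(0.0, 1.0, n)
    return np.exp(-((x[None, :] - centers[:, None]) ** 2) / (2 * sigma ** 2))

def render(amps, bases, level):
    """Reconstruction at `level`: base layer plus the first `level` enhancement layers."""
    return sum(a @ b for a, b in zip(amps[: level + 1], bases[: level + 1]))

def train_joint(target, bases, steps=2000, lr=0.05):
    top = len(bases) - 1                  # index of the full level (needs >= 2 layers)
    amps = [np.zeros(b.shape[0]) for b in bases]
    n = target.size
    for step in range(steps):
        k = step % top                    # cyclic intermediate level: 0, 1, ..., top-1, 0, ...
        err_full = render(amps, bases, top) - target
        err_k = render(amps, bases, k) - target
        for i, b in enumerate(bases):
            grad = (2.0 / n) * (b @ err_full)        # every layer contributes to the full level
            if i <= k:
                grad += (2.0 / n) * (b @ err_k)      # only layers 0..k contribute to level k
            amps[i] -= lr * grad                     # no layer is frozen
    return amps

x = np.linspace(0.0, 1.0, 128)
target = np.sin(2 * np.pi * x) + 0.3 * np.sin(12 * np.pi * x)
bases = [gaussian_basis(4, 0.15, x), gaussian_basis(8, 0.06, x), gaussian_basis(16, 0.03, x)]

amps = train_joint(target, bases)
mse = [float(np.mean((render(amps, bases, lvl) - target) ** 2)) for lvl in range(len(bases))]
print(mse)  # one MSE per level, base first
```

The point of the sketch is the gradient routing: every update sees the full-level error, and the layers up to the selected level additionally see that level's error, so the base layer is shaped by both objectives instead of being fixed before the enhancement layers exist.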

C. Video-Specific Mechanisms

For video, P-GSVC integrates mechanisms from GSVC to handle temporal redundancy:

Temporal Prediction: P-frames are initialized from the previous frame's splats.
Gaussian Splat Pruning (GSP): Removes low-contribution splats to control bitrate.
Gaussian Splat Augmentation (GSA): Injects new splats to capture dynamic motion or scene changes.
Dynamic Key-frame Selection (DKS): Detects scene cuts to insert new I-frames.
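The first three mechanisms can be sketched as a single P-frame initialization step. This is a hedged illustration, not GSVC's or P-GSVC's implementation: the contribution score (|amplitude| times sigma squared), the residual-driven placement of new splats, and the function name `init_p_frame` are all assumptions.

```python
import numpy as np

def init_p_frame(prev_splats, residual, prune_frac=0.1, n_new=4, sigma=1.5):
    """Build the starting splat set for a P-frame.

    prev_splats: (N, 4) rows of (cx, cy, sigma, amplitude) from the previous frame.
    residual:    (H, W) absolute error of the previous splats on the new frame.
    """
    # Temporal prediction: start from the previous frame's splats.
    # Pruning: drop the splats with the smallest stand-in contribution score.
    score = np.abs(prev_splats[:, 3]) * prev_splats[:, 2] ** 2
    n_keep = len(prev_splats) - int(prune_frac * len(prev_splats))
    kept = prev_splats[np.argsort(score)[::-1][:n_keep]]

    # Augmentation: add new splats at the pixels with the largest residual.
    flat = np.argsort(residual, axis=None)[::-1][:n_new]
    ys, xs = np.unravel_index(flat, residual.shape)
    new = np.column_stack([xs, ys, np.full(n_new, sigma), residual[ys, xs]])

    return np.vstack([kept, new])

prev = np.column_stack([
    np.arange(10.0), np.arange(10.0), np.full(10, 2.0), np.linspace(0.1, 1.0, 10)
])
residual = np.zeros((8, 8))
residual[2, 5] = 1.0          # a region the previous splats fail to explain
splats = init_p_frame(prev, residual)
```

The returned set would then be refined by optimization on the new frame; that step is omitted here.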

D. Quantization

The framework employs temporal-aware quantization (similar to GSVC) to compress parameters:

I-frames: All parameters quantized directly.
P-frames: Parameter differences relative to the reference frame are quantized.
Techniques include reduced floating-point precision for positions, asymmetric quantization for Cholesky vectors, and Vector Quantization (VQ) for colors.
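The difference between I-frame and P-frame quantization can be shown with a small example. The sketch below uses generic asymmetric uniform quantization with an assumed 6-bit width and synthetic data; the exact bit widths and parameter layout used by GSVC or P-GSVC are not given in this summary, and the function names are hypothetical.

```python
import numpy as np

def quantize_asym(values, bits=6):
    """Asymmetric uniform quantization: map [min, max] onto integers 0 .. 2**bits - 1."""
    lo, hi = float(values.min()), float(values.max())
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # fall back to 1.0 for constant input
    codes = np.round((values - lo) / scale).astype(np.int32)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

rng = np.random.default_rng(0)
prev = rng.normal(size=(100, 3))                 # stand-in Cholesky vectors, reference frame
curr = prev + 0.01 * rng.normal(size=(100, 3))   # small temporal change in the next frame

# I-frame style: quantize the parameters directly.
i_codes, i_scale, i_lo = quantize_asym(curr)
i_rec = dequantize(i_codes, i_scale, i_lo)

# P-frame style: quantize only the difference to the reference frame.
# (For simplicity the unquantized reference is used; a codec would use the decoded one.)
p_codes, p_scale, p_lo = quantize_asym(curr - prev)
p_rec = prev + dequantize(p_codes, p_scale, p_lo)

print("I-frame step size:", i_scale, "P-frame step size:", p_scale)
```

Because frame-to-frame differences span a much narrower range than the parameters themselves, the same number of bits yields a finer step size for P-frames.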

3. Key Contributions

First Scalable 2DGS Framework: The authors present P-GSVC as the first framework to provide a unified, layered progressive solution for both scalable image and video coding using 2D Gaussian splats.
Joint Training Strategy: The authors identify that sequential layer-wise training fails due to optimization conflicts and propose a joint, cyclic training strategy. This aligns optimization trajectories across layers, ensuring stable convergence and high-quality intermediate reconstructions.
Performance Gains: The joint training strategy significantly outperforms sequential methods, achieving up to 2.6 dB PSNR improvement for images and 1.9 dB PSNR improvement for videos compared to state-of-the-art sequential baselines (like LIG and standard GSVC layer-wise training).
Artifact-Free Scalability: Unlike naive pruning methods that leave holes in low-bitrate reconstructions, P-GSVC maintains scene integrity at all levels, allowing for seamless progressive decoding.
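To put the reported dB gains in perspective, PSNR is a logarithmic function of mean squared error (MSE), so a fixed dB gain corresponds to a fixed MSE ratio. The short sketch below uses the standard PSNR definition; the helper `mse_ratio` is a name introduced here for illustration.

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mse_ratio(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

print(round(mse_ratio(2.6), 2))  # 1.82
print(round(mse_ratio(1.9), 2))  # 1.55
```

In other words, a 2.6 dB gain means the mean squared error is about 1.8 times lower, and a 1.9 dB gain means it is about 1.5 times lower.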

4. Experimental Results

The paper evaluates P-GSVC on the Kodak and DIV-HR image datasets and the UVG video dataset.

Image Scalability:
- Compared to LIG (a sequential 2DGS baseline), P-GSVC achieves ~2.0–2.6 dB higher PSNR and significantly better MS-SSIM and LPIPS scores across all Gaussian budgets.
- Visual results show P-GSVC eliminates the "holes" and structural artifacts seen in pruning-based baselines.
Video Scalability:
- Quality Scalability: P-GSVC outperforms the Sequential method (which freezes layers) by nearly 2 dB at the highest quality levels. While the Sequential method's base layer is strong, its enhancement layers fail to improve quality due to optimization conflicts. P-GSVC's joint training allows enhancement layers to add significant value.
- Resolution Scalability: P-GSVC demonstrates superior normalized quality scores (VMAF, MS-SSIM) across different resolutions compared to sequential training.
- Rate-Distortion: P-GSVC pays a quality penalty of approximately 1.1 dB relative to a non-scalable "Monolithic" upper bound (where each level is trained independently). Even so, it significantly narrows the gap between standard scalable codecs (like SHVC) and sequential 2DGS methods.

5. Significance and Impact

Bridging the Gap: P-GSVC bridges the gap between classical scalable codecs (like HEVC-SHVC) and modern neural codecs. It offers the explicit, editable nature of Gaussian splats with the scalability required for adaptive streaming.
Adaptive Delivery: The framework enables efficient delivery over heterogeneous networks and devices. A client can decode only the base layer for a low-resolution preview on a mobile device or decode all layers for high-fidelity rendering on a desktop, all from a single bitstream.
Paradigm Shift: The work challenges the assumption that progressive coding must rely on sequential training or implicit representations. It shows that explicit primitives (splats) can be effectively optimized for scalability through joint training, opening new avenues for real-time, scalable 3D/2D media transmission.
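The adaptive delivery scenario can be sketched as a simple client-side rule: decode as many layers as the available bandwidth allows, and never fewer than the base layer. The function name and the per-layer bitrates below are hypothetical; the summary does not specify a selection policy.

```python
def select_level(layer_kbps, available_kbps):
    """Return the highest level whose cumulative bitrate fits the budget (0 = base only)."""
    total = 0.0
    level = -1
    for i, rate in enumerate(layer_kbps):
        total += rate
        if total > available_kbps:
            break
        level = i
    return max(level, 0)  # always deliver at least the base layer

# Hypothetical bitrates: base layer, enhancement layer 1, enhancement layer 2.
layer_kbps = [500, 700, 1200]
for bandwidth in (400, 1300, 2400):
    print(bandwidth, "kbps -> decode up to level", select_level(layer_kbps, bandwidth))
```

Because the layers are nested, a client that moves to a higher level only needs the additional enhancement layers, not a new copy of the stream.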

Limitations & Future Work:
The current encoding time is high (approx. 720 seconds per frame) due to iterative optimization, though rendering is real-time (~1200 fps). The authors note that parallelization on multi-GPU setups could significantly accelerate encoding, making it viable for offline video-on-demand scenarios.