VeGaS: Video Gaussian Splatting

Imagine you have a video, like a movie clip of a dog running through a park. Traditionally, computers store this video as a stack of individual pictures (frames) played one after another. If you want to edit the video—say, make the dog twice as big or slow it down—you have to manually tweak every single picture, which is slow and often looks fake.

Newer methods use "neural networks" (AI brains) to learn the video as a smooth, continuous flow. This is great for compression (making the file small), but it's like trying to edit a smoothie: you can't easily pick out just the strawberry to make it bigger without ruining the whole drink.

VeGaS (Video Gaussian Splatting) is a new way to handle videos that combines the best of both worlds: it keeps the video small and smooth, but lets you edit it like a collection of individual, movable objects.

Here is how it works, using some simple analogies:

1. The Old Way vs. The New Way

The Old Way (3D Gaussian Splatting): Imagine you are trying to describe a 3D scene using thousands of floating, glowing fog balls (Gaussians). If the scene is static (like a statue), you just place the fog balls and you're done.
The Problem with Videos: If the statue starts dancing, the fog balls need to move. The previous best method, Video Gaussian Representation (VGR), treated the video like a rigid puppet. It could stretch or slide the fog balls, but it couldn't make them follow twisting or curving paths. It was like trying to dance with a stiff mannequin.

2. The Secret Ingredient: "Folded-Gaussians"

The authors of VeGaS invented a new type of fog ball called a Folded-Gaussian.

The Analogy: Imagine a piece of paper with a straight line drawn on it. That's a normal Gaussian. Now, imagine you crumple that paper, fold it, and twist it into a complex shape. That's a Folded-Gaussian.
Why it matters: Real life is messy. When a person waves their hand, their arm doesn't just move in a straight line; it curves and rotates. A normal fog ball can't capture that curve. A Folded-Gaussian is flexible enough to "fold" along the curve of the movement.
The Magic Trick: Even though the overall shape is twisted and complex, if you "slice" it at a specific moment in time (like looking at one specific frame of the video), what you get is an ordinary, simple fog ball (an ellipse-shaped blob in the picture). This allows the computer to render a sharp, clear image for every single frame while still understanding the complex movement in between.

3. How VeGaS Edits Videos

Because VeGaS treats the video as a collection of these flexible, 3D fog balls rather than a stack of 2D pictures, editing becomes incredibly easy and realistic.

Global Changes: Want to make the whole video play in slow motion? You just slow down the "time" variable, and the fog balls flow naturally.
Object Manipulation: Want to make the dog in the video jump higher? You can grab the specific fog balls representing the dog and pull them up. Because the "folds" in the math understand the movement, the dog stretches and squishes realistically, just like a real object.
Frame Interpolation: If you want to add a new frame between two existing ones (to make the video smoother), VeGaS doesn't have to blend the two neighboring pictures; it simply "slices" the folded fog ball at the exact middle point. Because the fold follows the curved path of the motion, the new frame tends to look natural.

4. The Results

The researchers tested VeGaS on many videos (like a bear, cows, and people breakdancing).

Quality: It recreated the videos with higher clarity (sharper details) than previous AI methods.
Editing: It allowed them to multiply objects (make two dogs out of one), scale them, or change specific frames without the video looking glitchy or blurry.

Summary

Think of VeGaS as upgrading from a stack of stiff cardboard cutouts (old video methods) to a bunch of magical, shape-shifting clay blobs (Folded-Gaussians).

Old Method: Good for storage, bad for editing.
VeGaS: Good for storage, and you can stretch, twist, and reshape the video content naturally because the underlying math is flexible enough to handle the "folds" of real-world motion.

It's like giving a video editor a set of superpowers to manipulate time and space within a video, all while keeping the file size small and the image quality high.

1. Problem Statement

The paper addresses the limitations of current video representation methods, specifically regarding the trade-off between reconstruction quality and editability:

Implicit Neural Representations (INRs): While effective for compressing video and generating continuous representations (mapping coordinates + time to RGB), INRs are generally unsuitable for editing. They treat the video as a "black box" function, making it difficult to isolate and modify specific objects or frames.
Existing Gaussian Splatting (3DGS) Approaches: Previous attempts to apply 3D Gaussian Splatting to video, such as the Video Gaussian Representation (VGR), allow for editing. However, VGR relies on linear transformations and translations to model motion. This restricts its ability to capture complex, nonlinear dynamics (e.g., rapid deformations, complex object interactions) often found in real-world video streams.

2. Methodology: VeGaS

The authors propose Video Gaussian Splatting (VeGaS), a framework that models video data as a 3D space where time is the third dimension. The core innovation lies in the introduction of Folded-Gaussians.

A. Folded-Gaussian Distributions

To overcome the linearity constraint of standard Gaussians, the authors introduce a novel family of distributions called Folded-Gaussians.

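Concept: In the space-time volume, a standard 3D Gaussian is an ellipsoid, so the position it describes can only drift along a straight line as time changes. A Folded-Gaussian generalizes this by applying a time-dependent transformation to the spatial variables.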
Mechanism:
- Let $x = (s, t)$ represent space and time.
- The distribution conditions the spatial variable $s$ on the time variable $t$.
- The conditional distribution $s|t$ is a Gaussian whose mean is shifted by $f(m_t - t)$ and whose covariance is scaled by $a(t)$.
- Formulation:
  $s|t \sim \mathcal{N}(m_s + f(m_t - t), a(t)\Sigma_s)$
- Here, $m_s$ and $\Sigma_s$ are the spatial mean and covariance, $m_t$ is the temporal mean, $f$ is typically a polynomial with learned coefficients that captures nonlinear shifts, and $a(t)$ is a likelihood-based scaling function (derived from the time marginal distribution) that handles elements appearing only in specific time windows (e.g., objects entering and leaving the frame).
Result: While the marginal distribution of time and the conditional distribution of space are both Gaussian, the resulting joint distribution is non-Gaussian, allowing it to model complex, curved trajectories and nonlinear dynamics.
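The conditioning step above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the dictionary layout, the names `poly` and `condition_on_time`, the choice of a polynomial without a constant term, and the specific form of $a(t)$ (the time-marginal density divided by its peak value) are all assumptions made here for the example.

```python
import math

def poly(coeffs, u):
    # Polynomial with no constant term (an assumption), so poly(coeffs, 0) == 0.
    return sum(c * u ** (j + 1) for j, c in enumerate(coeffs))

def condition_on_time(g, t):
    """Slice one Folded-Gaussian at time t, returning the 2D (mean, covariance)."""
    u = g["m_t"] - t
    # Mean of s | t: m_s + f(m_t - t), with one polynomial per spatial axis.
    mean = tuple(m + poly(c, u) for m, c in zip(g["m_s"], g["f_coeffs"]))
    # Assumed a(t): time-marginal density divided by its peak, so a(m_t) == 1.
    a = math.exp(-0.5 * (t - g["m_t"]) ** 2 / g["var_t"])
    cov = [[a * v for v in row] for row in g["sigma_s"]]
    return mean, cov

# Hypothetical parameters for a single primitive.
blob = {
    "m_s": (0.5, 0.5),
    "sigma_s": [[0.01, 0.0], [0.0, 0.02]],
    "m_t": 0.3,
    "var_t": 0.04,
    "f_coeffs": ([1.0, -2.0], [0.0, 0.5]),  # one coefficient list per spatial axis
}

for t in (0.1, 0.3, 0.5):
    mean, cov = condition_on_time(blob, t)
    print(t, mean, cov[0][0])
```

With these assumptions, the slice at $t = m_t$ returns exactly $m_s$ and $\Sigma_s$, while slices further from $m_t$ shift along a curved path and shrink, which is how a primitive can fade out of frames where its content is absent.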

B. The VeGaS Architecture

Representation: The video is treated as a sequence of frames within a 3D space-time volume.
Modeling: Instead of storing independent Gaussians for each frame, VeGaS stores a set of 3D Folded-Gaussians.
Frame Rendering: To render a specific frame at time $t_i$, each 3D Folded-Gaussian is conditioned on $t_i$. This yields one 2D Gaussian per primitive, and together these represent the content of that specific frame.
Dynamic Frame Fitting: Unlike methods that assume fixed frame intervals, VeGaS learns the exact occurrence time of frames ($t_k$) via a dynamic fitting function $f_t(k) = \sum \sigma(w_i)$, optimizing the temporal alignment for better reconstruction.
Integration with MiraGe: The model utilizes the MiraGe representation (a 2D image extension of 3DGS) for the individual frames. This allows the use of "flat" Gaussians (triangles) to represent the 2D image plane, facilitating efficient rendering and editing.
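One way to read the dynamic frame fitting formula is as a running sum of positive step sizes, which keeps the learned frame times in increasing order. The sketch below assumes that reading: it takes $\sigma$ to be the logistic sigmoid, sums over $i \le k$, and rescales so the last frame lands at $t = 1$. None of these details are stated in the text above, and `frame_times` and the example weights are hypothetical.

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def frame_times(weights):
    """Cumulative sums of sigmoid(w_i): positive steps give increasing times."""
    times, total = [], 0.0
    for w in weights:
        total += sigmoid(w)
        times.append(total)
    # Rescale so the last frame sits at t = 1 (a normalization assumed here).
    return [t / total for t in times]

weights = [0.0, 0.0, 2.0, -2.0]  # hypothetical learned values
times = frame_times(weights)

# Time at which an in-between frame for frames 1 and 2 would be sliced.
midpoint = 0.5 * (times[1] + times[2])
print(times, midpoint)
```

Equal weights reproduce uniform frame spacing, so under this reading the fixed-interval assumption is a special case that the optimizer is free to depart from.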

C. Editing Capabilities

Because the video is represented by explicit geometric primitives (Gaussians) rather than neural network weights, VeGaS supports:

Global Modifications: Scaling, multiplying, or moving objects across the entire video.
Local Modifications: Selecting a single frame or specific objects to modify without affecting the rest of the sequence.
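Because the primitives are explicit, an edit amounts to changing the parameters of a chosen subset of Gaussians. The following sketch illustrates scaling, moving, and duplicating objects; the dictionary layout and the helper names `scale_gaussian`, `translate_gaussian`, and `edit` are hypothetical and are not taken from the VeGaS code base. It does not cover single-frame edits.

```python
def scale_gaussian(g, k, center=(0.0, 0.0)):
    """Scale one Gaussian spatially by factor k about `center` (time is untouched)."""
    return {
        **g,
        "m_s": tuple(c + k * (m - c) for m, c in zip(g["m_s"], center)),
        # Covariance scales with k squared; the motion shift f scales with k.
        "sigma_s": [[k * k * v for v in row] for row in g["sigma_s"]],
        "f_coeffs": tuple([k * a for a in axis] for axis in g["f_coeffs"]),
    }

def translate_gaussian(g, offset):
    """Move one Gaussian by `offset` in the image plane."""
    return {**g, "m_s": tuple(m + d for m, d in zip(g["m_s"], offset))}

def edit(gaussians, selected, fn):
    """Apply fn to the Gaussians whose index is in `selected`; keep the others."""
    return [fn(g) if i in selected else g for i, g in enumerate(gaussians)]

# Two hypothetical primitives with first-degree motion polynomials.
scene = [
    {"m_s": (0.4, 0.5), "sigma_s": [[0.01, 0.0], [0.0, 0.01]], "f_coeffs": ([0.3], [0.0])},
    {"m_s": (0.8, 0.2), "sigma_s": [[0.02, 0.0], [0.0, 0.02]], "f_coeffs": ([0.0], [0.1])},
]

# Object-level edit: double the size of Gaussian 0 about its own mean.
bigger = edit(scene, {0}, lambda g: scale_gaussian(g, 2.0, center=g["m_s"]))
# "Multiplication": append a shifted copy of every Gaussian.
doubled = scene + [translate_gaussian(g, (0.1, 0.0)) for g in scene]
```

Scaling the motion coefficients together with the mean and covariance keeps the edited object moving along a proportionally enlarged path in every frame, rather than only in the frame where the edit was made.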

3. Key Contributions

Folded-Gaussians: A novel mathematical family of distributions capable of modeling nonlinear structures in video streams while maintaining the property that conditional distributions (frames) remain standard Gaussians.
VeGaS Model: A complete framework for video processing that integrates Folded-Gaussians into the 3DGS pipeline, enabling high-fidelity reconstruction and realistic editing.
Dynamic Frame Fitting: An optimization procedure that learns the precise timing of frames rather than assuming uniform spacing, improving reconstruction accuracy.
State-of-the-Art Performance: Demonstrated superiority over existing INR and Gaussian-based video models in both reconstruction metrics and editing flexibility.

4. Experimental Results

The authors evaluated VeGaS on the DAVIS and Bunny datasets, comparing it against state-of-the-art baselines including OmniMotion, CoDeF, VGR, and several NeRV-family models (NeRV, E-NeRV, HNeRV, DNeRV).

Frame Reconstruction:
- VeGaS achieved the highest PSNR scores across all tested videos in the DAVIS dataset.
- In the DAVIS benchmark (Table 1), VeGaS-Full achieved an average PSNR of 31.08, outperforming VGR (28.44) and CoDeF (27.75).
- In the comparison with NeRV-family models (Table 2), VeGaS outperformed all of these baselines, achieving an average PSNR of 32.42 compared to DNeRV's 29.66.
Frame Interpolation:
- Qualitative results (Figure 5) show that VeGaS produces smoother and more accurate interpolated frames between $t$ and $t+1$ compared to VGR, due to its ability to model nonlinear motion paths.
Editing:
- The paper demonstrates successful global scaling, multiplication, and local object manipulation (Figures 2 and 4), tasks that are difficult or impossible with INRs and limited in VGR.
Ablation Studies:
- The best results were obtained with a batch size of 3, a polynomial degree of 7 for the function $f$, and an initial count of 0.50M Gaussians.
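For readers unfamiliar with the metric, PSNR is computed from the mean squared error between a reference frame and its reconstruction, and a difference in decibels corresponds to a ratio of errors. The sketch below uses the standard PSNR definition; the function names and the toy pixel values are illustrative only, and it assumes pixel intensities normalized to $[0, 1]$.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at the same peak value."""
    return 10 ** (psnr_gain_db / 10)

# Toy example: every pixel is off by 0.1 on a [0, 1] intensity scale.
print(psnr([0.0, 0.0], [0.1, 0.1]))

# Gap between the reported VeGaS-Full and VGR averages on DAVIS.
print(mse_ratio(31.08 - 28.44))
```

For a single pair of PSNR values, a gap of 2.64 dB corresponds to roughly 1.8 times lower mean squared error. Averages over several videos do not translate exactly, so this is only a rough guide to the size of the improvement.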

5. Significance

By adapting the 3D Gaussian Splatting framework to handle nonlinear dynamics through Folded-Gaussians, VeGaS bridges the gap between high-quality video compression and flexible, object-level editing.

Practical Impact: It enables applications such as high-quality video editing, frame interpolation, and content manipulation without the need for complex mesh extraction or rigid linear assumptions.
Theoretical Impact: It introduces a new probabilistic distribution (Folded-Gaussian) that expands the capabilities of Gaussian Splatting beyond static or linearly deforming scenes, offering a new tool for modeling complex spatiotemporal data.

The code is publicly available at https://github.com/gmum/VeGaS.