Decoupling Motion and Geometry in 4D Gaussian Splatting

Imagine you are trying to film a chaotic scene: a dancer spinning, a flame flickering, and a steak sizzling on a grill. Your goal is to create a 3D movie that you can watch from any angle, at any moment in time.

A popular recent technique for this is called Gaussian Splatting. Think of this like building a scene out of millions of tiny, fuzzy, 3D "clouds" (Gaussians). Each cloud has a position, a color, and a shape. By layering these clouds, you can create a photorealistic image.

However, when you try to make these clouds move (like a 4D movie), the old method (called 4DGS) had a major flaw. It treated the cloud's shape and its movement as if they were glued together in a single package.

The Problem: The "Glued" Package

Imagine you are trying to describe a runner.

The Old Way (4DGS): You say, "The runner is a cloud whose shape never changes, and it moves in a straight line at a constant speed."
- The Issue: If the runner starts to twist, turn, or accelerate, the system gets confused. Because the shape and movement are glued together, trying to make the runner twist distorts their shape. The runner might suddenly look like a stretched-out blob or a jagged mess. This creates visual "glitches" or artifacts in the video.

The Solution: VeGaS (Velocity-based Decoupling)

The authors of this paper propose a new framework called VeGaS. Their big idea is to decouple (unstick) the movement from the shape.

Think of it like a dance troupe:

The Dancers (Geometry): These are the clouds. Their job is to keep their specific shape (a sphere, a cube, a weird blob) and just stand there or wiggle slightly.
The Choreography (Motion): This is a separate script that tells the dancers where to go.

In VeGaS, they introduce two main tools to make this work:

1. The "Galilean Shearing" Matrix (The Flexible Choreography)

In physics, a "Galilean transformation" describes how the same motion looks to two observers moving at a steady speed relative to each other. Mathematically it is a shear of space against time, and the authors use that shearing as their tool.

The Analogy: Imagine a deck of cards. If you push the top of the deck to the right, the cards slide over each other, but the shape of each individual card doesn't change. They just shift position.
How it helps: VeGaS uses this to tell the clouds, "Move along this crazy, curvy path (non-linear motion) at varying speeds." Crucially, while the clouds slide along this path, their internal shape remains perfectly intact. This allows the system to handle complex movements like a spinning dancer or a flickering flame without the clouds getting distorted.

2. The "Geometric Deformation Network" (The Wiggle Room)

Sometimes, the object itself actually changes shape (like a muscle flexing or a flame changing form). The old system couldn't do this well because it assumed each cloud keeps the same shape and orientation the whole time.

The Analogy: Now that the choreography is handled separately, the dancers have a special "wiggle network." This is a small AI brain that looks at the scene and says, "Okay, the flame is changing shape right now, so let's stretch this cloud slightly."
How it helps: This network refines the shape of the clouds independently of where they are moving. It ensures that if a steak is sizzling and changing shape, the system captures that detail without messing up the movement.

The Result: A Cleaner, Sharper Movie

By separating the "where" (motion) from the "what" (shape), VeGaS achieves two things:

No More Glitches: The clouds don't get stretched into weird shapes just because they are moving fast or turning corners.
Better Details: The system can capture fine details, like the individual flickers of a flame or the texture of a steak, much better than the previous methods.

Summary

If the old method was like trying to drive a car where the steering wheel and the engine were bolted together (making it hard to turn without stalling), VeGaS is like installing a modern transmission. It lets the engine (movement) and the steering (shape) work independently, resulting in a smooth, high-definition ride through time and space.

The paper backs this up with experiments, reporting that VeGaS produces clearer, more realistic 4D videos of dancing, flames, and cooking steaks than the earlier methods it is compared against.

1. Problem Statement

Dynamic scene reconstruction aims to synthesize photorealistic images at arbitrary viewpoints and time instants. While 4D Gaussian Splatting (4DGS) has emerged as a leading method for this task by extending 3D Gaussians into the temporal domain, it suffers from a fundamental limitation: the coupling of motion and geometry.

In standard 4DGS, the 4D Gaussian covariance matrix is used to simultaneously model spatial position, shape, orientation, and velocity. This formulation imposes two restrictive assumptions:

Constant Velocity: Motion is modeled as linear with a fixed velocity vector derived from the covariance.
Time-Invariant Geometry: The 3D shape and orientation of the Gaussian are assumed to be static over time, independent of the temporal variable.

These constraints limit the model's ability to represent complex non-linear trajectories (e.g., swinging arms, flowing water) and non-rigid deformations (e.g., muscle movement, fabric wrinkles). Furthermore, because motion and geometry are coupled within a single covariance parameterization, optimizing one often interferes with the other, leading to visual artifacts and degraded reconstruction quality during complex motion fitting.
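
For reference, both assumptions can be read off the standard formula for conditioning a Gaussian on one of its coordinates. The block notation below ($\mu_{xyz}$, $\mu_t$, $\Sigma_{xyz}$, $\Sigma_{xyz,t}$, $\Sigma_{tt}$) is assumed here for illustration and may differ from the paper's own symbols:

$$
\mu = \begin{pmatrix} \mu_{xyz} \\ \mu_t \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{xyz} & \Sigma_{xyz,t} \\ \Sigma_{xyz,t}^\top & \Sigma_{tt} \end{pmatrix}
$$

$$
\mu_{xyz|t} = \mu_{xyz} + \Sigma_{xyz,t}\,\Sigma_{tt}^{-1}\,(t - \mu_t), \qquad
\Sigma_{xyz|t} = \Sigma_{xyz} - \Sigma_{xyz,t}\,\Sigma_{tt}^{-1}\,\Sigma_{xyz,t}^\top
$$

The conditional mean is linear in $t$ with fixed slope $\Sigma_{xyz,t}\,\Sigma_{tt}^{-1}$ (constant velocity), and the conditional covariance does not depend on $t$ (time-invariant geometry). Both are functions of the same $\Sigma$: changing the off-diagonal block $\Sigma_{xyz,t}$ to adjust the velocity also changes $\Sigma_{xyz|t}$ unless $\Sigma_{xyz}$ is adjusted to compensate.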

2. Methodology: VeGaS

The authors propose VeGaS (Velocity-based Decoupling of Motion and Geometry in 4D Gaussian Splatting), a framework that explicitly separates motion modeling from geometric modeling.

A. Motion-Geometric Decoupled Representation

The core innovation is the introduction of a Galilean Shearing Matrix to handle motion independently of geometry.

Galilean Shearing: Inspired by classical mechanics, the authors define a shearing matrix $V$ that maps the static temporal axis to a slanted trajectory in 4D spacetime. This matrix incorporates a time-varying instantaneous velocity $v(t)$.
Congruence Transformation: The original 4D covariance $\Sigma$ is transformed via $\Sigma' = V \Sigma V^\top$.
Theoretical Guarantee (Schur Complement Invariance): The authors prove that while this transformation alters the trajectory of the Gaussian center (enabling non-linear motion), the conditional 3D covariance (which determines the 3D shape and orientation at any specific time $t$) remains invariant.
- Mathematically, the conditional mean $\mu'_{xyz|t}$ includes the time-varying velocity term, allowing for complex trajectories.
- The conditional covariance $\Sigma'_{xyz|t}$ remains identical to the original 4DGS formulation, ensuring that the 3D shape is not distorted by the motion parameters.
Non-linear Trajectory Integration: To model $v(t)$, the system uses a set of learnable velocity anchors sampled across the temporal domain. The instantaneous velocity is interpolated between these anchors, and the cumulative displacement is calculated via efficient numerical integration (trapezoidal rule with prefix sums).
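
The invariance claim and the trajectory integration can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the shear has the block form $V = \begin{pmatrix} I & v \\ 0 & 1 \end{pmatrix}$ with $v$ held fixed at a single time instant, that velocities are interpolated linearly between anchors, and that displacement is measured from the first anchor. All function names are hypothetical.

```python
import numpy as np


def shear_matrix(v):
    """4x4 Galilean shear: x' = x + v * t, t' = t (v is a 3-vector)."""
    V = np.eye(4)
    V[:3, 3] = v
    return V


def conditional_cov(sigma):
    """Schur complement of the time block: the 3D covariance given t."""
    b = sigma[:3, 3:]
    return sigma[:3, :3] - b @ b.T / sigma[3, 3]


def velocity_slope(sigma):
    """Slope of the conditional mean in t (the implied linear velocity)."""
    return sigma[:3, 3] / sigma[3, 3]


def displacement(anchor_times, anchor_vels, t):
    """Integrate a piecewise-linear v(t) from anchor_times[0] to t."""
    # Trapezoid area of each anchor interval, then a prefix sum over intervals.
    # (Recomputed per call here; it could be cached and reused across queries.)
    dt = np.diff(anchor_times)[:, None]
    areas = 0.5 * (anchor_vels[:-1] + anchor_vels[1:]) * dt
    prefix = np.vstack([np.zeros((1, 3)), np.cumsum(areas, axis=0)])
    # Locate the interval containing t and add the partial trapezoid.
    i = int(np.clip(np.searchsorted(anchor_times, t, side="right") - 1,
                    0, len(anchor_times) - 2))
    tau = t - anchor_times[i]
    w = tau / (anchor_times[i + 1] - anchor_times[i])
    v_t = (1.0 - w) * anchor_vels[i] + w * anchor_vels[i + 1]
    return prefix[i] + 0.5 * (anchor_vels[i] + v_t) * tau


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
sigma = A @ A.T + 0.1 * np.eye(4)   # random symmetric positive-definite 4D covariance
v = np.array([0.5, -1.0, 2.0])      # illustrative velocity at one time instant
V = shear_matrix(v)
sigma_sheared = V @ sigma @ V.T     # congruence transformation

shape_unchanged = np.allclose(conditional_cov(sigma), conditional_cov(sigma_sheared))
slope_shift = velocity_slope(sigma_sheared) - velocity_slope(sigma)

times = np.array([0.0, 0.5, 1.0])   # three illustrative velocity anchors
vels = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])
print(shape_unchanged, slope_shift, displacement(times, vels, 0.75))
```

Under these assumptions the shear adds exactly $v$ to the slope of the conditional mean while leaving the Schur complement untouched, which is the decoupling the section describes. The prefix sum lets a query at any $t$ reuse the accumulated area of all earlier intervals and integrate only the final partial interval.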

B. Geometric Deformation Network

While the shearing matrix handles motion, complex scenes often require changes in the Gaussian's intrinsic shape (scaling, rotation) over time.

A lightweight Geometric Deformation Network ($F_\theta$) is introduced to predict residuals for scaling ($\Delta s$) and rotation ($\Delta q$).
Inputs: The network takes the spatio-temporal context (canonical 3D center, temporal mean, query time) and, crucially, the velocity features.
Output: It outputs residuals that update the Gaussian's scale and orientation, allowing the geometry to adapt to non-rigid deformations (e.g., muscle expansion) independently of the motion trajectory.
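
To make the input and output shapes concrete, here is a minimal NumPy sketch of such a residual predictor. Everything beyond the listed inputs and outputs is an assumption made for illustration: the two-layer ReLU MLP, the layer sizes, the 3-dimensional velocity feature, the additive way the residuals are applied, and the names `init_mlp` and `deform`. The paper's actual architecture may differ.

```python
import numpy as np


def init_mlp(rng, in_dim=8, hidden=32, out_dim=7):
    """Random weights for a hypothetical two-layer MLP standing in for F_theta."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.01, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def deform(params, center, mu_t, t, vel_feat, scale, quat):
    """Predict residuals (delta_s, delta_q) and apply them to scale and rotation."""
    x = np.concatenate([center, [mu_t, t], vel_feat])    # 3 + 2 + 3 = 8 inputs
    h = np.maximum(x @ params["W1"] + params["b1"], 0)   # ReLU hidden layer
    out = h @ params["W2"] + params["b2"]
    delta_s, delta_q = out[:3], out[3:]
    new_scale = scale + delta_s                          # assumed: additive scale residual
    new_quat = quat + delta_q                            # assumed: additive quaternion residual
    new_quat = new_quat / np.linalg.norm(new_quat)       # re-normalize to a unit quaternion
    return new_scale, new_quat


rng = np.random.default_rng(0)
params = init_mlp(rng)
s, q = deform(
    params,
    center=np.zeros(3),
    mu_t=0.5,
    t=0.7,
    vel_feat=np.array([1.0, 0.0, 0.0]),
    scale=np.ones(3),
    quat=np.array([1.0, 0.0, 0.0, 0.0]),
)
```

Note that `deform` never modifies the Gaussian center: in this sketch the trajectory is left entirely to the motion model, and the network only adjusts shape and orientation.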

C. Rendering

The final rendering combines the motion-transformed 4D Gaussians (via the shearing matrix) and the geometry-transformed attributes (via the deformation network). These are rendered using differentiable rasterization to produce the final image.

3. Key Contributions

Decoupled Framework: The authors present VeGaS as the first 4DGS framework to strictly decouple motion and geometry, resolving the artifacts caused by covariance coupling in previous methods.
Galilean Shearing for Motion: Introduces a novel motion modeling approach using a time-varying velocity parameterized by a shearing matrix, enabling flexible non-linear trajectory modeling without distorting 3D geometry.
Geometric Deformation Network: Proposes a dedicated network to model time-varying geometric attributes, enhancing expressiveness for non-rigid deformations.
State-of-the-Art Performance: Demonstrates superior performance on both synthetic and real-world datasets, achieving higher fidelity and fewer artifacts than existing SOTA methods.

4. Experimental Results

The authors evaluated VeGaS on two major benchmarks:

Neural 3D Video (Neu3DV): A multi-view real-world dataset.
- Quantitative: VeGaS achieved 32.68 PSNR and 0.98 SSIM, outperforming the previous best (4DGS at 32.01 PSNR) by 0.67 dB. It also reduced LPIPS (perceptual error) by over 10%.
- Qualitative: Visual comparisons showed VeGaS eliminated artifacts common in 4DGS, such as distorted backgrounds and blurred textures in complex scenes (e.g., flames, steak searing).
D-NeRF: A monocular synthetic dataset.
- Quantitative: VeGaS achieved 34.67 PSNR and 0.99 SSIM, surpassing all competitors including 7DGS and 4DGS.
- Qualitative: The method successfully reconstructed fine-grained details (e.g., armor ridges, mutant arm structures) that were blurred or missing in other methods.
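
As a rough guide to reading the PSNR numbers, the sketch below uses the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ to convert a gain in dB into an equivalent reduction in mean squared error. The helper names are hypothetical. Benchmark PSNR is typically averaged over many images, so the conversion is only indicative.

```python
import math


def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio(psnr_gain_db):
    """MSE(new) / MSE(old) implied by a PSNR gain in dB."""
    return 10.0 ** (-psnr_gain_db / 10.0)


gain = 32.68 - 32.01      # Neu3DV gap between VeGaS and 4DGS, in dB
ratio = mse_ratio(gain)   # fraction of the baseline MSE that remains
print(f"{gain:.2f} dB gain -> {100 * (1 - ratio):.0f}% lower MSE")
```

Read this way, the gap between the two Neu3DV PSNR values corresponds to roughly 14% lower mean squared error.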

Ablation Studies:

Removing the velocity component resulted in poor reconstruction of rigid motion trajectories.
Removing the geometric deformation network led to poor reconstruction of non-rigid deformations (e.g., flames).
The full model combining both components yielded the best results, confirming that both flexible motion and accurate geometric modeling are essential.

5. Significance

This paper addresses a critical bottleneck in dynamic scene reconstruction: the inability of current Gaussian Splatting methods to handle complex, non-linear motions and deformations simultaneously without introducing artifacts. By mathematically proving and implementing a decoupling of motion and geometry, VeGaS provides a more robust and expressive representation for 4D scenes.

The significance extends to applications requiring high-fidelity dynamic rendering, such as VR/AR, immersive gaming, and cinematic production, where realistic motion and deformation are paramount. The approach offers a new paradigm for temporal modeling in neural rendering, moving beyond the limitations of constant-velocity and time-invariant geometry assumptions.