Imagine you are holding a smartphone and recording a video of a windmill spinning in the wind. Now, imagine you want to take that video and walk around the windmill, seeing it from angles you never actually filmed. This is called Novel View Synthesis.
Doing this with a single camera (monocular video) is like trying to guess the shape of a 3D object just by looking at its shadow. It's incredibly hard because the computer doesn't know what's happening "behind" the scenes or how objects are twisting in 3D space.
This paper introduces a new method called SE3-BSplineGS (let's call it "The Smooth Motion Architect") that solves this problem much better than previous attempts. Here is how it works, explained with simple analogies.
1. The Problem: The "Stuttering" Dancers
Previous methods tried to animate 3D objects (represented as thousands of tiny, colorful clouds called Gaussians) by snapping them from one position to another.
- The Analogy: Imagine a dance troupe where the dancers are told to jump from position A to position B instantly. If the camera moves fast, the dancers look like they are glitching or teleporting. Their movements aren't smooth; they are jerky.
- The Result: When you try to look at the scene from a new angle, the image looks blurry or broken because the computer didn't understand the continuous path the object took.
2. The Solution: The "Flexible Wire" (SE(3) B-Spline)
The authors' big idea is to stop treating movement as a series of jumps and start treating it like a smooth, flexible wire.
- The Analogy: Instead of snapping the dancers, imagine they are tied to a long, invisible, flexible wire (a B-spline). You only need to move a few "control knobs" (control points) on the wire, and the whole wire bends smoothly.
- How it helps: This wire controls both where the object is (position) and which way it is facing (orientation). Because the wire is mathematically smooth, the object glides naturally through time, even if the camera is moving wildly. This prevents the "glitching" seen in older methods.
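To make the "flexible wire" concrete, here is a minimal sketch of a uniform cubic B-spline evaluated from four control knobs. This is a simplification: it splines only position, while the paper's SE(3) formulation also splines orientation on the rotation manifold (e.g., via exponential/logarithm maps). The function name and control-point values are illustrative, not from the paper.

```python
import numpy as np

def cubic_bspline(ctrl, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1].
    ctrl: 4 consecutive control points ("knobs"), shape (4, D)."""
    # Standard basis matrix for a uniform cubic B-spline
    M = np.array([[ 1,  4,  1, 0],
                  [-3,  0,  3, 0],
                  [ 3, -6,  3, 0],
                  [-1,  3, -3, 1]], dtype=float) / 6.0
    U = np.array([1.0, u, u**2, u**3])
    return (U @ M) @ ctrl

# Four control knobs defining a smooth 2D path
knobs = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 0.0]])
p = cubic_bspline(knobs, 0.5)  # a point midway along the segment
```

Because the basis functions are polynomials, the resulting trajectory is continuous up to its second derivative: moving one knob bends the wire smoothly instead of making the object "teleport".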
3. The "Smart Gardener" (Adaptive Control)
Not all parts of a video move the same way. A windmill blade spins fast and wildly, while the grass underneath barely moves.
- The Problem: If you use the same number of "control knobs" for the whole scene, you either waste energy on the still grass or don't have enough knobs for the spinning blade.
- The Solution: The method acts like a Smart Gardener.
- If a part of the scene is moving simply, the gardener prunes (removes) extra knobs to save computer power.
- If a part is moving chaotically (like the windmill), the gardener densifies (adds) more knobs to capture the complexity.
- This keeps the system fast but highly accurate where it matters.
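The "Smart Gardener" idea can be sketched as a simple rule over per-segment fitting error: prune a knob where the motion is nearly static, insert one where the spline fits poorly. The thresholds and the midpoint-insertion rule below are hypothetical toy choices, not the paper's exact criterion.

```python
import numpy as np

def adapt_control_points(ctrl, residual, prune_tol=0.01, dense_tol=0.5):
    """Toy adaptive scheme (hypothetical thresholds/rules).
    residual[i] scores how badly the segment after ctrl[i] fits the motion:
    low -> the interior knob is redundant (prune it);
    high -> the motion is too complex (densify with a midpoint knob)."""
    out = [ctrl[0]]
    for i in range(len(ctrl) - 1):
        if residual[i] > dense_tol:
            # Chaotic motion (the windmill blade): add a knob halfway
            out.append((ctrl[i] + ctrl[i + 1]) / 2.0)
            out.append(ctrl[i + 1])
        elif residual[i] < prune_tol and i + 1 < len(ctrl) - 1:
            # Nearly static (the grass): skip this interior knob
            continue
        else:
            out.append(ctrl[i + 1])
    return np.array(out)

# Two static segments get pruned; one complex segment gets densified
ctrl = np.array([[0.0], [0.0], [0.0], [5.0]])
adapted = adapt_control_points(ctrl, residual=[0.0, 0.0, 1.0])
```

The effect is that the knob budget concentrates where the motion actually needs it.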
4. The "Time-Travel Filter" (Soft Segment Reconstruction)
Sometimes, the video has long gaps between frames, or the object moves so fast that the computer gets confused about where it was a second ago.
- The Analogy: Imagine trying to guess where a runner was 10 seconds ago based on where they are now. If you guess too far back, you might be wrong.
- The Solution: The system uses Soft Segment Reconstruction. It says, "I trust the data from right now and just a moment ago the most. Data from 5 seconds ago? That's a bit fuzzy, so I'll lower my confidence in it."
- This prevents the computer from trying to force a perfect match with old, unreliable data, which stops the image from turning into a blurry mess.
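A minimal way to express "trust nearby moments more" is a weight that decays with temporal distance. The Gaussian falloff and the `sigma` parameter below are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
import numpy as np

def soft_segment_weight(t_query, t_obs, sigma=0.1):
    """Confidence in an observation at time t_obs when reconstructing
    time t_query. Decays smoothly as the time gap grows (Gaussian
    falloff is a hypothetical choice)."""
    return float(np.exp(-0.5 * ((t_query - t_obs) / sigma) ** 2))

w_now  = soft_segment_weight(0.5, 0.50)  # same moment: full trust
w_near = soft_segment_weight(0.5, 0.45)  # a moment ago: slightly less
w_far  = soft_segment_weight(0.5, 0.00)  # long ago: near zero
```

Losses weighted this way stop stale frames from dragging the reconstruction toward a blurry compromise.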
5. The "Magic Imagination" (Diffusion Prior)
The biggest challenge with a single camera is that you can't see the "back" of the object. The computer has to guess what's hidden.
- The Analogy: If you only see the front of a car, you might guess the back looks like a sedan. But what if it's actually a truck? You need a reference.
- The Solution: The authors use a Diffusion Model (the same AI technology that creates images from text) as a "Magic Imagination."
- They ask the AI: "Based on what I see here, what should the hidden parts look like?"
- The AI provides "hints" (cues) about the 3D shape, helping the system fill in the blanks without just copying the training video. This stops the system from "cheating" by just memorizing the video frames.
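One common way such "hints" enter training is as an extra loss term: a photometric loss on the filmed views plus a diffusion-prior penalty on rendered unseen views. The sketch below is a generic, hypothetical loss mix (the function names, the `lam` weight, and the `diffusion_score` interface are all assumptions), not the paper's actual objective.

```python
import numpy as np

def training_loss(rendered_obs, gt_obs, rendered_novel, diffusion_score, lam=0.1):
    """Hypothetical combined objective:
    - photometric error on views the camera actually filmed;
    - a diffusion-prior score (implausibility of an unseen view),
      supplied by a callable 'diffusion_score' standing in for the model."""
    photo = float(np.mean((rendered_obs - gt_obs) ** 2))
    prior = float(diffusion_score(rendered_novel))
    return photo + lam * prior
```

Because the prior term only judges plausibility rather than matching pixels, it guides the hidden geometry without letting the system simply memorize the training frames.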
The Result
When you put all these pieces together, the result is a high-quality, 3D movie generated from a simple phone video.
- Old methods: Look like a stop-motion animation with jerky, broken frames.
- This method: Looks like a smooth, professional 3D movie where you can walk around the windmill, and it spins naturally, even though you only filmed it from one spot.
In short, they taught the computer to stop "jumping" objects in time and start "flowing" them, while using a smart gardener to manage the workload and a magic artist to imagine the unseen parts.