Imagine you are filming a busy street scene with several cameras and streaming the video live to a computer that must rebuild the 3D world in real time. The computer needs to know not just what the objects look like, but how they move over time.
This paper introduces a new system called MoRGS (Motion Reasoning for Gaussian Splatting) to solve a specific problem: how do you make a computer understand real movement without mistaking the static background for something that moves?
Here is the breakdown using simple analogies:
The Problem: The "Chasing Shadows" Mistake
Imagine you are trying to teach a robot to dance by showing it a video.
- Old Methods (The Confused Robot): The robot sees a person walking past a tree. It doesn't have a clear idea of how the person moves. So, to make the video look right, the robot decides the tree must be moving slightly to the left to match the pixel changes. It tries to "chase the shadows" (the pixel changes) rather than understanding the actual dance.
- The Result: The 3D reconstruction looks okay for a second, but then it starts flickering and glitching because the robot is moving the wrong things (the static tree) and not moving the right things (the walking person) enough.
The Solution: MoRGS (The Smart Choreographer)
MoRGS is like hiring a smart choreographer who gives the robot three specific tools to understand the dance correctly.
1. The "Spotlight" (Sparse Optical Flow)
Instead of watching every single pixel in the video (which is too slow for live streaming), MoRGS picks a few key cameras and uses a "spotlight" (Optical Flow) to see exactly how pixels are moving in those specific views.
- Analogy: Imagine a dance instructor only watching the lead dancers' feet to figure out the rhythm, rather than trying to track every single person in the crowd. This saves time but gives a strong hint about the direction of movement.
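The idea of sparse flow can be sketched in a few lines: instead of computing motion for every pixel, track a handful of keypoints between two frames. This toy version uses simple patch matching (real systems use proper optical flow estimators); the function name and setup are illustrative, not the paper's implementation.

```python
import numpy as np

def sparse_flow(frame_a, frame_b, keypoints, patch=5, search=6):
    """Estimate 2D motion at a few keypoints by patch matching.

    A toy stand-in for sparse optical flow: for each keypoint, find the
    displacement whose patch in the next frame best matches the original.
    """
    r = patch // 2
    flows = []
    for (y, x) in keypoints:
        ref = frame_a[y - r:y + r + 1, x - r:x + r + 1]
        best, best_err = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                cand = frame_b[y + dy - r:y + dy + r + 1,
                               x + dx - r:x + dx + r + 1]
                if cand.shape != ref.shape:
                    continue  # patch fell outside the frame
                err = np.sum((cand - ref) ** 2)
                if err < best_err:
                    best_err, best = err, (dy, dx)
        flows.append(best)
    return flows

# A bright square moves down 2 pixels and right 3; the background is static.
a = np.zeros((40, 40)); a[10:15, 10:15] = 1.0
b = np.zeros((40, 40)); b[12:17, 13:18] = 1.0
print(sparse_flow(a, b, keypoints=[(12, 12)]))  # -> [(2, 3)]
```

Tracking only a few keypoints keeps the cost low enough for live streaming while still revealing the direction and magnitude of motion.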
2. The "Correction Pen" (Motion Offset Field)
Sometimes, the "spotlight" from just a few cameras isn't perfect. It might look like a dancer is moving left, but from another angle, they are actually moving forward.
- Analogy: The robot has a "Correction Pen." If the spotlight says "Move Left," but the 3D geometry says "That doesn't make sense," the robot uses the pen to tweak the movement slightly. It fixes the mistakes caused by looking at the scene from only a few angles, ensuring the movement makes sense in 3D space.
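The "correction pen" idea can be illustrated with a tiny geometric example. A single camera cannot see motion along its viewing direction, so a flow-lifted 3D motion is incomplete; a residual offset fit against a second view recovers the missing component. This is a hedged sketch with made-up orthographic cameras and a plain gradient-descent fit, not the paper's learned offset field.

```python
import numpy as np

# Two orthographic views: cam A sees the (x, y) plane, cam B sees (z, y).
P_A = np.array([[1., 0., 0.], [0., 1., 0.]])
P_B = np.array([[0., 0., 1.], [0., 1., 0.]])

def lift_flow(flow_a):
    """Naively lift cam-A 2D flow into 3D (depth motion unknown -> 0)."""
    return np.array([flow_a[0], flow_a[1], 0.0])

def corrected_motion(flow_a, flow_b, lr=0.5, steps=200):
    """Add a residual offset so the 3D motion agrees with BOTH views.

    The offset is fit by gradient descent on the reprojection error of
    the motion in each camera -- a stand-in for a learned offset field.
    """
    base = lift_flow(flow_a)
    offset = np.zeros(3)
    for _ in range(steps):
        m = base + offset
        r_a = P_A @ m - flow_a  # residual vs observed flow in cam A
        r_b = P_B @ m - flow_b  # residual vs observed flow in cam B
        grad = P_A.T @ r_a + P_B.T @ r_b
        offset -= lr * grad
    return base + offset

# The dancer steps "forward" (+z): cam A sees no motion, cam B does.
m = corrected_motion(flow_a=np.array([0., 0.]), flow_b=np.array([1., 0.]))
print(np.round(m, 3))  # -> [0. 0. 1.]
```

From cam A alone the motion would be zero; the offset reconciles the views so the recovered motion makes sense in 3D space.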
3. The "Volume Knob" (Motion Confidence)
This is the most important part. The robot needs to know: "Is this object actually moving, or is it just a static wall?"
- Analogy: Imagine the robot has a volume knob for every single tiny dot (Gaussian) in the 3D world.
- If a dot is on a static wall, the knob is turned down to zero. The robot ignores it and doesn't waste energy trying to make it move.
- If a dot is on a running person, the knob is turned up. The robot focuses all its energy on figuring out exactly how that person moves.
- Why this helps: It stops the robot from accidentally "wiggling" the background, which causes the flickering and glitches seen in older methods.
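The "volume knob" maps naturally to a per-point gate: each Gaussian carries a confidence value that scales its predicted displacement, so near-zero confidence freezes the point in place. The names and numbers below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_motion(positions, raw_motion, confidence_logits):
    """Gate each Gaussian's predicted motion by its motion confidence.

    confidence ~ 0 -> treated as static, the point barely moves;
    confidence ~ 1 -> the full predicted displacement is applied.
    """
    gate = sigmoid(confidence_logits)[:, None]  # one knob per Gaussian
    return positions + gate * raw_motion

pos = np.array([[0., 0., 0.],    # a Gaussian on a static wall
                [1., 0., 0.]])   # a Gaussian on a walking person
motion = np.array([[0.2, 0., 0.],   # spurious flow picked up on the wall
                   [0.5, 0., 0.]])  # real motion of the person
logits = np.array([-8.0, 8.0])   # learned: wall ~static, person ~moving
print(apply_motion(pos, motion, logits).round(3))
```

The wall's spurious motion is suppressed to (nearly) zero while the person's full displacement survives, which is exactly what prevents the background "wiggling" and flicker.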
The Result
By combining these three tools, MoRGS creates a 3D video that:
- Moves realistically: The people and objects move exactly as they do in real life.
- Stays stable: The background (walls, trees) stays perfectly still and doesn't jitter.
- Runs fast: Because it only focuses on the things that are actually moving, it can process the video in real-time, making it perfect for live streaming, VR, and AR.
Summary
Think of previous methods as a child trying to draw a moving car by smudging the whole picture to make it look like it's moving. MoRGS is like a professional animator who knows exactly which pixels belong to the car and which belong to the road, moving only the car while keeping the road perfectly still. This results in a much smoother, higher-quality, and faster 3D experience.