MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting

The Big Problem: One Tool Doesn't Fit All

Imagine you are trying to film a chaotic scene: a chef chopping vegetables, a steak sizzling on a grill, and a flame flickering wildly.

In the world of computer graphics, we use a technique called 3D Gaussian Splatting to recreate these scenes from video. Think of it like building a 3D model out of millions of tiny, fuzzy, colored balloons (Gaussians) that float in space. When you move your camera, the computer re-projects these balloons onto the screen to show you the scene from the new angle.

However, when things move (like the chef or the fire), the computer gets confused. The paper points out a frustrating reality:

Expert A is great at the smooth, slow movement of the chef's arm but terrible at the chaotic, fast flickering of the fire.
Expert B is amazing at the fire but makes the chef's arm look like a blurry smear.
Expert C is good at the steak but fails at the vegetables.

No single "Expert" (algorithm) can handle every part of the scene perfectly. It's like trying to use a single pair of scissors to cut paper, thread, and metal wire. You might get the job done, but the results will be messy.

The Solution: The "All-Star Team" (Mixture of Experts)

The authors, In-Hwan Jin and his team, decided to stop relying on just one expert. Instead, they built a Mixture of Experts (MoE) system.

Imagine a high-end restaurant kitchen. Instead of one chef trying to do everything, you have a team:

The Slicer: Specializes in smooth, precise cuts (good for the chef's arm).
The Fire Starter: Specializes in wild, unpredictable flames (good for the grill).
The Searer: Specializes in browning meat perfectly (good for the steak).

MoE-GS is the Head Chef (the Router). Its job isn't to cook the food; it's to decide who cooks what part of the dish.

How It Works: The Smart "Traffic Cop"

The magic of this paper lies in how the Head Chef makes decisions.

The Volume-Aware Pixel Router:
In older systems, the Head Chef might just look at the final picture and guess who should work. But that's like trying to fix a car engine by looking at the paint job.

MoE-GS uses a Volume-Aware Pixel Router. Imagine this router as a super-smart traffic cop standing inside the 3D world, not just looking at the 2D photo. It sees the "fuzzy balloons" (Gaussians) floating in 3D space.
- It sees a balloon near the fire and thinks, "Ah, this needs the Fire Starter!"
- It sees a balloon near the chef's hand and thinks, "This needs the Slicer!"
It then "splats" (projects) these decisions onto the final image. This ensures that the fire looks fiery and the hand looks smooth, all blended together seamlessly.

The "One-Pass" Trick (Efficiency):
Usually, if you have four experts, the computer has to render the scene four times (once for each expert) and then mix them. That's slow and heavy, like asking four painters to paint the same wall and then trying to blend their work.

The authors invented a Single-Pass Multi-Expert Rendering trick. They put all the "fuzzy balloons" from all four experts into one giant bucket and paint the wall once. The computer figures out which balloon belongs to which expert instantly. This makes the process much faster.

The "Pruning" Trick (Cleaning Up):
Sometimes, an expert tries to paint a part of the scene where they are useless (like the Fire Starter trying to paint the chef's hand). This creates clutter.

The system uses Gate-Aware Pruning. It's like a bouncer at a club. If a balloon (Gaussian) has almost no influence on the Head Chef's routing decisions, the bouncer kicks it out. This keeps the scene clean and the computer running fast.

The "Teacher-Student" Strategy (Distillation)

There's one catch: Running a team of four experts is still heavier than running just one. To fix this for the future, the authors use a Knowledge Distillation strategy.

Think of the MoE-GS system as a Master Teacher. It has produced the best result the team can manage by combining all the experts.

The Master Teacher then takes a Student (a single, lightweight expert model).
The Teacher says, "Look at this part of the image. I used the Fire Starter here. You try to learn how to do that part yourself."
The Student learns from the Teacher's "ghost" decisions.

Eventually, the Student becomes so good that it can recreate the high-quality result of the whole team, but it only takes up the space of a single expert. This means you get the high-quality video without needing a supercomputer to run it.

Why This Matters

Better Quality: It creates videos that look real, even when things are moving fast or chaotically.
Adaptability: It doesn't force one style of movement on the whole scene; it adapts to every tiny part of the video.
Future-Proof: By teaching the "Students" (distillation), this technology can eventually run on regular phones or laptops, not just massive servers.

In short: MoE-GS is like hiring a dream team of specialists, using a smart 3D traffic cop to assign the right work to the right person, and then teaching a single apprentice to do the whole job so well that you don't need the whole team anymore.

1. Problem Statement

Dynamic scene reconstruction aims to generate novel views of time-varying scenes from sparse observations. While 3D Gaussian Splatting (3DGS) has revolutionized static scene rendering by enabling real-time performance, extending it to dynamic scenes remains challenging. Existing dynamic 3DGS methods typically rely on a single deformation prior (e.g., MLP-based deformation, polynomial trajectories, or keyframe interpolation).

The authors identify three critical limitations in current state-of-the-art (SOTA) methods:

Scene-level Variability: No single method performs optimally across all diverse dynamic scenes.
Spatial-level Inconsistency: Within a single scene, different spatial regions (e.g., rigid objects vs. fluid motion) favor different deformation models.
Temporal Fluctuations: The optimal reconstruction method changes dynamically over time within the same video sequence.

Current approaches fail to adaptively select the best deformation model for specific spatiotemporal regions, leading to suboptimal reconstruction quality in complex real-world scenarios.

2. Methodology: MoE-GS

The paper proposes MoE-GS, the first framework to integrate Mixture-of-Experts (MoE) techniques into Dynamic Gaussian Splatting. Unlike MoE in Large Language Models (LLMs), which aims to reduce FLOPs via sparsity, MoE-GS aims to increase representational capacity by combining heterogeneous deformation priors to improve rendering fidelity.

Core Architecture

The framework operates in two stages:

Expert Training (Stage 1): Multiple specialized dynamic Gaussian models (experts) are trained independently. The paper utilizes a heterogeneous set of experts, including:
- 4DGaussians: HexPlane-based canonical deformation.
- E-D3DGS: Per-Gaussian volumetric deformation.
- STG: Polynomial trajectory modeling.
- Ex4DGS: Keyframe interpolation.
Router Training (Stage 2): With expert parameters frozen, a Volume-aware Pixel Router is trained to adaptively blend the outputs of these experts.

Key Innovation: Volume-aware Pixel Router

A naive router might assign weights at the pixel level (ignoring 3D structure) or at the Gaussian level (hard to optimize). MoE-GS introduces a novel Volume-aware Pixel Router that bridges these gaps:

Per-Gaussian Weights: Each Gaussian $G_i$ is assigned learnable weights encoding temporal and view-dependent variations ( $w_i, w_i^{dir}, w_i^{time}$ ).
Differentiable Weight Splatting: These 3D Gaussian-level weights are projected into 2D pixel space using the differentiable Gaussian rasterizer. This ensures the routing decisions are informed by volumetric geometry (depth, visibility, opacity) rather than just 2D pixel features.
Adaptive Blending: The router computes a softmax over the rasterized weights to generate per-pixel gating weights ( $G'_k$ ), which blend the rendered images from each expert:
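$I_{MoE}(u) = \sum_{k=1}^{N} G'_k(u) \cdot I_{E_k}(u)$, where $u$ is a pixel, $N$ is the number of experts, and $I_{E_k}$ is the image rendered by expert $k$.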
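The blending rule above can be illustrated for a single pixel. This is a minimal sketch, not the authors' implementation: it assumes the rasterizer has already produced one splatted routing value per expert at pixel $u$, and the function names `softmax` and `blend_pixel` and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_pixel(rasterized_weights, expert_colors):
    """Blend one pixel as I_MoE(u) = sum_k G'_k(u) * I_Ek(u).

    rasterized_weights: one splatted routing value per expert at pixel u
                        (assumed to come from the differentiable rasterizer).
    expert_colors:      one (r, g, b) tuple per expert at pixel u.
    """
    gates = softmax(rasterized_weights)  # per-pixel gating weights G'_k(u)
    blended = tuple(
        sum(g * color[c] for g, color in zip(gates, expert_colors))
        for c in range(3)
    )
    return gates, blended


# Hypothetical pixel where the router strongly prefers expert 0.
gates, color = blend_pixel(
    [2.0, 0.0, 0.0],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
)
```

Because the gates come from a softmax, they are positive and sum to one, so the blended pixel is always a convex combination of the expert pixels.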

Efficiency Strategies

Since running multiple experts increases computational cost, the authors propose two strategies to reduce rendering overhead, plus a distillation scheme for lightweight deployment:

Single-Pass Multi-Expert Rendering: Instead of rasterizing each expert separately (multi-pass), all Gaussians from all experts are merged into a single batch. A one-hot expert identity is assigned to each Gaussian, allowing projection and visibility calculations to be performed once, with expert-specific separation occurring only during alpha blending.
Gate-Aware Gaussian Pruning: The system accumulates the gradient of gating weights with respect to per-Gaussian weights. Gaussians with negligible influence on the final gating decision (low importance scores) are progressively pruned, reducing memory and rendering load without sacrificing fidelity.
Knowledge Distillation: To enable lightweight deployment, the authors train individual experts from scratch using the MoE-GS output as "pseudo-ground truth." The routing weights serve as confidence scores, guiding the distilled expert to specialize in regions where it is most reliable.
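The single-pass idea can be sketched on one camera ray. This is an illustrative toy, not the paper's rasterizer: it assumes each expert keeps its own transmittance during alpha blending so that the single-pass result matches rendering each expert separately, and the names `composite` and `single_pass` and the splat values are hypothetical.

```python
def composite(splats):
    """Reference multi-pass path: front-to-back alpha compositing of one
    expert's (depth, alpha, value) splats along a ray."""
    value, transmittance = 0.0, 1.0
    for _, alpha, v in sorted(splats, key=lambda s: s[0]):
        value += transmittance * alpha * v
        transmittance *= 1.0 - alpha
    return value


def single_pass(merged, num_experts):
    """Single-pass path: all experts' splats share one depth sort.

    merged: list of (depth, alpha, value, expert_id) tuples.
    The expert id routes each splat to its own accumulator, so experts
    are only separated at the blending step (assumption of this sketch).
    """
    values = [0.0] * num_experts
    transmittance = [1.0] * num_experts
    for _, alpha, v, k in sorted(merged, key=lambda s: s[0]):
        values[k] += transmittance[k] * alpha * v
        transmittance[k] *= 1.0 - alpha
    return values


# Hypothetical splats for two experts on the same ray.
expert_splats = {
    0: [(1.0, 0.5, 1.0), (3.0, 0.5, 0.0)],
    1: [(2.0, 0.25, 0.8)],
}
merged = [
    (d, a, v, k) for k, splats in expert_splats.items() for (d, a, v) in splats
]
per_expert = single_pass(merged, num_experts=2)
```

In this toy the saving is only the shared sort; in a real rasterizer the shared work would also include projection and visibility tests, as described above.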

3. Key Contributions

MoE-GS Framework: The first application of Mixture-of-Experts to dynamic Gaussian Splatting, enabling adaptive reconstruction across diverse spatiotemporal dynamics.
Volume-aware Pixel Router: A novel routing mechanism that projects 3D Gaussian-level decisions into pixel space via differentiable splatting, ensuring spatial and temporal coherence.
Efficiency Mechanisms: Introduction of single-pass rendering and gate-aware pruning to manage the computational cost of multi-expert inference.
Distillation Strategy: A method to transfer MoE performance to single experts, allowing for high-quality, lightweight deployment without architectural changes.
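One way to picture the distillation strategy is a confidence-weighted loss. The exact loss used in the paper is not given here, so the sketch below assumes a simple weighted L1 between student and teacher pixels, with the teacher's routing weight for that expert acting as the per-pixel confidence. The function name `distillation_loss` and all values are hypothetical.

```python
def distillation_loss(student, teacher, confidence):
    """Confidence-weighted L1 loss (an assumed form, not the paper's).

    student:    pixel values rendered by the single expert being distilled.
    teacher:    pixel values of the MoE-GS output (pseudo-ground truth).
    confidence: per-pixel routing weight the teacher gave to this expert.
    """
    total = sum(c * abs(s - t) for s, t, c in zip(student, teacher, confidence))
    norm = sum(confidence)
    return total / norm if norm > 0 else 0.0


# Hypothetical two-pixel example: the first pixel is trusted far more.
loss = distillation_loss(
    student=[0.2, 0.9],
    teacher=[0.5, 0.5],
    confidence=[0.9, 0.1],
)
```

Under this assumed form, errors in pixels the router assigned to this expert dominate the loss, which pushes the student to specialize where it is most reliable.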

4. Experimental Results

The authors evaluated MoE-GS on the N3V (Neural 3D Video) and Technicolor datasets, comparing against SOTA methods like 4DGaussians, E-D3DGS, STG, and Ex4DGS.

Quantitative Performance: MoE-GS achieved the best average PSNR on both datasets.
- On N3V, MoE-GS (N=4) achieved an average PSNR of 33.27 dB, outperforming the best single expert (E-D3DGS at 32.33 dB).
- On Technicolor, MoE-GS (N=3) achieved 34.55 dB, surpassing the best baseline (Ex4DGS at 33.45 dB).
Qualitative Analysis: Visual results show that MoE-GS effectively handles complex motions (e.g., cooking, spinning objects) where single experts fail to capture sharp boundaries or coherent flows. The router successfully assigns different regions to the most suitable experts.
Efficiency:
- Pruning: Reducing the Gaussian count by 55% via pruning maintained PSNR within 0.02 dB of the full model while nearly doubling FPS (from 44 to 83, about 1.9x).
- Distillation: Distilled experts (e.g., E-D3DGS) trained with MoE supervision improved their PSNR by ~0.8 dB over their standard training, approaching MoE performance with a single-model inference cost.
Geometry Consistency: Post-hoc fusion of the MoE experts into a single Gaussian model demonstrated superior Multi-view Depth Consistency (MDC) compared to baselines, indicating that the mixture formulation improves underlying 3D geometric coherence, not just photometric fidelity.
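The pruning step behind the efficiency figures above can be sketched as follows. This is a simplified illustration under stated assumptions: importance is taken to be the accumulated absolute gradient of the gating output with respect to each Gaussian's routing weight, and a fixed fraction of the lowest-scoring Gaussians is dropped. The names `accumulate_importance` and `prune`, the gradient values, and the 50% fraction are hypothetical.

```python
def accumulate_importance(scores, grads):
    """Add the magnitude of this step's gating gradients to the running
    per-Gaussian importance scores."""
    return [s + abs(g) for s, g in zip(scores, grads)]


def prune(scores, prune_fraction):
    """Return the indices of Gaussians to keep after dropping the
    lowest-scoring fraction."""
    n_prune = int(len(scores) * prune_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[n_prune:])


# Hypothetical gradients for four Gaussians over two training steps.
scores = [0.0, 0.0, 0.0, 0.0]
for step_grads in ([0.5, 0.0, -0.2, 0.01], [0.4, 0.0, 0.3, -0.01]):
    scores = accumulate_importance(scores, step_grads)

keep = prune(scores, prune_fraction=0.5)
```

Here Gaussians 1 and 3 barely affect the gating output and are removed, while Gaussians 0 and 2 are kept. Applying such a step repeatedly during training would give the progressive pruning described earlier.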

5. Significance

MoE-GS moves dynamic scene reconstruction away from the "one-model-fits-all" approach, acknowledging that real-world dynamics are heterogeneous. By leveraging the complementary inductive biases of different deformation models (e.g., smooth trajectories vs. independent interpolation), it reaches higher fidelity on the reported benchmarks than any of its individual experts.

The work is significant for:

AGI and Embodied AI: Providing high-fidelity, real-time dynamic scene representations crucial for agents interacting with changing environments.
Spatial Computing: Enabling immersive content creation where complex motions (fluids, deformable objects) are rendered realistically.
Methodological Advancement: Demonstrating that MoE architectures can be successfully adapted to explicit 3D representations, offering a scalable path forward for handling complex 4D data.