StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

StreamSplat is a fully feed-forward framework for real-time, online reconstruction of dynamic 3D scenes from uncalibrated video streams into 3D Gaussian Splatting representations. Through probabilistic sampling, bidirectional deformation, and adaptive Gaussian fusion, it achieves state-of-the-art quality with a 1200x speedup over traditional optimization-based methods.

Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao

Published 2026-03-04

Imagine you are holding a smartphone, walking through a busy park, and recording a video. In the video, people are running, birds are flying, and the wind is rustling the trees.

The Problem:
Most current 3D reconstruction technologies are like a slow-motion, high-end film crew. To create a 3D model of that park, they need to:

  1. Stop the video.
  2. Know exactly where the camera was for every single frame (like having a GPS on the phone).
  3. Spend hours or even days crunching numbers on a supercomputer to figure out where every leaf and person is.
  4. Only then can they show you a 3D version.

This is great for movies, but useless for a robot trying to avoid a running dog right now, or for an AR app trying to overlay a game on your living room instantly.

The Solution: StreamSplat
The paper introduces StreamSplat, a new method that acts like a super-fast, intuitive artist who can watch your video and instantly "paint" a 3D world as you record it. It doesn't need to know where the camera is (uncalibrated), it doesn't need to stop and think for hours, and it works in real-time.

Here is how it works, broken down into three simple concepts:

1. The "Fuzzy Guess" (Probabilistic Sampling)

The Analogy: Imagine trying to catch a ball in the dark. If you guess exactly where it is, you might miss because you don't know the exact distance. But if you guess a cloud of possible locations where the ball might be, you have a much better chance of catching it.

The Tech: Usually, AI tries to guess the exact 3D position of an object in a video. But without knowing the camera's settings, this is like guessing in the dark. StreamSplat doesn't guess one single spot; it guesses a "cloud" of possibilities (a probability distribution). This helps the AI avoid getting stuck in "local minima" (thinking it found the right spot when it's actually wrong) and makes the 3D model much more robust, even with messy, uncalibrated video.
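To make the "cloud of guesses" concrete, here is a minimal NumPy sketch of sampling candidate 3D points from a predicted per-pixel depth distribution. The function name, array shapes, and the Gaussian depth model are illustrative assumptions for this post, not the paper's actual architecture:

```python
import numpy as np

def sample_point_candidates(depth_mean, depth_std, ray_dirs, n_samples=8, rng=None):
    """For each pixel, draw several candidate 3D points along its viewing ray
    from a predicted Gaussian depth distribution, instead of committing to a
    single depth estimate."""
    rng = rng or np.random.default_rng(0)
    # depth_mean, depth_std: (N,) per-pixel depth stats; ray_dirs: (N, 3) unit rays
    depths = rng.normal(depth_mean[:, None], depth_std[:, None],
                        size=(depth_mean.shape[0], n_samples))
    depths = np.clip(depths, 1e-3, None)             # keep samples in front of the camera
    return depths[..., None] * ray_dirs[:, None, :]  # (N, n_samples, 3) candidate points

# Toy example: 4 pixels all looking straight down the z-axis.
rays = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
pts = sample_point_candidates(np.full(4, 2.0), np.full(4, 0.3), rays)
print(pts.shape)  # (4, 8, 3)
```

The point of the "cloud" is that a downstream loss can reward *any* good candidate, so a single bad depth guess no longer traps the model in a local minimum.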

2. The "Two-Way Time Machine" (Bidirectional Deformation)

The Analogy: Imagine watching a dance.

  • Old way: You only watch the dancer move forward. If they trip, you don't know why until it's too late.
  • StreamSplat way: It watches the dancer move forward from the last second to the current second, and it imagines moving backward from the current second to the last one.

The Tech: By looking at the motion in both directions (forward and backward), the system creates a "safety net." If the forward view gets confused by a blur or an occlusion (someone walking in front of the camera), the backward view helps correct it. This prevents errors from piling up over time, ensuring the 3D model stays stable even in long videos.
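The "safety net" idea can be sketched as follows, assuming hypothetical forward and backward deformation fields `fwd` (previous frame to current) and `bwd` (current frame to previous). The fusion rule and all names here are illustrative, not StreamSplat's actual formulation:

```python
import numpy as np

def fuse_bidirectional(x_prev, x_cur_guess, fwd, bwd, alpha=0.5):
    """Blend a forward-deformed estimate with one corrected by the backward
    field; each direction acts as a consistency check on the other."""
    x_fwd = fwd(x_prev)  # previous Gaussian positions pushed to the current time
    # If bwd maps current -> previous, a current estimate is consistent when
    # bwd(x) lands back on x_prev; correct the guess by that residual.
    residual = bwd(x_cur_guess) - x_prev
    x_corrected = x_cur_guess - residual
    return alpha * x_fwd + (1 - alpha) * x_corrected

# Toy motion: the true deformation is a +1 shift along x.
fwd = lambda x: x + np.array([1.0, 0.0, 0.0])
bwd = lambda x: x - np.array([1.0, 0.0, 0.0])
x_prev = np.zeros((5, 3))
noisy_guess = np.tile(np.array([1.0, 0.0, 0.0]) + 0.1, (5, 1))  # drifted estimate
x = fuse_bidirectional(x_prev, noisy_guess, fwd, bwd)
print(x[0])  # recovers [1. 0. 0.] despite the drift
```

In the toy example the backward pass cancels the drift exactly; in a real stream it only dampens it, which is what keeps errors from compounding over long videos.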

3. The "Living Cloud" (Adaptive Gaussian Fusion)

The Analogy: Think of the 3D world as being made of millions of tiny, glowing balloons (called Gaussians).

  • Old way: When a new person walks into the frame, the computer has to stop, delete old balloons, and painstakingly place new ones. It's messy and slow.
  • StreamSplat way: It treats the balloons like a living cloud.
    • If a balloon is part of a tree that stays still, it persists (stays alive).
    • If a balloon belongs to a bird that flies away, it fades out gracefully.
    • If a new person walks in, new balloons emerge naturally.

The Tech: Instead of rigidly matching objects frame-by-frame, StreamSplat uses a "soft" matching system. It blends the old 3D data with the new video frame seamlessly. This allows the system to handle things appearing and disappearing (like a car driving out of view) without breaking the 3D model.
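One way to picture "soft" fusion is an opacity update in which well-supported Gaussians brighten, unsupported ones fade geometrically, and only fully faded ones are pruned. Everything below (the `support` score, the blend rate, the thresholds) is a stand-in for the paper's learned matching, not its actual update rule:

```python
import numpy as np

def adaptive_fusion_step(opacity, support, lr=0.3, kill_thresh=0.05):
    """One soft fusion step: blend each Gaussian's opacity toward how well the
    new frame supports it, then prune only the nearly invisible ones.
    `support` in [0, 1] stands in for a learned soft-matching score."""
    opacity = (1 - lr) * opacity + lr * support  # exponential blend, no hard swap
    keep = opacity > kill_thresh                 # delete only once fully faded
    return opacity[keep], keep

# Toy scene: a static tree (support 1.0), a departing bird (0.0),
# and an ambiguous region (0.5), over ten video frames.
op = np.array([0.9, 0.9, 0.5])
sup = np.array([1.0, 0.0, 0.5])
for _ in range(10):
    op, keep = adaptive_fusion_step(op, sup)
    sup = sup[keep]
print(op.round(3))  # tree near 1.0, ambiguous region at 0.5; bird pruned
```

Because every change is a gradual blend rather than an add/delete decision, appearing and disappearing objects never force the model to stop and rebuild.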

Why is this a Big Deal?

  • Speed: It is 1,200 times faster than previous methods. While others take hours to process a scene, StreamSplat does it in a fraction of a second.
  • No Setup: You don't need to calibrate your camera. You can use any video, from any phone, in any lighting.
  • Real-Time: It can run on a stream. As you record, the 3D world is being built instantly.

In Summary:
StreamSplat turns a flat, 2D video stream into a living, breathing 3D world instantly. It does this by making smart "fuzzy" guesses, checking its work in both forward and backward time, and letting the 3D objects naturally fade in and out like a living cloud. This opens the door for robots to navigate dynamic worlds, for AR games to work anywhere, and for us to explore 3D versions of our daily lives without waiting for a supercomputer to finish its homework.