Imagine you are trying to build a perfect, 3D hologram of a busy street scene, but you only have a single, shaky handheld video. To make things even harder, the person filming is constantly switching between "night mode" (dark, grainy) and "day mode" (bright, washed out) as they walk.
This is the problem Mono4DGS-HDR solves. It's a new computer program that can take that messy, flickering video and turn it into a crystal-clear, high-definition 3D world where you can look around from any angle, and the lighting is perfect: you can see both the sun's blinding glare and the detail hidden in the shadows.
Here is how it works, explained with some everyday analogies:
The Problem: The "Flickering Camera"
Most 3D reconstruction tools are like a painter who needs a steady hand and consistent lighting. If you give them a video where the brightness jumps up and down wildly (alternating exposures), they get confused. They might think a shadow is a hole in the wall, or they might get dizzy trying to figure out where the camera is moving.
The Solution: A Two-Step "Rehearsal and Performance"
The authors of this paper created a system that works in two distinct stages, like a play rehearsal followed by the actual show.
Stage 1: The "Flat Rehearsal" (Orthographic Space)
Instead of trying to build the 3D world immediately, the system first creates a 2D "flat" version of the scene.
- The Analogy: Imagine looking at a movie screen where the characters are moving, but the screen itself is flat. The system ignores the camera's wobbly movement for a moment. It just focuses on making the characters (the objects) look bright and clear on this flat screen, regardless of how the camera is shaking.
- Why? By pretending the camera is a giant, perfect projector (an "orthographic" camera) that doesn't move, the computer can easily figure out the correct colors and brightness (High Dynamic Range) without getting confused by the camera's shaky path. It creates a "perfectly lit" video of the scene.
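The alternating-exposure trick can be sketched with a toy imaging model. Everything here is illustrative, not the paper's actual method: the names `tonemap` and `recover_hdr` and the simple gamma curve are my assumptions, standing in for whatever tone-mapping the real system learns. The point is just that one exposure crushes shadows, another clips highlights, and knowing the exposure lets you map back to the true brightness:

```python
import numpy as np

def tonemap(hdr, exposure, gamma=2.2):
    """Simulate an LDR camera frame: scale true HDR radiance by the
    frame's exposure, apply a gamma curve, clip to the displayable range."""
    return np.clip((hdr * exposure) ** (1.0 / gamma), 0.0, 1.0)

def recover_hdr(ldr, exposure, gamma=2.2):
    """Invert the tonemap for pixels that were NOT clipped."""
    return (ldr ** gamma) / exposure

# The same three HDR pixels seen under alternating "night" and "day" exposures.
radiance = np.array([0.05, 0.8, 4.0])      # true scene brightness
dark = tonemap(radiance, exposure=0.25)    # short exposure: shadows nearly black
bright = tonemap(radiance, exposure=4.0)   # long exposure: highlights clip to 1.0

# Each frame alone loses information (bright[1] and bright[2] are both
# clipped to 1.0), but together the two exposures cover the full range.
```

Inverting the short-exposure frame recovers the true radiance exactly, because none of its pixels were clipped; the long-exposure frame only recovers the shadow pixel. That is why a video that alternates exposures carries enough information for HDR, even though no single frame does.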
Stage 2: The "3D Performance" (World Space)
Once the system has a perfect, bright video from Stage 1, it takes that video and pops it into 3D.
- The Analogy: Now, imagine taking that flat movie and inflating it into a real 3D balloon. The system takes the "perfectly lit" video it learned in Stage 1 and uses it as a guide to build the real 3D world. Because it already knows what the scene should look like (bright and clear), it can now figure out exactly where the camera was moving and how the objects are shaped in 3D space.
- The Magic: It uses a technique called Gaussian Splatting. Think of this not as building with Lego bricks, but as painting with thousands of tiny, glowing, 3D clouds (splats). Some clouds are static (like a building), and some are moving (like a skateboarder). The system figures out the path of every single cloud.
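The "glowing clouds" idea can be shown in miniature. This is a hand-rolled 1D cartoon of Gaussian splatting, not the real renderer: actual splats live in 3D, are alpha-blended by depth, and follow learned motion trajectories, whereas here each splat is just a soft blob whose center optionally drifts linearly with time:

```python
import numpy as np

def render_row(splats, xs, t):
    """Render a one-row 'image' as a sum of Gaussian blobs.
    Each splat is (center, velocity, sigma, color, opacity);
    dynamic splats have nonzero velocity and move as t advances."""
    img = np.zeros_like(xs)
    for center, velocity, sigma, color, opacity in splats:
        mu = center + velocity * t                        # splat's center at time t
        weight = opacity * np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
        img += weight * color                             # simplified additive blend
    return img

xs = np.linspace(0.0, 10.0, 11)
splats = [
    (2.0, 0.0, 1.0, 1.0, 1.0),   # static splat (the "building")
    (3.0, 2.0, 0.5, 0.5, 1.0),   # moving splat (the "skateboarder")
]
frame0 = render_row(splats, xs, t=0.0)
frame1 = render_row(splats, xs, t=1.0)   # the moving splat has slid to x = 5
```

The static splat's contribution is identical in both frames, while the moving splat's bump of brightness travels across the row: exactly the "some clouds stand still, some have a path" picture, compressed to one dimension.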
The Secret Sauce: "Time-Traveling Consistency"
One of the biggest headaches in this task is "flickering." If you watch a reconstructed 3D video, sometimes a car might look blue in one frame and purple in the next, even though it's the same car.
- The Fix: The authors added a "Time-Traveling Consistency" rule (Temporal Luminance Regularization).
- The Analogy: Imagine a group of dancers. If one dancer suddenly changes their costume color in the middle of a routine, it looks weird. This system acts like a strict choreographer who says, "If you were red in the last second, you must be red in this second, even if the lighting changes." It forces the 3D clouds to stay consistent in color and brightness over time, so the video looks smooth and stable.
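The choreographer rule can be written as a tiny penalty term. This is a stand-in of my own, not the paper's exact formula: it measures each rendered frame's brightness (using the standard Rec. 709 luminance weights) and penalizes jumps between consecutive frames, which is the general shape of a temporal luminance regularizer:

```python
import numpy as np

def luminance(rgb):
    """Convert an RGB image to brightness using Rec. 709 luma weights."""
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

def temporal_luminance_loss(frames):
    """Mean squared brightness change between consecutive frames.
    A steady scene scores 0; a flickering one scores high."""
    lum = np.stack([luminance(f) for f in frames])
    return np.mean((lum[1:] - lum[:-1]) ** 2)

# Three 4x4 RGB frames: one steady sequence, one that flickers dark/bright/dark.
steady = [np.full((4, 4, 3), 0.5) for _ in range(3)]
flicker = [np.full((4, 4, 3), v) for v in (0.2, 0.8, 0.2)]
```

During training, adding a loss like this nudges the 3D splats toward colors and brightnesses that stay put over time, so the final video doesn't shimmer even though the input footage did.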
Why is this a big deal?
Before this, if you wanted to make a 3D HDR video, you needed:
- A bunch of cameras (not just one).
- A tripod (no shaky hands).
- Perfectly known camera positions.
Mono4DGS-HDR is the first system that says, "Give me one shaky phone video where the brightness keeps changing, and I'll build you a perfect 3D world."
The Result
When they tested it, their system was:
- Faster: It renders the video in real-time (like a video game).
- Better: It produces fewer glitches and artifacts than trying to just "fix" existing 3D tools.
- Smarter: It can handle moving people, cars, and even complex lighting changes that would confuse other methods.
In short, they taught a computer how to look at a messy, flickering video and imagine the perfect, high-definition 3D world hidden inside it.