Imagine you are trying to take a video of a busy street scene. You have a standard camera, but the lighting is tricky: the sun is blazing on some buildings (blindingly bright), while the alleyways are pitch black.
If you take a photo with your camera set for the dark alley, the sunlit buildings turn into a blank, white blob (overexposed). If you set it for the bright sun, the alley turns into a black void (underexposed). You lose all the details in both places.
HDR-NSFF is a new, super-smart computer program that solves this problem, but with a twist: it doesn't just fix a single photo; it fixes an entire moving video and lets you look at it from angles you never actually filmed.
Here is how it works, explained with some everyday analogies:
1. The Problem: The "Jigsaw Puzzle" That Doesn't Fit
Traditional methods try to fix this by taking three photos of the same scene in quick succession (one dark, one bright, one normal) and gluing them together like a jigsaw puzzle.
- The Flaw: If a car drives by or a person waves their hand, the "glue" fails. The puzzle pieces don't line up perfectly because the object moved between the three photos. This creates "ghosts" (blurry double images) and flickering colors. It's like trying to glue a puzzle together while someone is shaking the table.
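To see why the shaking table matters, here is a deliberately naive version of that gluing step in Python (a toy sketch; the weighting scheme and function names are made up for illustration, not taken from any particular paper). Because the three shots are captured at slightly different times, a moving object lands on different pixels in each one, and averaging them smears it into a ghost.

```python
import numpy as np

def naive_exposure_fusion(ldr_dark, ldr_mid, ldr_bright):
    """Toy bracketed-exposure merge: weight each image by how well
    exposed each pixel is, then average. Inputs are HxWx3 floats in [0, 1]."""
    stack = np.stack([ldr_dark, ldr_mid, ldr_bright])   # (3, H, W, 3)
    # Pixels near pure black or pure white carry little information,
    # so prefer mid-tones.
    weights = np.clip(1.0 - 2.0 * np.abs(stack - 0.5), 1e-3, None)
    return (weights * stack).sum(axis=0) / weights.sum(axis=0)

# If a car moved between the three shots, its pixels disagree across the
# stack, and the weighted average produces a semi-transparent "ghost".
```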
2. The Solution: Building a "Time-Traveling 3D Model"
Instead of gluing 2D pictures together, HDR-NSFF builds a 4D digital model of the scene (3D space plus time). Think of it like this:
- Old Way: Stacking 2D sheets of paper (frames) on top of each other.
- HDR-NSFF Way: Sculpting a living, breathing 3D statue that moves through time.
The program creates a continuous "cloud" of light and geometry. Because it understands the scene as a 3D object moving through time, it knows that a car is a car, even if the sun makes it look white in one frame and black in the next. It doesn't just stitch pixels; it understands the geometry, motion, and lighting of the scene.
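In code terms, that "living statue" is roughly a function you can ask about any point in space at any moment in time. The sketch below is a minimal stand-in written in PyTorch (the layer sizes, inputs, and outputs are assumptions for illustration, not the paper's actual architecture): a small network maps a 3D position, a time, and a viewing direction to a color, a density (how solid that point is), and a motion vector (where that point is heading).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySceneField(nn.Module):
    """Minimal 4D scene field: (position, time, view direction) ->
    (radiance, density, scene flow). Illustrative only."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 1 + 3),
        )

    def forward(self, xyz, t, view_dir):
        out = self.net(torch.cat([xyz, t, view_dir], dim=-1))
        rgb     = F.softplus(out[..., :3])   # non-negative HDR radiance
        density = F.relu(out[..., 3:4])      # how "solid" this point is
        flow    = out[..., 4:7]              # where this point moves next
        return rgb, density, flow

# Ask about one point at one instant:
field = ToySceneField()
rgb, density, flow = field(torch.rand(1, 3), torch.rand(1, 1), torch.rand(1, 3))
```

Because every query carries a time coordinate, the same network can answer "what was here a split second ago?" just as easily as "what is here now", which is what makes the statue move.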
3. The Secret Sauce: Three Magic Tricks
To make this 3D model work, the authors used three clever tricks:
A. The "Soul Tracker" (Semantic Flow)
Usually, computers track movement by looking at colors (e.g., "that red pixel moved to the left"). But in a video shot with constantly changing exposures, the recorded colors swing wildly from frame to frame. A red car might look white in a bright frame and dark red in a shadow.
- The Fix: The program ignores the color and looks at the "soul" (semantics) of the object. It uses a tool called DINO (like a super-recognizer) that knows, "That is a car," regardless of whether it's glowing white or shadowed black. It tracks the object, not the pixel color, ensuring the movement stays smooth and ghost-free.
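A rough sketch of that idea, assuming a generic feature extractor (how HDR-NSFF actually wires DINO into its losses is not shown here): instead of comparing raw RGB values between two frames, compare per-pixel feature vectors with cosine similarity, which changes far less when only the exposure changes.

```python
import torch
import torch.nn.functional as F

def semantic_match_cost(feat_a, feat_b):
    """feat_a, feat_b: (C, H, W) per-pixel feature maps from a
    self-supervised backbone such as DINO (extraction not shown).
    Returns a per-pixel dissimilarity map that stays low for the same
    object even when its brightness swings between frames."""
    a = F.normalize(feat_a, dim=0)
    b = F.normalize(feat_b, dim=0)
    cosine = (a * b).sum(dim=0)   # (H, W); 1.0 means "same thing"
    return 1.0 - cosine           # low cost where the semantics agree

# A red car that looks white in a bright frame and dark red in a shadowed
# one still produces similar features, so its matching cost stays small
# even though a plain color difference would be huge.
```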
B. The "Imagination Engine" (Generative Prior)
Sometimes the exposure is so far off that a part of the scene is completely blown out (pure white) or crushed to pure black. There is literally no information there. It's like trying to paint a picture of a face where the nose is missing.
- The Fix: The program uses a "generative prior," which is basically a creative imagination. It looks at the surrounding context and asks, "What should be in this missing spot?" It fills in the missing details with a highly educated guess that fits the rest of the scene, effectively "hallucinating" the missing details in a way that looks real.
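A minimal sketch of that filling-in step, assuming a hypothetical pretrained inpainting model (`inpaint_model` below is a placeholder, not a real API from the paper or any specific library): find the pixels that are clipped to pure black or pure white, mark them in a mask, and ask the generative model to paint something plausible there.

```python
import numpy as np

def saturation_mask(ldr, low=0.02, high=0.98):
    """Mark pixels that carry no usable information: near pure black
    (underexposed) or near pure white (overexposed). ldr is HxWx3 in [0, 1]."""
    too_dark   = (ldr <= low).all(axis=-1)
    too_bright = (ldr >= high).all(axis=-1)
    return too_dark | too_bright            # (H, W) boolean mask

def hallucinate_missing(ldr, inpaint_model):
    """inpaint_model stands in for any learned generative prior (for
    example a diffusion-based inpainter) that takes an image plus a mask
    and returns the image with the masked region filled in plausibly."""
    return inpaint_model(image=ldr, mask=saturation_mask(ldr))
```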
C. The "Universal Translator" (Tone Mapping)
The camera records the world in "Low Dynamic Range" (LDR), a limited range of brightness levels. The real world is "High Dynamic Range" (HDR), a massive range that runs from deep shadow to direct sunlight.
- The Fix: The program learns a custom "translator" (Tone Mapping) that converts the limited camera data back into the full, rich reality of the scene. It learns exactly how the camera squashed the light and reverses the process mathematically.
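A toy version of that translator, assuming a simple exposure-plus-gamma camera model (the learned tone mapping in HDR-NSFF is more flexible than a single fixed gamma; the value 2.2 and the clipping below are illustrative): going from HDR radiance to an LDR pixel multiplies by the exposure, applies a gamma curve, and clips. Reversing those steps recovers the radiance wherever the pixel was not clipped.

```python
import numpy as np

GAMMA = 2.2  # assumed camera response; a learned model would fit this instead

def hdr_to_ldr(radiance, exposure):
    """Forward camera model: scene radiance -> LDR pixel value in [0, 1]."""
    return np.clip((radiance * exposure) ** (1.0 / GAMMA), 0.0, 1.0)

def ldr_to_hdr(ldr, exposure):
    """Inverse mapping: undo the gamma curve and the exposure. Only valid
    where the pixel was not clipped (clipped pixels lost their information)."""
    return (ldr ** GAMMA) / exposure

# The same scene point shot at two different exposures maps back to
# (roughly) the same radiance, which is what lets differently exposed
# frames be fused into one consistent HDR world.
print(ldr_to_hdr(hdr_to_ldr(4.0, 0.1), 0.1))   # ~4.0, since nothing clipped
```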
4. The New Playground: The HDR-GoPro Dataset
To prove this works, the team didn't just use computer simulations. They built a real-world test lab.
- They set up nine GoPro cameras in a circle.
- They programmed them to capture frames at different exposure levels (some dark, some normal, some bright) in a rapid-fire sequence.
- This created the first-ever "HDR-GoPro Dataset," a goldmine of real-world data with moving people, cars, and changing light, which they used to train their AI.
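The exact capture settings are not spelled out here, so the schedule below is purely illustrative: nine cameras, each cycling through dark, normal, and bright exposures frame by frame, staggered so that at every instant some camera in the rig sees each brightness level.

```python
# Purely illustrative exposure schedule for a 9-camera rig; the real
# HDR-GoPro capture settings are not specified in this write-up.
EXPOSURES = ["dark", "normal", "bright"]

def exposure_for(camera_id: int, frame_id: int) -> str:
    """Stagger the dark/normal/bright cycle per camera so every frame
    index has all three brightness levels represented somewhere."""
    return EXPOSURES[(camera_id + frame_id) % len(EXPOSURES)]

for cam in range(9):
    print(cam, [exposure_for(cam, f) for f in range(6)])
```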
5. The Result: What Can You Do With It?
Once HDR-NSFF is trained on this video, you can do two amazing things:
- See the Unseen: You can generate a video of the scene from a camera angle that wasn't there (e.g., "Show me what the scene looked like from behind that tree").
- Time Travel: You can freeze time or slow down the action, and the computer will fill in the in-between moments, keeping the lighting consistent and the motion smooth.
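Both tricks boil down to asking the trained 4D model about inputs the cameras never recorded. Here is a toy rendering loop for one pixel, reusing the ToySceneField sketch from earlier (real renderers use calibrated rays and more careful volume rendering; this only shows the idea of querying an unseen viewpoint at an unseen time).

```python
import torch

def render_pixel(field, ray_origin, ray_dir, t, n_samples=64, far=5.0):
    """Very rough volume rendering of one ray at an arbitrary time t,
    using the ToySceneField sketch from earlier."""
    depths = torch.linspace(0.1, far, n_samples).unsqueeze(-1)   # (N, 1)
    points = ray_origin + depths * ray_dir                       # (N, 3)
    times  = torch.full((n_samples, 1), t)
    dirs   = ray_dir.expand(n_samples, 3)
    rgb, density, _ = field(points, times, dirs)
    # Turn densities into per-sample opacities and composite front to back.
    delta = far / n_samples
    alpha = 1.0 - torch.exp(-density * delta)                    # (N, 1)
    trans = torch.cumprod(torch.cat([torch.ones(1, 1), 1.0 - alpha[:-1]]), dim=0)
    weights = alpha * trans
    return (weights * rgb).sum(dim=0)                            # (3,) color

# "See the unseen" / "time travel": pick any ray origin and direction and
# any time t, even ones that were never filmed, and ask the model what is there.
```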
In summary: HDR-NSFF is like a time-traveling 3D sculptor. Instead of just pasting photos together, it builds a perfect, moving 3D world that understands light, motion, and objects, allowing you to see a scene exactly as it should look, free of ghosts, flickers, and missing details.