Imagine you are trying to take a perfect photo of a scene that has both a blindingly bright sun and a pitch-black shadow.
The Problem:
Your normal camera is like a human eye that can't adjust fast enough. If you set it to see the dark shadow, the sun turns into a giant, white, featureless blob (overexposure). If you set it to see the sun, the shadow turns into a black void where you can't see anything (underexposure). This is the "High Dynamic Range" (HDR) problem: the gap between the brightest and darkest parts of the scene is wider than the sensor can capture in a single shot.
The Old Solutions:
- The "Bracketing" Method: Take three photos quickly (one dark, one normal, one bright) and merge them into a single image. Downside: If anything moves (like a car or a person), the final image looks ghostly or blurry.
- The "Event Camera" Method: Instead of taking full photos, some new cameras only record changes (like a motion sensor). They are super fast and never get blinded by bright light. Downside: They don't know what the actual colors or brightness levels look like; they just know "something moved here."
- The "SVE Camera" Method: SVE stands for "Spatially Varying Exposure." This is a special camera that takes one photo but applies four different exposure levels to neighboring pixels on the same sensor at the same time. It's like having four cameras in one, but the image comes out looking like a mosaic puzzle that needs to be solved.
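For intuition, here is a minimal numpy sketch of the bracketing idea (a classic textbook recipe, not this paper's method): merge three exposures of a static scene using a triangle weight that trusts well-exposed pixels and ignores blown-out or pitch-black ones. The function name and exposure values are made up for illustration.

```python
import numpy as np

def merge_brackets(dark, normal, bright, exposures=(0.25, 1.0, 4.0)):
    """Merge three exposures of a *static* scene into one radiance map.

    Each pixel is a weighted average of the frames, where the weight
    favors mid-range values (well-exposed) and drops to zero for pixels
    that are nearly black or nearly white (badly exposed).
    """
    frames = np.stack([dark, normal, bright]).astype(np.float64)  # values in [0, 1]
    times = np.array(exposures, dtype=np.float64).reshape(-1, 1, 1)

    # Triangle weight: peaks at 0.5, falls to 0 at pure black / pure white.
    weights = 1.0 - np.abs(frames - 0.5) * 2.0

    # Divide each frame by its exposure time to estimate scene radiance,
    # then average the per-frame estimates with the per-pixel weights.
    radiance = (weights * frames / times).sum(axis=0) / (weights.sum(axis=0) + 1e-8)
    return radiance
```

The downside mentioned above falls straight out of this formula: if the scene moves between the three shots, the per-frame radiance estimates disagree at the same pixel, and averaging them produces ghosting.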
The New Solution (This Paper):
The researchers built a super-camera system that combines two different types of cameras:
- An SVE Camera (the "Detail & Color" expert).
- An Event Camera (the "Speed & Motion" expert).
They put these two cameras side-by-side (not looking through the exact same lens), creating a unique, asymmetric setup.
The Three Magic Steps
Here is how they make these two different cameras work together to create a perfect, ghost-free, high-dynamic-range image:
1. The "Handshake" (Alignment)
Because the two cameras are in different spots and look at the world from slightly different angles, their images don't line up perfectly. It's like trying to overlay two maps of the same city that were drawn by different people with different scales.
- The Fix: They use a two-step "alignment" process. First, they do a rough alignment (like putting a map on a table and sliding it until the continents roughly match). Then, they use a smart AI to do a fine-tuning (zooming in and adjusting the pixels so the streets match perfectly). They use a special "frequency filter" (think of it as a noise-canceling headphone for images) to ignore the messy parts and focus only on the sharp edges that both cameras agree on.
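The rough-alignment step can be sketched in a few lines of numpy, under two big simplifying assumptions: the misalignment is a simple integer shift (the real system also handles rotation, scale, and per-pixel warping), and the "frequency filter" is a crude box-blur high-pass. All names here are illustrative, not from the paper.

```python
import numpy as np

def high_pass(img, k=5):
    """Crude high-pass filter: subtract a box-blurred copy so that flat,
    'messy' regions cancel out and only sharp edges remain."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    blur = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            blur += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return img - blur / (k * k)

def coarse_align(ref, moving, max_shift=8):
    """Brute-force search for the integer (dy, dx) shift that best lines
    up `moving` with `ref`, matching on edges rather than raw brightness."""
    ref_e, mov_e = high_pass(ref), high_pass(moving)
    best_score, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(mov_e, (dy, dx), axis=(0, 1))
            score = float((ref_e * shifted).sum())  # edge-map correlation
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    return best_shift
```

Note that matching on the high-passed edge maps is what lets two very different cameras agree: a constant brightness or exposure offset between them is wiped out by the filter, so only the sharp structure both cameras see drives the alignment.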
2. The "Brain" (Fusion Network)
Now that the images line up, they need to be combined.
- The SVE Camera says: "I know the colors and the brightness, but I might be blurry if things moved fast."
- The Event Camera says: "I know exactly where the edges are and how fast things moved, but I don't know the colors."
- The AI Brain: Instead of just averaging them, the AI acts like a smart editor. It looks at every single pixel and asks, "Who is the boss here?"
- In a bright, sunny spot? It trusts the Event Camera because the SVE camera might be blinded.
- In a dark, quiet corner? It trusts the SVE camera because the Event camera might be too quiet to see anything.
- The Secret Sauce: They invented a "Learnable Fusion Loss." Imagine a conductor leading an orchestra. Instead of telling the violin and the drums to play at the same volume forever, the conductor listens to the music and tells the violin to get louder when the drums get quiet, and vice versa. The AI learns to do this automatically for every part of the image.
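The per-pixel "who is the boss here?" decision can be sketched with a hand-written confidence rule. In the paper these weights are learned by a network; this hard-coded version, with made-up confidence formulas and placeholder inputs, is only meant to show the mechanics of per-pixel weighting.

```python
import numpy as np

def fuse(sve_frame, event_edges, eps=1e-6):
    """Per-pixel weighted blend: trust whichever signal is reliable *here*.

    sve_frame:   intensity in [0, 1]; unreliable where saturated
                 (near pure black or pure white).
    event_edges: edge strength from the event camera; unreliable where
                 there was too little activity to trigger events.
    """
    # SVE confidence drops to 0 where the frame is blown out.
    conf_sve = 1.0 - np.abs(sve_frame - 0.5) * 2.0
    # Event confidence grows with edge activity, capped at 1.
    conf_evt = np.clip(event_edges, 0.0, 1.0)

    # Normalize so the two weights sum to 1 at every pixel --
    # the "conductor" deciding who plays louder, pixel by pixel.
    total = conf_sve + conf_evt + eps
    w_sve, w_evt = conf_sve / total, conf_evt / total
    return w_sve * sve_frame + w_evt * event_edges
```

At a saturated pixel (`sve_frame` near 1.0) the SVE weight collapses to zero and the event signal takes over; in a flat, quiet region the event confidence is near zero and the SVE frame wins, mirroring the two bullet points above.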
3. The Result
The final output is a single, crystal-clear image.
- Highlights: The sun isn't a white blob; you can see the clouds and the texture of the sky.
- Shadows: The dark corners aren't black holes; you can see the details in the shadows.
- Motion: Fast-moving cars or people are sharp, with no "ghosting" or blurring.
Why is this a big deal?
Think of it like cooking a perfect stew.
- The SVE camera provides the rich broth (the base flavor and color).
- The Event camera provides the fresh herbs and spices (the sharp details and timing).
- Old methods just dumped them in a pot and stirred.
- This new system is like a Master Chef who tastes the stew as it cooks, adding more spice when the broth is too mild, or more broth when the spices are too strong, ensuring every bite is perfect.
Real-World Use
This technology is huge for:
- Self-driving cars: Seeing a dark tunnel exit into bright sunlight instantly without getting "blinded."
- Robotics: Helping robots navigate fast-moving, chaotic environments.
- Scientific imaging: Capturing explosions or high-speed machinery without losing detail.
In short, this paper teaches two very different cameras how to hold hands, work out their differences, and see the world together exactly as it is—bright, dark, fast, and slow, all at once.