gQIR: Generative Quanta Image Reconstruction

This paper presents gQIR, a novel approach that adapts large text-to-image latent diffusion models to reconstruct high-quality, photometrically faithful images from sparse, noisy, binary photon detections in burst-mode SPAD sensing, significantly outperforming existing methods in extreme photon-limited conditions.

Aryan Garg, Sizhuo Ma, Mohit Gupta

Published 2026-02-25

Imagine trying to take a clear, high-definition photo of a speeding race car, but you are only allowed to catch one or two photons (tiny particles of light) hitting your camera sensor for the entire picture.

In the real world, this is what happens with SPAD cameras. These are super-sensitive sensors used for ultra-high-speed photography (like capturing a bullet breaking glass or a jet engine spinning). The problem is that the raw data they produce is incredibly messy. It's like trying to assemble a 1,000-piece puzzle where 99% of the pieces are missing, and the ones you have are covered in static noise.

The paper introduces gQIR (Generative Quanta Image Reconstruction), a new AI method that acts like a "super-photographer" to fix these broken images. Here is how it works, broken down into simple concepts:

1. The Problem: The "Starving" Camera

Think of a normal camera as a bucket catching rain. If it rains hard, you get a full bucket (a clear photo). A SPAD camera is like a tiny thimble in a drought. It only catches a few drops.

  • The Result: The raw image looks like sparse, black-and-white static. Each pixel is binary: either a photon hit it during the exposure, or it didn't.
  • The Challenge: To make a picture, you have to take a "burst" of thousands of these tiny, noisy snapshots and stitch them together. But because the objects are moving so fast, the frames don't line up, and there isn't enough light in any single frame to guess what the picture should look like.
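The "raindrops in a thimble" picture maps onto the standard quanta-imaging model: with Poisson photon arrivals, a binary SPAD pixel fires with probability 1 − e^(−flux) per exposure, and averaging many frames lets you invert that curve to recover the brightness. A minimal NumPy sketch (the flux values and frame count here are made-up illustrations, not numbers from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel photon flux (expected photons per exposure).
flux = np.array([[0.05, 0.5],
                 [1.0, 3.0]])

# A SPAD pixel fires (reads 1) if at least one photon arrives during the
# exposure. With Poisson arrivals, that happens with probability 1 - exp(-flux).
p_hit = 1.0 - np.exp(-flux)

# A "burst" of independent one-bit frames: mostly zeros for dim pixels.
n_frames = 2000
frames = rng.random((n_frames, *flux.shape)) < p_hit

# For a *static* scene, averaging the burst estimates the hit probability,
# and inverting the exponential recovers the flux (the maximum-likelihood
# estimate). Moving scenes are where this simple recipe breaks down.
mean_hits = frames.mean(axis=0)
flux_est = -np.log(1.0 - mean_hits)

print(np.round(flux_est, 2))  # compare with the true flux array above
```

Note how a pixel with flux 0.05 fires in only about 5% of frames: any single frame is almost pure noise, and only the statistics of the whole burst carry the image.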

2. The Solution: The "Artistic Detective" (gQIR)

Instead of just trying to clean up the noise mathematically (like a standard photo editor), gQIR uses a Generative AI (specifically, a model trained on millions of internet photos) to "hallucinate" the missing details.

Think of it like this:

  • Old Method: You have a blurry, dark photo of a dog. You try to sharpen the pixels. It stays blurry.
  • gQIR Method: You show the AI the blurry, dark photo and say, "I know this is a dog, even though I can barely see it." The AI says, "Okay, I know what dogs look like. I will fill in the fur, the eyes, and the nose based on my memory of millions of dogs, while keeping the shape you gave me."

3. The Three-Step Process (The Pipeline)

The authors built a three-stage factory to turn this mess into a masterpiece:

  • Stage 1: The "Translator" (VAE Alignment)
    The AI first learns to speak the language of the SPAD camera. It takes the messy, binary "dots" and translates them into a clean, internal representation. It's like training a translator to turn "static noise" into plain "English." They do this carefully so the AI doesn't forget what it learned about real images (a problem called "catastrophic forgetting").

  • Stage 2: The "Artistic Enhancer" (Perceptual Boost)
    Now that the image is clean but maybe a bit flat, this stage uses the AI's "imagination." It adds back the sharp edges, textures, and colors that the camera missed. It's like a painter looking at a sketch and adding the final, vibrant brushstrokes to make it look real. This happens in a single step, making it fast.

  • Stage 3: The "Time-Traveler" (Burst Fusion)
    This is the magic for moving objects. The camera records a rapid-fire burst of frames.

    • The Problem: If you just average them, a moving car looks like a smear.
    • The Fix: gQIR uses a special "Transformer" (a type of AI brain) to look at all the frames, figure out exactly how the object moved, and merge them perfectly. It's like a director taking 100 different takes of a scene and splicing them together to create one perfect, smooth shot where the car is sharp and the background is stable.
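The "average vs. align" contrast can be shown with a toy 1-D example: a bright spot moving across binary frames smears under naive averaging, but stays sharp if each frame is shifted back into place before merging. gQIR's Transformer learns the motion from the data itself; this sketch simply assumes the motion is known, and every number in it is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
width, n_frames = 16, 300

# Toy scene: one bright spot on a dim background, moving one pixel
# every 100 frames.
def flux_at(t):
    scene = np.full(width, 0.02)   # dim background
    scene[4 + t // 100] = 3.0      # bright moving spot
    return scene

# Simulate the binary SPAD burst: each pixel fires with prob 1 - exp(-flux).
frames = np.stack([
    rng.random(width) < 1.0 - np.exp(-flux_at(t)) for t in range(n_frames)
]).astype(float)

# Naive fusion: average everything, then invert. The spot's energy is
# spread over three pixels -- a smear.
naive = -np.log(1.0 - np.clip(frames.mean(axis=0), 0.0, 0.999))

# Motion-aware fusion: shift each frame back to the spot's starting
# position before averaging. Here the shift is known; gQIR learns it.
aligned = np.stack([np.roll(frames[t], -(t // 100)) for t in range(n_frames)])
sharp = -np.log(1.0 - np.clip(aligned.mean(axis=0), 0.0, 0.999))

print("naive peak:", round(naive.max(), 2), " aligned peak:", round(sharp.max(), 2))
```

Running this, the aligned reconstruction concentrates the spot back into a single bright pixel, while the naive average dilutes it, which is exactly the smear-versus-sharp trade-off described above.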

4. Why This Matters

  • Speed: It can reconstruct images from cameras shooting at 100,000 frames per second. That's fast enough to see a bullet in mid-air or a balloon popping.
  • Color: Previous methods could only produce grayscale images. This one handles color, inferring the missing red, green, and blue values from the sparse detections.
  • Real World: They tested it on real, extreme scenarios (like a tank firing or a propane explosion) and it worked better than any previous method, producing photos that look like they were taken with a normal, expensive camera.

The Bottom Line

gQIR is like giving a blindfolded artist a few scattered clues and saying, "Draw a realistic picture of this scene." By combining the raw data from super-sensitive sensors with the "common sense" of a massive AI trained on the internet, it can reconstruct beautiful, high-speed, full-color images from almost nothing. It turns "noise" into "art."
