Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark

This paper addresses the limitations of existing image-based spectral reconstruction methods by introducing the first high-quality dynamic hyperspectral dataset (DynaSpec), a novel Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT) model that leverages spatiotemporal feature propagation for superior video-level reconstruction, and a comprehensive benchmark for both simulation and real-world evaluation.

Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao

Published 2026-03-03

Imagine you are trying to take a high-definition, 3D movie of a scene, but instead of a normal camera, you have a camera that only sees in black and white and takes a "smeared" photo. This is the challenge of Spectral Compressive Imaging (SCI).

Normally, cameras capture light as Red, Green, and Blue (RGB). But scientists want to capture the full "rainbow" of light (hundreds of colors) to see things like chemical compositions, hidden materials, or precise health markers. The problem is, capturing all that data usually requires slow, bulky equipment that can't film moving objects.

This paper introduces a new way to film these "rainbow movies" quickly and clearly, even when the camera is taking "smeared" snapshots. Here is the breakdown using simple analogies:

1. The Problem: The "Puzzle" and the "Flickering"

Think of the camera's job like trying to solve a giant jigsaw puzzle where someone has thrown away half the pieces and mixed the rest up.

  • The Smear (Encoding): The camera uses a special mask (like a stencil) to mix the colors together before taking a picture. This saves space but hides the original details.
  • The Old Way (Image-by-Image): Previous methods tried to solve this puzzle one photo at a time.
    • Flaw 1: If a piece is missing in one photo, the computer has to guess. It often guesses wrong, creating blurry or "hallucinated" details.
    • Flaw 2: Because each photo is solved separately, the movie looks jittery. One frame might be clear, the next blurry, and the one after that sharp again. It's like a movie where the actors flicker in and out of existence.
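The "smear" step can be sketched in a few lines of NumPy. This is a deliberately tiny toy version of a single-disperser CASSI measurement, not the paper's actual optical model: the scene size, band count, mask, and one-pixel shift step are all made up for illustration. The key idea is that every spectral band is stenciled by the mask, slid sideways by a band-dependent amount, and summed into one 2D snapshot.

```python
import numpy as np

def sd_cassi_measure(cube, mask, step=1):
    """Toy SD-CASSI forward model: mask each spectral band, shift it
    by a band-dependent amount, then sum everything into one snapshot."""
    H, W, B = cube.shape
    out = np.zeros((H, W + step * (B - 1)))   # output is wider: bands smear sideways
    for b in range(B):
        coded = cube[:, :, b] * mask          # the stencil blocks part of the light
        out[:, b * step : b * step + W] += coded  # the disperser shifts this band
    return out

# tiny made-up example: a 4x4 scene with 3 spectral bands
rng = np.random.default_rng(0)
cube = rng.random((4, 4, 3))
mask = (rng.random((4, 4)) > 0.5).astype(float)
snapshot = sd_cassi_measure(cube, mask)
print(snapshot.shape)  # (4, 6): many bands collapsed into one smeared 2D image
```

Note how the many-band 3D cube collapses into a single wider 2D image: that loss of dimensionality is exactly why reconstruction has to "un-mix" the puzzle.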

2. The Solution: The "Teamwork" Approach

The authors realized that in a video, the frames are connected. If a piece is missing in Frame 1, it might be visible in Frame 2 or Frame 3.

  • The Analogy: Imagine trying to read a book where some words are crossed out. If you look at just one page, you might miss the meaning. But if you look at the previous and next pages, you can figure out the missing words because the story flows continuously.
  • The New Method: Instead of solving each frame alone, their new system (called PG-SVRT) looks at the whole sequence of frames together. It uses the clear parts of one frame to "propagate" (share) information to fix the blurry parts of the next.
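The "borrow from neighboring frames" intuition can be illustrated with a toy filler, which is a drastic simplification of what PG-SVRT actually does (the real model propagates learned features, not raw pixels). Here the frames, masks, and the nearest-frame search strategy are all invented for illustration:

```python
import numpy as np

def propagate_fill(frames, masks):
    """Toy temporal propagation: wherever a frame's mask marks a pixel
    as unobserved, borrow the value from the nearest frame (searching
    backward and forward in time) where that pixel was observed."""
    T = len(frames)
    out = [f.copy() for f in frames]
    for t in range(T):
        missing = ~masks[t]
        for offset in range(1, T):                 # search outward in time
            for s in (t - offset, t + offset):
                if 0 <= s < T:
                    usable = missing & masks[s]    # gaps this neighbor can fill
                    out[t][usable] = frames[s][usable]
                    missing &= ~usable
            if not missing.any():
                break
    return out

# three 2x2 frames; pixel (0, 1) is unobserved in frame 0
frames = [np.full((2, 2), float(t)) for t in range(3)]
masks = [np.array([[True, False], [True, True]]),
         np.ones((2, 2), dtype=bool),
         np.ones((2, 2), dtype=bool)]
filled = propagate_fill(frames, masks)
print(filled[0])  # the missing pixel is filled in from frame 1
```

Just as with the crossed-out words in the book, a gap in one frame is resolved by the nearest moment in time where that spot was visible.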

3. The Three Key Ingredients

A. The New Library: DynaSpec

To teach a computer to do this, you need good training data. Existing datasets were like "stills" cut from a video, which didn't have real movement.

  • The Analogy: The authors built a new library called DynaSpec. Instead of just showing the computer static pictures, they filmed 30 real-life scenes with moving objects (like a spinning toy or a waving hand) using a super-precise scanner. This gave the AI a "gym" to practice on real-world motion.

B. The Smart Architect: PG-SVRT

This is the new computer brain they built. It has three special tools:

  1. The Decoder (MGDP): It understands exactly how the camera "smears" the image. It's like knowing the specific rules of how the puzzle pieces were mixed up, so it knows how to un-mix them.
  2. The Messenger (CDPA): This is the most important part. It acts like a relay team. It looks at the current frame, grabs the clear details from the previous and next frames, and passes them along to fill in the gaps. It does this efficiently so the computer doesn't get overwhelmed.
  3. The Specialist (MDFFN): It separates the job of fixing "space" (the shape of objects) from fixing "time" (the movement). It handles them separately but then combines them perfectly, ensuring the object looks right and moves smoothly.
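The "messenger" idea behind CDPA can be sketched as a bare-bones cross-frame attention step. This is a generic attention sketch, not the paper's actual module: the feature shapes, the residual fusion, and the choice of plain softmax attention are assumptions for illustration. The current frame's features act as queries, while the neighboring frames supply keys and values, so clear details flow into the current frame.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(curr, prev, nxt):
    """Minimal 'messenger' sketch: current-frame tokens query tokens
    from the previous and next frames, then fuse what they retrieve."""
    q = curr                                        # queries from the current frame
    kv = np.concatenate([prev, nxt], axis=0)        # neighbors supply context
    attn = softmax(q @ kv.T / np.sqrt(q.shape[1]))  # relevance of each neighbor token
    return curr + attn @ kv                         # propagate details, residual fuse

# made-up shapes: 4 feature tokens of dimension 8 per frame
rng = np.random.default_rng(1)
curr, prev, nxt = (rng.standard_normal((4, 8)) for _ in range(3))
fused = cross_frame_attention(curr, prev, nxt)
print(fused.shape)  # (4, 8): same shape as the current frame's features
```

The efficiency claim in the text comes from restricting this exchange to a small neighborhood of frames rather than attending over the whole video at once.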

C. The Best Camera Setup: DD-CASSI

The authors tested four different camera designs to see which one worked best with their new brain.

  • The Result: They found that a specific design called DD-CASSI (Dual-Disperser) was the winner.
  • The Analogy: Imagine trying to read a book through a foggy window. Some windows are simply foggy (SD-CASSI), but DD-CASSI is like a window with a special filter that spreads the fog out evenly, making the text underneath much easier to recover. This design provided the clearest "smeared" images to start with.
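The structural difference can be shown in a toy counterpart to the SD-CASSI sketch above. Again this is an illustration, not the paper's optics: in a dual-disperser design the first disperser shifts each band, the mask codes it, and the second disperser shifts it back, so each band is effectively modulated by a shifted copy of the mask while the snapshot stays spatially aligned with the scene.

```python
import numpy as np

def dd_cassi_measure(cube, mask, step=1):
    """Toy DD-CASSI forward model: each band sees a band-shifted copy
    of the mask, but the final snapshot keeps the scene's original size."""
    H, W, B = cube.shape
    out = np.zeros((H, W))
    for b in range(B):
        shifted_mask = np.roll(mask, b * step, axis=1)  # shift the mask, not the image
        out += cube[:, :, b] * shifted_mask             # bands stay aligned when summed
    return out

# same made-up 4x4 scene with 3 bands as before
rng = np.random.default_rng(0)
cube = rng.random((4, 4, 3))
mask = (rng.random((4, 4)) > 0.5).astype(float)
snapshot = dd_cassi_measure(cube, mask)
print(snapshot.shape)  # (4, 4): same size as the scene, no sideways smear
```

Keeping the measurement spatially aligned with the scene is one intuitive reason an evenly "spread fog" is easier for a reconstruction network to undo.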

4. The Results: A Smooth, Crystal-Clear Movie

When they tested their system:

  • Quality: The reconstructed videos were incredibly sharp (over 41 dB PSNR, a standard fidelity metric where higher is better and anything above 40 dB is considered very high).
  • Fidelity: The colors were accurate, meaning if you looked at a leaf, it would show the exact chemical signature of a healthy leaf, not a fake one.
  • Smoothness: The video didn't flicker. The motion was fluid, just like a normal movie.
  • Efficiency: Despite doing all this complex math, the system was surprisingly lightweight, requiring less computing power than some older, simpler methods.
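For context on the "41 dB" figure, this is the standard peak signal-to-noise ratio (PSNR) formula; the reference signal and noise level below are made up purely to demonstrate the metric, and have nothing to do with the paper's results:

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB: higher means the reconstruction
    is closer to the ground truth."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

# synthetic example: a clean ramp signal plus a small perturbation
clean = np.linspace(0.0, 1.0, 100)
noisy = clean + 0.005 * np.sin(np.arange(100))
print(psnr(clean, noisy))  # a small perturbation already lands near ~49 dB
```

Because PSNR is logarithmic, each extra 3 dB roughly halves the remaining error energy, so sustaining 41 dB across a whole moving sequence is a demanding target.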

Summary

In short, the authors built a new dataset (a gym for AI), a new camera setup (the best lens), and a new AI brain (the teamwork solver). Together, they allow us to take fast, high-quality "rainbow movies" of moving objects, solving the mystery of missing information by using the context of the surrounding moments. This opens the door for better autonomous driving, medical imaging, and environmental monitoring.