Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

Imagine you are trying to watch a live stream of a soccer game, but the connection is bad, and the video is blurry and pixelated. You want to see the players clearly, but you can't wait for the whole game to finish downloading to fix it; you need the picture to get better right now, as the game is happening. This is the challenge of Online Video Super-Resolution (VSR).

The paper introduces a new AI model called TS-Mamba to tackle this problem. Here is how it works, explained with some everyday analogies.

The Problem: The "One-Neighbor" Limit

Most existing video enhancers are like a person trying to fix a blurry photo by only looking at the one picture immediately before it.

The Analogy: Imagine you are trying to guess what a person in a crowd is doing. If you only look at the person standing right next to them, you might miss the fact that they are waving at someone three rows back.
The Issue: Old methods only look at the "immediate neighbor" frame. They miss the long-term context (like a player running from the other side of the field), which makes the final image look a bit shaky or incomplete.

The Solution: TS-Mamba (The "Trajectory Detective")

The authors created a new system called TS-Mamba. Think of it as a super-smart detective that doesn't just look at the person next to you, but tracks the entire path the person has taken.

Here are the three main tricks TS-Mamba uses:

1. Drawing the "Path" (Trajectory Awareness)

Instead of just grabbing the frame before the current one, TS-Mamba draws invisible lines (trajectories) across the video to see where objects have been moving over time.

The Analogy: Imagine a game of "Connect the Dots." Instead of just looking at the dot right next to the current one, the AI draws a line back through the last 15 dots to see the full curve of the movement.
The Result: It finds the most similar "pieces" (tokens) from the past that match the current moment, even if they were far away in time. It picks the best clues to reconstruct the picture.

2. The "Shifted" Scanner (Fixing the Broken Puzzle)

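The AI uses a technology called Mamba (a type of State Space Model), which is fast and efficient. However, Mamba reads its input as a one-dimensional sequence, so a 2D image first has to be unrolled into a single line. TS-Mamba does this by following a path that winds through the grid like a snake (called a Hilbert scan).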

The Problem: No matter how the snake winds, some pixels that sit side by side in the picture end up far apart along its path. It's like reading a book where a sentence is cut in half and the second half is on a different page. This causes "spatial discontinuity" (the image looks a bit chopped up).
The Fix: The authors invented "Shifted SSMs." Imagine you are reading that book, but every time the sentence breaks, you shift the page slightly before reading the next part. This ensures the sentence flows smoothly.
The Analogy: It's like a construction crew that notices a gap in a brick wall. Instead of just laying more bricks, they slide the whole row over slightly to fill the gap perfectly, making the wall solid and continuous.

3. The "Smart Loss" (The Teacher's Red Pen)

To make sure the AI draws the paths correctly, the authors created a special "loss function" (a way to grade the AI's homework).

The Analogy: Usually, teachers only grade the final essay. Here, the teacher also grades the outline the student drew before writing. If the outline (the trajectory) is wrong, the essay (the video frame) will be messy. This forces the AI to learn how to track movement accurately from the very beginning.

Why is this a Big Deal?

Speed vs. Quality: Usually, you have to choose between a fast video (low quality) or a high-quality video (slow, laggy). TS-Mamba is like a Formula 1 car that also gets 100 miles per gallon. It runs in real time while delivering the best reported quality among online methods.
Efficiency: It needs 22.7% fewer computations (multiply-accumulate operations) than other state-of-the-art online methods. A lower compute cost makes it a better fit for real-time use on hardware with limited processing power.

The Bottom Line

TS-Mamba is a new way to make live video look crystal clear. It does this by:

Tracking movement over many past frames (not just the previous one).
Smoothing out the reading process so the image doesn't look chopped up.
Doing it all very quickly so you don't have to wait.

It's a major step forward for live streaming, video calls, and watching sports, ensuring that even with a shaky internet connection, you still get a sharp, clear picture.

1. Problem Statement

Online Video Super-Resolution (VSR) aims to restore high-resolution (HR) video frames in real time using only the current low-resolution (LR) frame and previous frames. This is critical for applications like live streaming and video conferencing.

Key Challenges:

Temporal Limitations: Most existing online VSR methods rely on Convolutional Neural Networks (CNNs) and typically utilize only a single previous frame for temporal alignment. This restricts their ability to model long-range temporal dependencies, limiting reconstruction quality.
Computational Complexity: While methods using long-range modeling (e.g., Transformers, Diffusion models) or bidirectional propagation offer better quality, they suffer from high computational complexity and latency, making them unsuitable for real-time online applications.
Spatial Continuity in SSMs: Recently, State Space Models (SSMs), specifically Mamba, have been introduced for vision tasks due to their linear computational complexity and global receptive field. However, standard Mamba implementations convert 2D images into 1D token sequences via scanning (e.g., Hilbert scanning), which places some pixels that are adjacent in 2D far apart in the 1D sequence. This loss of local spatial continuity degrades performance in image/video restoration.
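
To make the "linear computational complexity" point concrete, here is a minimal sketch of a discrete linear state space recurrence in NumPy. It is not Mamba itself: Mamba makes the parameters input-dependent (selective) and uses a hardware-aware parallel scan, whereas this sketch assumes fixed, diagonal parameters. The function name ssm_scan and the parameter names a, b, c are illustrative only.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c . h_t over a 1D sequence.

    x: (L,) input sequence; a, b, c: (N,) diagonal state parameters.
    One pass over the sequence, so the cost grows linearly with L.
    """
    h = np.zeros_like(a)              # hidden state, carried across tokens
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t           # state update
        y[t] = c @ h                  # readout
    return y

# An impulse at t=0 decays geometrically through the state.
a = np.array([0.5])
b = np.array([1.0])
c = np.array([1.0])
y = ssm_scan(np.array([1.0, 0.0, 0.0]), a, b, c)
```

Because each output depends on the running state rather than on every earlier token separately, the order in which 2D pixels are fed into the sequence matters, which is why the scanning pattern becomes a design concern.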

2. Methodology: TS-Mamba

The authors propose TS-Mamba, a novel online VSR framework that combines long-term trajectory modeling with low-complexity Mamba to achieve efficient spatio-temporal aggregation.

A. Trajectory Construction and Token Selection

Instead of using all previous frames or a fixed window, TS-Mamba constructs trajectories within the video to identify the most relevant information.

Token Generation: The current LR frame and previous frames are processed to generate feature tokens.
Trajectory Formulation: Trajectories are defined as sequences of coordinates across frames.
Similar Token Selection: For the current frame's tokens, the model selects the $s$ most similar tokens from previous frames along these trajectories using cosine similarity. This allows the model to focus on long-range temporal information without processing irrelevant frames.
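
The selection step can be sketched as a top-s lookup by cosine similarity. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes the trajectory of a token has already been resolved into one candidate feature vector per previous frame, and the names select_similar_tokens, query, and candidates are hypothetical.

```python
import numpy as np

def select_similar_tokens(query, candidates, s=3):
    """Pick the s candidates most cosine-similar to the query token.

    query:      (C,)   feature of one token in the current frame
    candidates: (T, C) features found along its trajectory, one per past frame
    Returns (indices, similarities), best match first.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                       # (T,) cosine similarities
    top = np.argsort(-sims)[:s]        # indices of the s largest values
    return top, sims[top]

rng = np.random.default_rng(0)
query = rng.normal(size=16)
candidates = rng.normal(size=(10, 16))
candidates[7] = 2.0 * query            # same direction as the query, different scale

idx, scores = select_similar_tokens(query, candidates, s=3)
```

Cosine similarity ignores feature magnitude, so candidate 7 ranks first even though it is scaled. Only the s selected tokens per position are passed on for aggregation, which keeps the cost independent of how many past frames the trajectory spans.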

B. Trajectory-Aware Shifted Mamba Aggregation (TSMA)

The core innovation is the TSMA module, designed to aggregate the selected tokens while mitigating the spatial discontinuity inherent in SSM scanning.

The Problem: Standard Hilbert scanning creates "intra-window" and "inter-window" discontinuities where adjacent pixels in the 2D space are far apart in the 1D token sequence.
The Solution ("Scan-Shift-Scan"): The authors propose a specific processing flow:
1. Scan: Perform an initial Hilbert scan.
2. Shift: Apply specific shift operations (e.g., Up, Left, Right, Down by specific positions) to the local windows.
3. Scan: Perform a second Hilbert scan on the shifted windows.
Mechanism: By combining four types of Hilbert scans with specific shift operations (e.g., $P(1, U(1), 3)$ ), the model compensates for the spatial continuity lost in the first scan.
Architecture: The TSMA module applies the "Scan-Shift-Scan" flow in two parallel branches:
- Intra-window Compensation Branch (IntraWCB): Uses shifts like $U(1)$ to fix local discontinuities.
- Inter-window Compensation Branch (InterWCB): Uses shifts like $UL(3)$ to fix gaps between windows.
- These branches are combined with a standard SSM block and a Deformable Attention Block (DAB) to ensure robust feature aggregation.
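
The effect of the "Scan-Shift-Scan" flow can be illustrated on a single 4x4 window. The sketch below uses the standard Hilbert index-to-coordinate construction and, as an assumption, a cyclic one-row shift via np.roll as a stand-in for the paper's shift operations, whose exact definitions are not reproduced here. Token ids stand in for features.

```python
import numpy as np

def hilbert_order(n):
    """(row, col) visiting order of a Hilbert curve on an n x n grid (n a power of two)."""
    order = []
    for d in range(n * n):
        r = c = 0
        t, s = d, 1
        while s < n:
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                       # rotate/flip the current quadrant
                if rx == 1:
                    r, c = s - 1 - r, s - 1 - c
                r, c = c, r
            r += s * rx
            c += s * ry
            t //= 4
            s *= 2
        order.append((r, c))
    return order

def scan(window, order):
    """Read a 2D window into a 1D sequence along the given order."""
    rows, cols = zip(*order)
    return window[list(rows), list(cols)]

n = 4
order = hilbert_order(n)
window = np.arange(n * n).reshape(n, n)       # token ids

first = scan(window, order)                   # Scan
shifted = np.roll(window, 1, axis=0)          # Shift (assumed: cyclic, one row)
second = scan(shifted, order)                 # Scan again

# Position of every token id in each 1D sequence (inverse permutation).
pos_first = np.argsort(first)
pos_second = np.argsort(second)

# Two vertically adjacent tokens that the first scan separates.
a, b = window[1, 0], window[2, 0]
gap_first = abs(pos_first[a] - pos_first[b])
gap_second = abs(pos_second[a] - pos_second[b])
```

Consecutive tokens in a Hilbert scan are always 2D neighbors, but the reverse does not hold: tokens a and b touch in the window yet sit far apart in the first sequence. After the shift, the second scan brings them close together, so a model that sees both sequences recovers the adjacency the first scan lost.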

C. Selective Scanning (SS3D)

The model employs Spatial Hilbert-based Selective Scanning along the Temporal dimension (SS3D). This converts spatio-temporal neighboring pixels into a 1D token sequence, allowing the Mamba block to capture long-term spatio-temporal characteristics while preserving local spatial information.
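
One plausible reading of this scan is sketched below: walk the spatial positions in Hilbert order and, at each position, emit the tokens of all frames before moving on. This ordering is an assumption made for illustration; the paper may interleave space and time differently. The function name scan_spatiotemporal is hypothetical, and the 2x2 Hilbert order is hard-coded to keep the example short.

```python
import numpy as np

def scan_spatiotemporal(volume, spatial_order):
    """Flatten a (T, H, W, C) feature volume into a (N*T, C) token sequence.

    spatial_order: list of (row, col) positions, e.g. a Hilbert order.
    Tokens sharing a spatial position across frames end up consecutive.
    """
    rows, cols = zip(*spatial_order)
    tokens = volume[:, list(rows), list(cols)]         # (T, N, C)
    return tokens.transpose(1, 0, 2).reshape(-1, volume.shape[-1])

# Toy volume whose values encode their own coordinates: 100*t + 10*row + col.
t, r, c = np.meshgrid(np.arange(3), np.arange(2), np.arange(2), indexing="ij")
volume = (100 * t + 10 * r + c)[..., None]             # shape (3, 2, 2, 1)

hilbert_2x2 = [(0, 0), (0, 1), (1, 1), (1, 0)]         # Hilbert order on a 2x2 grid
seq = scan_spatiotemporal(volume, hilbert_2x2)
```

In the resulting sequence, the three frames at position (0, 0) come first, followed by the three frames at (0, 1), and so on, so both temporal neighbors and Hilbert-adjacent spatial neighbors stay close in the 1D order.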

D. Loss Function

Spatial Loss: Charbonnier loss for image reconstruction quality.
Trajectory-Aware Loss ( $L_{trj}$ ): A novel loss function that supervises the trajectory generation process. It ensures the generated trajectories in the LR domain align with the downsampled trajectories of the HR ground truth, optimizing the accuracy of token selection during training.
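
The two terms can be sketched in NumPy as follows. The Charbonnier loss has the standard form $\sqrt{(x - y)^2 + \epsilon^2}$ averaged over pixels. The trajectory term is only a guess at the general shape: it assumes an L1 distance between LR trajectory coordinates and HR trajectory coordinates divided by the scale factor. The function names, the weight lam, the value of eps, and the scale of 4 are all assumptions, not values taken from the paper.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth, L1-like reconstruction loss: mean of sqrt(diff^2 + eps^2)."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def trajectory_loss(traj_lr, traj_hr, scale=4):
    """Hypothetical L1 form: LR coordinates vs. HR coordinates mapped to LR scale."""
    return float(np.mean(np.abs(traj_lr - traj_hr / scale)))

rng = np.random.default_rng(0)
hr_target = rng.random((3, 32, 32))                  # ground-truth HR frame
hr_pred = hr_target + 0.1                            # prediction off by 0.1 everywhere

traj_hr = rng.uniform(0, 32, size=(5, 8, 2))         # (frames, tokens, xy) in HR pixels
traj_lr = traj_hr / 4 + 0.5                          # LR trajectories off by half a pixel

lam = 0.1                                            # assumed weighting between the terms
total = charbonnier_loss(hr_pred, hr_target) + lam * trajectory_loss(traj_lr, traj_hr)
```

The point of the second term is that token selection can only be as good as the trajectories it follows, so supervising the trajectories directly gives the model a training signal that the reconstruction loss alone provides only indirectly.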

3. Key Contributions

First SSM-based Online VSR: The authors present TS-Mamba as the first model to apply State Space Models (Mamba) to online VSR. It aggregates long-term spatio-temporal information at the token level, unlike existing CNN-based methods limited to single-frame alignment.
Trajectory-Aware Token Selection: It introduces video trajectories to select the most similar tokens from previous frames, enabling efficient long-range modeling without the overhead of processing all frames.
Shifted SSM Blocks: The authors design novel Shifted SSM blocks based on Hilbert scanning and specific shift operations. These compensate for intra-window and inter-window scanning losses, strengthening the spatial continuity of Mamba.
Efficiency: The method achieves state-of-the-art performance with a 22.7% reduction in MACs (Multiply-Accumulate operations) compared to other SOTA online VSR models.

4. Experimental Results

The method was evaluated on three standard datasets: REDS4, Vid4, and Vimeo-90K-T, under both Bicubic (BI) and Blur (BD) degradations.

Performance: TS-Mamba achieves state-of-the-art (SOTA) PSNR and SSIM among online VSR methods, outperforming baselines such as BasicVSR++, FDAN, KSNet, and TMP.
Efficiency:
- Complexity: Reduces MACs by 22.7% compared to other SOTA online VSR models.
- Speed: Achieves an inference speed of 33.5 FPS on 180x320 LR frames (720p output at 4x upscaling), which clears the 24 FPS threshold for real-time use.
Ablation Studies:
- Removing trajectory generation or the trajectory-aware loss significantly drops performance.
- Removing the shift operations (IntraWCB/InterWCB) confirms that the "Scan-Shift-Scan" mechanism is crucial for recovering spatial continuity and boosting PSNR.
- The optimal number of selected tokens ( $s$ ) was found to be 3, balancing complexity and performance.
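
The speed figure reported above can be sanity-checked with a few lines of arithmetic. The 4x scale factor is inferred from the 180x320 input and 720p output, and 24 FPS is taken as the real-time threshold mentioned in the summary.

```python
fps = 33.5                      # reported inference speed
lr_h, lr_w = 180, 320           # reported LR input size
scale = 4                       # assumed: 180 * 4 = 720

frame_ms = 1000 / fps           # time budget actually used per frame, in ms
hr_h, hr_w = lr_h * scale, lr_w * scale
realtime_budget_ms = 1000 / 24  # budget per frame at 24 FPS

print(f"{frame_ms:.1f} ms per frame, output {hr_w}x{hr_h}")
print(f"real-time budget at 24 FPS: {realtime_budget_ms:.1f} ms")
```

At 33.5 FPS each frame takes just under 30 ms, which fits inside the roughly 41.7 ms available per frame at 24 FPS.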

5. Significance

This paper bridges the gap between high-efficiency State Space Models and real-time video processing.

Theoretical Impact: It addresses a known weakness of applying Mamba to 2D vision tasks (spatial discontinuity introduced by flattening) through the "Scan-Shift-Scan" mechanism.
Practical Impact: By enabling long-range temporal modeling with linear complexity, TS-Mamba allows for high-quality video restoration in resource-constrained, real-time environments (e.g., live broadcasting), where previous high-quality methods were too computationally expensive.
Future Direction: It opens a new avenue for using SSMs in video tasks, suggesting that trajectory-guided token selection and scanning optimization are key to unlocking the full potential of Mamba in spatio-temporal domains.