Imagine you are trying to send a massive, high-definition movie to a friend over a slow internet connection. You need to shrink the file size (compress it) without making the movie look like a blurry, pixelated mess.
For decades, the standard way to do this has been like sending a puzzle. You send the first picture, and then for every subsequent picture, you only send the changes (the puzzle pieces that moved) and a map telling the receiver where to move them. This is called "motion estimation" (the receiver's reassembly step is "motion compensation"). It works, but it's complicated, like trying to describe a dance move by listing every single muscle twitch.
This paper introduces a new, smarter way to shrink video files. Instead of sending a puzzle with a map, they invented a "Magic Shrinker" that understands the movie as a whole, flowing story. Here is how they did it, broken down into simple concepts:
1. The Core Idea: Stop Counting Steps, Start Reading the Story
Traditional methods try to calculate exactly how every pixel moves from one frame to the next. It's like trying to describe a river by measuring the speed of every single water droplet.
The authors say: "Why not just read the story?"
They use a new type of AI brain (based on something called Mamba) that doesn't just look at one frame and guess the next. Instead, it scans the video in four different directions (forward, backward, up, down) simultaneously.
- The Analogy: Imagine reading a book. Old methods read one page, stop, calculate how the characters moved, and then read the next. This new method reads the whole chapter at once, understanding the flow of the story from start to finish, so it knows exactly what to expect next without needing a map.
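The "running story" idea can be sketched with a toy linear state-space recurrence, the mechanism Mamba builds on. A single hidden state is updated frame by frame, so it carries context from every earlier frame, not just the previous one. (Mamba makes the parameters input-dependent and "selective"; the scalars below are purely illustrative, not the paper's values.)

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Toy state-space scan: h accumulates context across the whole
    sequence, so the output at each step reflects all earlier inputs."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # fold the new frame feature into the state
        ys.append(c * h)    # read out a prediction from the state
    return ys

frames = [1.0, 0.0, 0.0, 0.0]
ys = ssm_scan(frames)
# The first frame's influence decays but never vanishes:
# ys == [0.5, 0.45, 0.405, 0.3645]
```

Contrast this with the old pairwise approach, which only ever compares frame t to frame t-1: here, frame 1 still shapes the prediction at frame 4.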
2. The "Cascaded Mamba" (The Time-Traveling Scanner)
The paper introduces a "Cascaded Mamba Module." Think of this as a super-sleuth that looks at the video in four different "time-travel" modes:
- Forward: Watching the movie normally.
- Backward: Rewinding the movie.
- Fast-Forward: Jumping through the frames while staying fixed on the same spot in the scene, from the earliest frame to the latest.
- Fast-Reverse: The same fixed-spot scan, but from the latest frame back to the earliest.
By scanning the video in all these directions at once, the AI captures long-range dependencies. It understands that a cloud drifting slowly across the sky in frame 1 is the same cloud in frame 50, even though dozens of frames separate them. That lets it predict what comes next accurately enough that it needs to send far less data.
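Mechanically, the multi-directional scans amount to flattening the same video volume into different 1-D orderings before feeding it to the sequence model. Here is a minimal sketch on a tiny (time, height, width) array; the paper's exact scan orderings may differ, so treat the four names below as assumptions:

```python
import numpy as np

# A tiny "video": 2 frames of 2x3 pixels, shape (T, H, W).
video = np.arange(12).reshape(2, 2, 3)

def scan_orders(x):
    """Flatten a (T, H, W) volume into four 1-D scan sequences."""
    flat = x.reshape(-1)                                 # frame-by-frame raster scan
    spot = x.transpose(1, 2, 0).reshape(-1)              # same pixel across all times first
    return {
        "forward":      flat,        # watch the movie normally
        "backward":     flat[::-1],  # rewind it
        "fast_forward": spot,        # fixed spot, early frames to late
        "fast_reverse": spot[::-1],  # fixed spot, late frames to early
    }

orders = scan_orders(video)
# Each ordering visits all 12 values, just in a different sequence.
assert all(sorted(seq.tolist()) == list(range(12)) for seq in orders.values())
```

Each ordering puts different pairs of pixels next to each other in the sequence, which is what lets the model relate the same spot across distant frames.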
3. The "Locality Refinement" (The Detail-Oriented Artist)
While the "Time-Traveling Scanner" is great at seeing the big picture, it sometimes misses tiny details (like the texture of a car's paint or the leaves on a tree).
To fix this, the authors added a Locality Refinement Feed-Forward Network (LRFFN).
- The Analogy: If the Mamba module is the director who sees the whole movie scene, the LRFFN is the makeup artist. It zooms in on specific spots and uses special "difference brushes" (Difference Convolutions) to focus only on the changes and edges. It ignores the boring, flat parts and only sharpens the interesting details. This ensures the video stays crisp even when the file size is tiny.
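A "difference brush" can be sketched as a central-difference convolution: a vanilla convolution mixed with a term that subtracts the patch centre, so flat regions produce no response and edges stand out. (The blending weight `theta` and this exact formulation are assumptions for illustration, not the paper's definition.)

```python
import numpy as np

def central_diff_conv2d(x, w, theta=0.7):
    """Single-channel central-difference convolution (no padding):
    output = vanilla conv - theta * (centre pixel * sum of weights)."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw]
            vanilla = (patch * w).sum()
            centre = patch[kh // 2, kw // 2] * w.sum()
            out[i, j] = vanilla - theta * centre
    return out

flat = np.ones((5, 5))                 # a boring, perfectly flat region
k = np.ones((3, 3)) / 9.0              # simple averaging kernel
# With theta=1, a flat patch produces exactly zero response.
assert np.allclose(central_diff_conv2d(flat, k, theta=1.0), 0.0)
```

This is why the network can "ignore the boring, flat parts": by construction, uniform areas cancel out and only changes survive.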
4. The "Conditional Entropy Model" (The Smart Predictor)
Finally, there's the question of how to pack the data into the smallest possible box. This is where the Conditional Channel-wise Entropy Model comes in.
- The Analogy: Imagine you are packing a suitcase. A normal packer just throws clothes in randomly. A smart packer knows, "Oh, I just packed a heavy winter coat, so the next item is probably a light shirt."
- This model looks at the video frames that just arrived and uses them to guess what the current frame will look like. It creates a "prediction" (a pseudo-ground truth) of the motion. Because it can predict so well, it only needs to send the tiny "mistakes" in the prediction, rather than the whole picture.
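The payoff of good prediction is measurable in bits: an entropy coder spends roughly -log2 p(symbol) bits, where p comes from the model's predicted distribution. A minimal sketch with a discretised Gaussian (the paper's entropy model is more elaborate; the function below is a generic illustration, not its API):

```python
import math

def gaussian_bits(value, mean, scale):
    """Approximate bits to code integer `value` under a Gaussian with
    the predicted (mean, scale): probability mass of the unit bin
    around `value`, then -log2 of it."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mean) / (scale * math.sqrt(2))))
    p = max(cdf(value + 0.5) - cdf(value - 0.5), 1e-12)
    return -math.log2(p)

actual = 10
good = gaussian_bits(actual, mean=10.0, scale=1.0)   # conditioning predicted well
bad = gaussian_bits(actual, mean=0.0, scale=1.0)     # conditioning predicted badly
assert good < bad   # accurate prediction -> fewer bits on the wire
```

The better the previous frames let the model predict the current one, the more mass the distribution puts on the true value, and the smaller the "suitcase" gets.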
Why is this a big deal?
- Simpler: It gets rid of the complex "motion estimation" math that has been the standard for 30 years.
- Smarter: It sees the video as a continuous flow of time and space, not just a stack of static images.
- Better Quality: At low data limits (like streaming on a bad connection), this method keeps the video looking sharp and smooth, without the "blocky" artifacts you usually see.
In a nutshell: The authors replaced the old, complicated "puzzle and map" system with a smart, multi-directional scanner that reads the video's story, focuses on the fine details, and predicts the future so well that it only needs to send the bare minimum of data to recreate a beautiful movie.