ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

The paper proposes ProGIC, a progressive and lightweight generative image compression codec based on residual vector quantization and a compact backbone, which achieves significant bitrate savings, faster encoding/decoding speeds, and flexible progressive transmission compared to existing methods.

Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

Published 2026-03-04

Imagine you are trying to send a high-definition photo of a sunset to a friend, but you are stuck in a place with terrible internet—like a satellite phone in the middle of a forest or a remote mountain. You need to send the image, but the connection is so slow that sending the whole file would take forever.

The Problem with Current Methods:
Most modern image compression tools are like heavy, expensive delivery trucks. They are great at packing a lot of stuff efficiently, but they are too big and slow to drive on narrow, bumpy roads (low-bandwidth networks).

  • Traditional methods (like JPEG) try to shrink the file by throwing away details, resulting in blurry, blocky images.
  • New "Generative" methods (AI that "imagines" the missing details) produce beautiful, sharp images, but they require massive, power-hungry models to run. It's like taking a Ferrari down a dirt path: impressive machinery, but far too complex and ill-suited for the job.

The Solution: ProGIC
The authors of this paper propose ProGIC (Progressive Generative Image Compression). Think of ProGIC not as a delivery truck, but as a smart, modular LEGO set that can be built piece by piece.

Here is how it works, using three simple analogies:

1. The "Sketch-to-Painting" Analogy (Progressive Decoding)

Imagine an artist drawing a portrait.

  • Old Way: You have to wait until the artist finishes the entire painting before you can see anything. If the internet cuts out halfway through, you get nothing.
  • ProGIC Way: The artist starts with a rough sketch (the base layer). You can see the face immediately! Then, they add shading (the second layer). Now you can see the lighting. Finally, they add fine details like eyelashes and skin texture (the final layers).
  • Why it matters: With ProGIC, as soon as the first few bytes of data arrive, your phone shows a usable, low-quality preview. As more data trickles in, the image gets sharper and clearer. You don't have to wait for the whole file to see what you're looking at.

2. The "Residual Vector Quantization" (RVQ) Analogy

How does the AI know what to draw at each step without sending the whole picture? It uses a technique called Residual Vector Quantization (RVQ).

  • Think of it like a dictionary of shapes.
  • Step 1: The AI looks at the image and says, "Okay, the general shape is a circle." It sends the code for "Circle."
  • Step 2: The AI looks at what's missing (the "residual"). It says, "The circle is a bit off-center and has a bump." It sends the code for "Bump."
  • Step 3: It looks at the tiny details. "There's a speck of dust." It sends the code for "Dust."
  • Instead of sending the whole image, it sends a sequence of small codes that add up to the final picture. This allows the receiver to stop at any point and still have a recognizable image.
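The circle-bump-dust steps above can be sketched in a few lines. This is a deliberately simplified, one-dimensional toy (real codecs like ProGIC quantize learned feature vectors with learned codebooks; the coarse-to-fine number grids here are purely illustrative):

```python
import numpy as np

# Toy coarse-to-fine "dictionaries": each stage's codebook covers a finer scale.
# (Hypothetical 1-D codebooks for illustration; real RVQ uses learned vector codebooks.)
codebooks = [
    np.arange(0.0, 8.0, 1.0),      # stage 1: whole units   ("the circle")
    np.arange(-0.5, 0.6, 0.1),     # stage 2: tenths        ("the bump")
    np.arange(-0.05, 0.06, 0.01),  # stage 3: hundredths    ("the dust")
]

def rvq_encode(x, codebooks):
    """At each stage, send the nearest codeword to whatever is still missing."""
    codes, residual = [], x
    for cb in codebooks:
        idx = int(np.argmin(np.abs(cb - residual)))  # nearest dictionary entry
        codes.append(idx)
        residual -= cb[idx]                          # the leftover "residual"
    return codes

def rvq_decode(codes, codebooks):
    """Sum the received codewords; stopping early still yields a rough version."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

x = 3.67
codes = rvq_encode(x, codebooks)

# The receiver can stop after any stage and still have an answer:
for k in range(1, 4):
    print(k, round(rvq_decode(codes[:k], codebooks), 2))  # 4.0, then 3.7, then 3.67
```

Each extra code refines the previous guess rather than replacing it, which is exactly why a partially received bitstream still decodes to a usable image.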

3. The "Lightweight Backpack" Analogy (Efficiency)

Most AI image tools are like heavy hiking backpacks filled with bricks (massive computer models). They need powerful computers (GPUs) to carry them.

  • ProGIC's Innovation: The authors built a lightweight backpack using "depthwise-separable convolutions." Imagine replacing those heavy bricks with feather-light foam.
  • The Result: This backpack is so light that an ordinary hiker (a standard mobile phone or a laptop CPU) can carry it without breaking a sweat. It runs 10 times faster than the heavy competitors, making it possible to compress and decompress images instantly on your phone, even without a powerful graphics card.
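A quick back-of-the-envelope calculation shows why depthwise-separable convolutions are so light. A standard convolution mixes spatial positions and channels in one big kernel; the separable version splits that into a cheap per-channel spatial filter plus a 1x1 channel mixer. (The layer sizes below are illustrative, not taken from the paper's actual architecture.)

```python
# Parameter count: standard conv vs. depthwise-separable conv.
# Illustrative sizes: 128 input channels, 128 output channels, 3x3 kernels.
c_in, c_out, k = 128, 128, 3

standard = c_in * c_out * k * k   # one full 3x3xC_in kernel per output channel
depthwise = c_in * k * k          # one small 3x3 spatial filter per input channel
pointwise = c_in * c_out          # 1x1 conv that mixes channels
separable = depthwise + pointwise

print(standard)                          # 147456 weights
print(separable)                         # 17536 weights
print(round(standard / separable, 1))    # ~8.4x fewer parameters
```

The same trick cuts multiply-accumulate operations by a similar factor, which is what makes CPU-only decoding plausible.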

The Real-World Impact

The paper demonstrates this in a satellite communication scenario (like a forest fire response team):

  • Scenario: A ranger sees a fire and needs to send a photo to headquarters. The satellite link is slow and sends data in tiny chunks every 60 seconds.
  • Without ProGIC: The ranger waits 5 minutes for the full image to download. By then, the fire might have spread.
  • With ProGIC:
    • Second 0-60: A blurry, low-res image appears. "I see smoke!"
    • Second 60-120: The image gets clearer. "I see the fire is near the river."
    • Second 120+: The image is sharp. "I see the exact location of the flames."
    • Result: The team can react immediately, even while the image is still "loading."

Summary

ProGIC is a new way to send images that is:

  1. Progressive: You see a rough draft immediately, and it gets better as data arrives (no more "loading..." spinners).
  2. Lightweight: It runs fast on regular phones and laptops, not just supercomputers.
  3. High Quality: It uses AI to "hallucinate" (guess) missing details intelligently, so the image looks great even when the file size is tiny.

It's the difference between waiting for a slow, heavy truck to deliver a package, versus receiving a live video feed that gets clearer the longer you watch.