Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

This paper proposes a novel visual representation framework that encodes signals as functions parametrized by low-rank adaptations on frozen diffusion models, enabling compact storage via single-vector hashing and bridging visual compression with generation through inference-time scaling and control.

Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine an incredibly talented artist who has spent their entire life studying every painting, photograph, and movie ever made. This artist knows exactly how light hits a leaf, how water ripples, and how a smile forms. They are a "Foundation Model" of visual knowledge.

Now, imagine you want to send a specific, unique video of your cat playing with a laser pointer to a friend.

The Old Way: Sending the Whole Painting

Traditionally, to send this video, you would have to take a photo of every single frame, break it down into millions of tiny colored dots (pixels), and send that massive file. Even with compression, it's like mailing a giant, heavy crate full of bricks. You are sending the result (the pixels), not the idea of the cat.

The New Way: Sending the "Recipe"

This paper proposes a radical new idea: Don't send the video. Send the instructions on how to recreate it.

Think of it like this: Instead of mailing a fully baked cake to your friend, you mail them a tiny, secret recipe card. Your friend already has a world-class bakery (the AI model) and all the ingredients. They just need your specific recipe to bake your cake.

Here is how the paper's method works, broken down into simple steps:

1. The "Secret Recipe" (Implicit Representation)

The authors realized that the AI artist already knows 99% of what a video looks like. They don't need to be told "a cat has fur" or "the sun is bright." They just need to know what makes this specific cat video different from the millions of other cat videos they've seen.

They treat the video not as a file, but as a mathematical function (a recipe). They tweak the AI's internal settings just enough to make it "dream" your specific video. These tiny tweaks are called LoRA (Low-Rank Adaptation).

  • Analogy: Imagine the AI is a giant, complex piano. The video isn't a recording of the music; it's a tiny, specific set of instructions on which keys to press slightly harder or softer to play your song.
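To make "tiny tweaks on a frozen model" concrete, here is a toy sketch of the low-rank adaptation idea in plain Python. This is an illustration of the general LoRA trick, not the paper's implementation; the matrix sizes and values are made up for the example:

```python
import random

def matmul(A, B):
    # Plain-Python matrix multiply: (n x k) @ (k x m) -> (n x m)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# A frozen "foundation model" weight matrix W (d x d) -- stands in for
# one layer of the pretrained diffusion model. It is never updated.
d, r = 4, 1          # full dimension vs. tiny LoRA rank (r << d)
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# The "recipe": two skinny matrices A (d x r) and B (r x d).
# Only these 2*d*r = 8 numbers are trained and transmitted, instead of
# the d*d = 16 numbers a full weight update would need -- and the gap
# grows enormously for real layer sizes.
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
B = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]

delta = matmul(A, B)  # low-rank update: rank at most r
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

print(len(A) * len(A[0]) + len(B) * len(B[0]))  # parameters sent: 8
```

The receiver holds the same frozen W, so only A and B (the "which keys to press harder" instructions) ever travel over the wire.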

2. The "Magic Zipper" (One-Vector Compression)

Usually, these "recipe" instructions are still a bit big. The authors found a clever trick to shrink them down even further. They use a technique called hashing to squeeze all those tiny instructions into a single, tiny vector (a list of numbers).

  • Analogy: It's like taking a 100-page cookbook and compressing it into a single, tiny QR code. When your friend scans that QR code, their world-class bakery instantly knows exactly how to bake your cake.
  • The Result: A whole video (81 frames) is compressed into a single, tiny data packet. This is "extremely low bitrate" compression.

3. The "Chef's Touch" (Inference-Time Scaling)

Here is the coolest part. Because you aren't sending a static file, but a recipe, you can change how the cake is baked after you've sent the recipe.

If the first version of the cake isn't perfect, your friend (the decoder) can use the same recipe but try baking it a few different ways, taste-test the results, and pick the best one. This is called Inference-Time Scaling.

  • Analogy: It's like sending a recipe to a friend, but telling them, "If the cake looks a bit dry, try adding a little more vanilla. If it's too sweet, add a pinch of salt." You can refine the quality on the fly without needing to send a new file.
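The taste-test loop above is essentially best-of-N sampling: decode the same recipe several times with different random seeds, score each result, and keep the winner. The `decode` and `quality` functions here are toy stand-ins (a real system would run the diffusion decoder and a learned or rate-distortion-style scorer):

```python
import random

def decode(recipe, seed):
    # Stand-in for running the diffusion decoder with a given noise seed:
    # same recipe, different randomness, slightly different "cake".
    rng = random.Random(seed)
    return [recipe + rng.gauss(0, 0.5) for _ in range(3)]

def quality(candidate, target):
    # Stand-in for a quality score (higher is better):
    # negative squared error against the intended signal.
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

target = [1.0, 1.0, 1.0]   # what the sender's video "looked like"
recipe = 1.0               # the transmitted instructions

# Inference-time scaling: decode the SAME recipe N times with
# different seeds, score each attempt, keep the best one.
candidates = [decode(recipe, seed) for seed in range(8)]
best = max(candidates, key=lambda c: quality(c, target))

# Spending more compute can only help: the best of 8 attempts is at
# least as good as the first attempt alone.
assert quality(best, target) >= quality(candidates[0], target)
```

No new bits are sent for any of this; the receiver simply trades extra compute for a better reconstruction.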

4. The "Memory Bank" (Visual Memory)

Because this "recipe" is attached to the AI's brain, it acts like a visual memory.

If you send the recipe for your cat, the AI now "remembers" your cat. Later, you can ask the AI to draw your cat wearing a hat, or your cat in a forest, simply by changing the text prompt. The AI doesn't need the original video anymore; it has the "memory" of your cat stored in those tiny adjustments.

  • Analogy: It's like teaching a friend a specific dance move. Once they learn it, they can do that move in any song, in any style, without you having to show them the video again.

Why is this a big deal?

  1. Super Small Files: You can send high-quality videos using almost no data (like sending a text message instead of a movie file).
  2. Smarter Compression: It doesn't just shrink the file; it understands the meaning of the video.
  3. Flexible: You can improve the quality or edit the video after it's been compressed, which is impossible with traditional video files.

In short: This paper turns video compression from "sending a heavy box of bricks" into "sending a tiny, magical instruction card that tells a super-smart artist exactly how to recreate your world."