Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

This paper proposes a novel visual representation framework that encodes signals as functions parametrized by low-rank adaptations on frozen diffusion models, enabling compact storage via single-vector hashing and bridging visual compression with generation through inference-time scaling and control.

Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

Published 2026-03-10
📖 4 min read · ☕ Coffee break read

Imagine an incredibly talented artist who has spent their entire life studying every painting, photograph, and movie ever made. This artist knows exactly how light hits a leaf, how water ripples, and how a smile forms. They are a "Foundation Model" of visual knowledge.

Now, imagine you want to send a specific, unique video of your cat playing with a laser pointer to a friend.

The Old Way: Sending the Whole Painting

Traditionally, to send this video, you would have to take a photo of every single frame, break it down into millions of tiny colored dots (pixels), and send that massive file. Even with compression, it's like mailing a giant, heavy crate full of bricks. You are sending the result (the pixels), not the idea of the cat.

The New Way: Sending the "Recipe"

This paper proposes a radical new idea: Don't send the video. Send the instructions on how to recreate it.

Think of it like this: Instead of mailing a fully baked cake to your friend, you mail them a tiny, secret recipe card. Your friend already has a world-class bakery (the AI model) and all the ingredients. They just need your specific recipe to bake your cake.

Here is how the paper's method works, broken down into simple steps:

1. The "Secret Recipe" (Implicit Representation)

The authors realized that the AI artist already knows 99% of what a video looks like. They don't need to be told "a cat has fur" or "the sun is bright." They just need to know what makes this specific cat video different from the millions of other cat videos they've seen.

They treat the video not as a file, but as a mathematical function (a recipe). They tweak the AI's internal settings just enough to make it "dream" your specific video. These tiny tweaks are called LoRA (Low-Rank Adaptation).

  • Analogy: Imagine the AI is a giant, complex piano. The video isn't a recording of the music; it's a tiny, specific set of instructions on which keys to press slightly harder or softer to play your song.
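To make "tiny tweaks on a frozen model" concrete, here is a toy sketch of the low-rank adaptation idea in plain Python. This is an illustration of the general LoRA trick, not the paper's implementation; the matrix sizes and values are made up for the example:

```python
import random

def matmul(A, B):
    # Plain-Python matrix multiply: (n x k) @ (k x m) -> (n x m)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# A frozen "foundation model" weight matrix W (d x d) -- stands in for
# one layer of the pretrained diffusion model. It is never updated.
d, r = 4, 1          # full dimension vs. tiny LoRA rank (r << d)
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# The "recipe": two skinny matrices A (d x r) and B (r x d).
# Only these 2*d*r = 8 numbers are trained and transmitted, instead of
# the d*d = 16 numbers a full weight update would need -- and the gap
# grows enormously for real layer sizes.
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
B = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]

delta = matmul(A, B)  # low-rank update: rank at most r
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

print(len(A) * len(A[0]) + len(B) * len(B[0]))  # parameters sent: 8
```

The receiver holds the same frozen W, so only A and B (the "which keys to press harder" instructions) ever travel over the wire.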

2. The "Magic Zipper" (One-Vector Compression)

Usually, these "recipe" instructions are still a bit big. The authors found a clever trick to shrink them down even further. They use a technique called hashing to squeeze all those tiny instructions into a single, tiny vector (a list of numbers).

  • Analogy: It's like taking a 100-page cookbook and compressing it into a single, tiny QR code. When your friend scans that QR code, their world-class bakery instantly knows exactly how to bake your cake.
  • The Result: A whole video (81 frames) is compressed into a single, tiny data packet. This is "extremely low bitrate" compression.

3. The "Chef's Touch" (Inference-Time Scaling)

Here is the coolest part. Because you aren't sending a static file, but a recipe, you can change how the cake is baked after you've sent the recipe.

If the first version of the cake isn't perfect, your friend (the decoder) can use the same recipe but try baking it a few different ways, taste-test the results, and pick the best one. This is called Inference-Time Scaling.

  • Analogy: It's like sending a recipe to a friend, but telling them, "If the cake looks a bit dry, try adding a little more vanilla. If it's too sweet, add a pinch of salt." You can refine the quality on the fly without needing to send a new file.
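The taste-test loop above is essentially best-of-N sampling: decode the same recipe several times with different random seeds, score each result, and keep the winner. The `decode` and `quality` functions here are toy stand-ins (a real system would run the diffusion decoder and a learned or rate-distortion-style scorer):

```python
import random

def decode(recipe, seed):
    # Stand-in for running the diffusion decoder with a given noise seed:
    # same recipe, different randomness, slightly different "cake".
    rng = random.Random(seed)
    return [recipe + rng.gauss(0, 0.5) for _ in range(3)]

def quality(candidate, target):
    # Stand-in for a quality score (higher is better):
    # negative squared error against the intended signal.
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

target = [1.0, 1.0, 1.0]   # what the sender's video "looked like"
recipe = 1.0               # the transmitted instructions

# Inference-time scaling: decode the SAME recipe N times with
# different seeds, score each attempt, keep the best one.
candidates = [decode(recipe, seed) for seed in range(8)]
best = max(candidates, key=lambda c: quality(c, target))

# Spending more compute can only help: the best of 8 attempts is at
# least as good as the first attempt alone.
assert quality(best, target) >= quality(candidates[0], target)
```

No new bits are sent for any of this; the receiver simply trades extra compute for a better reconstruction.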

4. The "Memory Bank" (Visual Memory)

Because this "recipe" is attached to the AI's brain, it acts like a visual memory.

If you send the recipe for your cat, the AI now "remembers" your cat. Later, you can ask the AI to draw your cat wearing a hat, or your cat in a forest, simply by changing the text prompt. The AI doesn't need the original video anymore; it has the "memory" of your cat stored in those tiny adjustments.

  • Analogy: It's like teaching a friend a specific dance move. Once they learn it, they can do that move in any song, in any style, without you having to show them the video again.

Why is this a big deal?

  1. Super Small Files: You can send high-quality videos using almost no data (like sending a text message instead of a movie file).
  2. Smarter Compression: It doesn't just shrink the file; it understands the meaning of the video.
  3. Flexible: You can improve the quality or edit the video after it's been compressed, which is impossible with traditional video files.

In short: This paper turns video compression from "sending a heavy box of bricks" into "sending a tiny, magical instruction card that tells a super-smart artist exactly how to recreate your world."