Communication-Inspired Tokenization for Structured Image Representations

This paper introduces COMiT, a communication-inspired tokenization framework that iteratively refines discrete visual tokens with a recurrent transformer. The resulting structured, object-centric image representations significantly improve compositional generalization and relational reasoning compared to existing reconstruction-focused methods.

Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

Published 2026-02-25

Imagine you are trying to describe a complex painting to a friend over a very slow, unreliable radio connection. You only have a limited number of "words" (tokens) to send, and you need to make sure your friend can reconstruct the entire picture in their mind based on your description.

Most current AI systems try to do this by taking a giant, blurry snapshot of the whole painting, squishing it into a tiny file, and hoping the friend can guess the details. This works okay for simple things, but it often misses the big picture: what the objects are and how they relate to each other. The AI ends up remembering the texture of the grass better than the fact that there is a dog running on it.

COMiT (Communication-inspired Tokenization) is a new way of teaching AI to "talk" about images, inspired by how humans actually describe scenes to one another.

Here is the breakdown using simple analogies:

1. The Old Way: The "Blind Photographer"

Traditional AI tokenizers act like a photographer who takes a picture, immediately shrinks it down to a tiny thumbnail, and sends that thumbnail away.

  • The Problem: To make the file small, the AI has to throw away details. It often keeps the "grain" of the photo (the texture) but loses the "story" (the objects). It's like sending a friend a photo of a forest where they can see the leaves perfectly but can't tell if there's a bear hiding behind a tree.

2. The COMiT Way: The "Guided Tour Guide"

COMiT changes the game. Instead of sending a static thumbnail, it acts like a tour guide giving a live, step-by-step description of the scene.

  • The Process: Imagine the AI is looking at the image through a small window (a "crop").
    1. It looks at the left side and says, "Okay, I see a red ball."
    2. It moves the window to the right and says, "Now I see a blue dog."
    3. It updates its mental note: "Red ball, blue dog."
    4. It moves again, sees a tree, and updates the note: "Red ball, blue dog, green tree."

This is the "Attentive Sequential" part. The AI doesn't try to swallow the whole image at once. It builds the description piece by piece, just like a human speaker would.
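The step-by-step loop above can be sketched in a few lines of toy Python. This is purely illustrative, not the paper's architecture: the `refine` function stands in for one pass of the recurrent transformer, and the "crops" are just labels for clarity.

```python
# Toy sketch of the attentive-sequential idea: the tokenizer scans the
# image crop by crop and refines a running list of tokens at each step.

def describe_scene(crops, refine):
    """Build a token list one crop at a time.

    crops  : iterable of image regions (here, labels for clarity)
    refine : function (tokens, crop) -> new tokens; a stand-in for one
             pass of the recurrent transformer
    """
    tokens = []                       # the "mental note" starts empty
    for crop in crops:                # slide the attention window
        tokens = refine(tokens, crop)
    return tokens

# A trivial "refiner" that appends whatever object the crop shows.
# A real refiner could also revise earlier tokens, not just append.
def toy_refine(tokens, crop):
    return tokens + [crop]

print(describe_scene(["red ball", "blue dog", "green tree"], toy_refine))
# → ['red ball', 'blue dog', 'green tree']
```

The key structural point is that the token list is carried forward and updated across steps, rather than produced in one shot from the whole image.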

3. The "Speaker and Listener" are the Same Person

In most AI systems, there is one brain for "speaking" (encoding the image) and a different brain for "listening" (reconstructing the image).

COMiT is unique because it uses one single brain to do both.

  • The Analogy: Imagine you are trying to memorize a scene for a test. You look at it, describe it to yourself, and then try to redraw it from memory. Then you check your drawing, see what you missed, look at the scene again, and refine your memory.
  • COMiT does this loop. It acts as the "speaker" building the message, and then immediately acts as the "listener" trying to draw the picture back from that message. Because it's the same brain doing both, it learns exactly how to organize the information so it's easy to remember and easy to redraw.
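The describe-then-redraw loop can be sketched as follows. Note the hedge: in the actual model, "speaking" and "listening" are performed by the same network with shared weights; here they are separate toy functions (`speak`, `listen` are hypothetical names) only to make the loop structure visible.

```python
# Minimal sketch of the speaker/listener refinement loop. In COMiT both
# roles are played by one network; these two functions are a toy
# separation purely for readability.

def speak(scene, message):
    """Speaker role: add one missing item from the scene to the message."""
    for obj in scene:
        if obj not in message:
            return message + [obj]
    return message

def listen(message):
    """Listener role: 'redraw' the scene from the message alone."""
    return list(message)

def refine_loop(scene, steps):
    message = []
    for _ in range(steps):
        message = speak(scene, message)   # describe what you see
        redraw = listen(message)          # reconstruct from memory
        if redraw == scene:               # check the drawing; stop if done
            break
    return message

print(refine_loop(["red ball", "blue dog"], steps=5))
# → ['red ball', 'blue dog']
```

The loop terminates as soon as the redraw matches the scene, mirroring the idea that refinement continues only while the message is still missing information.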

4. The "Game" of Reconstruction

The training process is like a game of "Telephone" but with a twist.

  • The AI is given a noisy, blurry version of an image.
  • It has to "clean up" the image using the message it built from the tour guide steps.
  • If the message was messy (e.g., "red thing, blue thing, green thing" without knowing what they are), the AI fails to draw the picture correctly.
  • If the message is structured (e.g., "Red ball on the left, Blue dog on the right"), the AI can draw a perfect picture.
  • The AI gets punished (loses points) if the picture it draws doesn't match the original. Over millions of tries, it learns to organize its "words" so they make perfect sense.
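The scoring in this game can be made concrete with a toy denoising round. Everything below is illustrative (the function names and the list-of-labels "image" are invented for this sketch): a corrupted scene must be restored using only the token message, and the mismatch count plays the role of the reconstruction loss.

```python
# Toy version of the reconstruction "game": hide parts of the scene,
# restore them from the message, and score the restoration.

import random

def corrupt(scene, drop_prob, rng):
    """Blur the scene by hiding some objects."""
    return [obj if rng.random() > drop_prob else None for obj in scene]

def reconstruct(noisy, message):
    """Fill the gaps using the message, position by position."""
    return [obj if obj is not None else message.get(i)
            for i, obj in enumerate(noisy)]

def loss(original, drawn):
    """Count mismatches -- the 'punishment' signal during training."""
    return sum(a != b for a, b in zip(original, drawn))

rng = random.Random(0)
scene = ["red ball", "blue dog", "green tree"]
noisy = corrupt(scene, drop_prob=0.5, rng=rng)

# A structured message ("ball on the left, dog on the right, ...")
# pins each object to its place, so the gaps can be filled exactly.
structured = {i: obj for i, obj in enumerate(scene)}
print(loss(scene, reconstruct(noisy, structured)))  # → 0
```

An unstructured message (objects without positions) would leave some gaps filled wrongly and score a nonzero loss, which is exactly the pressure that pushes the tokens toward structure.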

5. Why This Matters: "Object-Centric" Thinking

The biggest win for COMiT is that it naturally learns to group things by objects, not just pixels.

  • Old AI: "Here are 100 pixels that look like fur."
  • COMiT: "Here is a token for 'Dog' and a token for 'Ball'."

Because the AI builds the message step-by-step, it learns that "Dog" is a distinct entity that can be moved around or related to other things. This makes the AI much better at reasoning. If you ask, "Is the dog chasing the ball?", a traditional AI might get confused because it just sees a blur of fur and red. COMiT knows exactly where the dog is and where the ball is, so it can answer the question correctly.

Summary

Think of COMiT as teaching an AI to write a story about an image instead of just taking a compressed photo.

  • It looks at the image in chunks.
  • It builds a mental list of objects.
  • It uses that list to redraw the image.
  • Because it practices this "storytelling" loop, it ends up with a much smarter, more organized understanding of the world, making it better at complex tasks like understanding relationships between objects or generalizing to new scenes.

In short: Old AI tries to shrink the image. COMiT tries to understand the story inside the image.
