Communication-Inspired Tokenization for Structured Image Representations

This paper introduces COMiT, a communication-inspired tokenization framework that iteratively refines discrete visual tokens with a recurrent transformer. The resulting structured, object-centric image representations significantly improve compositional generalization and relational reasoning compared to existing reconstruction-focused methods.

Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

Published 2026-02-25

Imagine you are trying to describe a complex painting to a friend over a very slow, unreliable radio connection. You only have a limited number of "words" (tokens) to send, and you need to make sure your friend can reconstruct the entire picture in their mind based on your description.

Most current AI systems try to do this by taking a giant, blurry snapshot of the whole painting, squishing it into a tiny file, and hoping the friend can guess the details. This works okay for simple things, but it often misses the big picture: what the objects are and how they relate to each other. The AI ends up remembering the texture of the grass better than the fact that there is a dog running on it.

COMiT (Communication-inspired Tokenization) is a new way of teaching AI to "talk" about images, inspired by how humans actually describe scenes to one another.

Here is the breakdown using simple analogies:

1. The Old Way: The "Blind Photographer"

Traditional AI tokenizers act like a photographer who takes a picture, immediately shrinks it down to a tiny thumbnail, and sends that thumbnail away.

  • The Problem: To make the file small, the AI has to throw away details. It often keeps the "grain" of the photo (the texture) but loses the "story" (the objects). It's like sending a friend a photo of a forest where they can see the leaves perfectly but can't tell if there's a bear hiding behind a tree.

2. The COMiT Way: The "Guided Tour Guide"

COMiT changes the game. Instead of sending a static thumbnail, it acts like a tour guide giving a live, step-by-step description of the scene.

  • The Process: Imagine the AI is looking at the image through a small window (a "crop").
    1. It looks at the left side and says, "Okay, I see a red ball."
    2. It moves the window to the right and says, "Now I see a blue dog."
    3. It updates its mental note: "Red ball, blue dog."
    4. It moves again, sees a tree, and updates the note: "Red ball, blue dog, green tree."

This is the "Attentive Sequential" part. The AI doesn't try to swallow the whole image at once. It builds the description piece by piece, just like a human speaker would.
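The step-by-step loop above can be sketched in a few lines of toy Python. This is purely illustrative, not the paper's architecture: the `refine` function stands in for one pass of the recurrent transformer, and the "crops" are just labels for clarity.

```python
# Toy sketch of the attentive-sequential idea: the tokenizer scans the
# image crop by crop and refines a running list of tokens at each step.

def describe_scene(crops, refine):
    """Build a token list one crop at a time.

    crops  : iterable of image regions (here, labels for clarity)
    refine : function (tokens, crop) -> new tokens; a stand-in for one
             pass of the recurrent transformer
    """
    tokens = []                       # the "mental note" starts empty
    for crop in crops:                # slide the attention window
        tokens = refine(tokens, crop)
    return tokens

# A trivial "refiner" that appends whatever object the crop shows.
# A real refiner could also revise earlier tokens, not just append.
def toy_refine(tokens, crop):
    return tokens + [crop]

print(describe_scene(["red ball", "blue dog", "green tree"], toy_refine))
# → ['red ball', 'blue dog', 'green tree']
```

The key structural point is that the token list is carried forward and updated across steps, rather than produced in one shot from the whole image.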

3. The "Speaker and Listener" are the Same Person

In most AI systems, there is one brain for "speaking" (encoding the image) and a different brain for "listening" (reconstructing the image).

COMiT is unique because it uses one single brain to do both.

  • The Analogy: Imagine you are trying to memorize a scene for a test. You look at it, describe it to yourself, and then try to redraw it from memory. Then you check your drawing, see what you missed, look at the scene again, and refine your memory.
  • COMiT does this loop. It acts as the "speaker" building the message, and then immediately acts as the "listener" trying to draw the picture back from that message. Because it's the same brain doing both, it learns exactly how to organize the information so it's easy to remember and easy to redraw.
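The describe-then-redraw loop can be sketched as follows. Note the hedge: in the actual model, "speaking" and "listening" are performed by the same network with shared weights; here they are separate toy functions (`speak`, `listen` are hypothetical names) only to make the loop structure visible.

```python
# Minimal sketch of the speaker/listener refinement loop. In COMiT both
# roles are played by one network; these two functions are a toy
# separation purely for readability.

def speak(scene, message):
    """Speaker role: add one missing item from the scene to the message."""
    for obj in scene:
        if obj not in message:
            return message + [obj]
    return message

def listen(message):
    """Listener role: 'redraw' the scene from the message alone."""
    return list(message)

def refine_loop(scene, steps):
    message = []
    for _ in range(steps):
        message = speak(scene, message)   # describe what you see
        redraw = listen(message)          # reconstruct from memory
        if redraw == scene:               # check the drawing; stop if done
            break
    return message

print(refine_loop(["red ball", "blue dog"], steps=5))
# → ['red ball', 'blue dog']
```

The loop terminates as soon as the redraw matches the scene, mirroring the idea that refinement continues only while the message is still missing information.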

4. The "Game" of Reconstruction

The training process is like a game of "Telephone" but with a twist.

  • The AI is given a noisy, blurry version of an image.
  • It has to "clean up" the image using the message it built from the tour guide steps.
  • If the message was messy (e.g., "red thing, blue thing, green thing" without knowing what they are), the AI fails to draw the picture correctly.
  • If the message is structured (e.g., "Red ball on the left, Blue dog on the right"), the AI can draw a perfect picture.
  • The AI gets punished (loses points) if the picture it draws doesn't match the original. Over millions of tries, it learns to organize its "words" so they make perfect sense.
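The scoring in this game can be made concrete with a toy denoising round. Everything below is illustrative (the function names and the list-of-labels "image" are invented for this sketch): a corrupted scene must be restored using only the token message, and the mismatch count plays the role of the reconstruction loss.

```python
# Toy version of the reconstruction "game": hide parts of the scene,
# restore them from the message, and score the restoration.

import random

def corrupt(scene, drop_prob, rng):
    """Blur the scene by hiding some objects."""
    return [obj if rng.random() > drop_prob else None for obj in scene]

def reconstruct(noisy, message):
    """Fill the gaps using the message, position by position."""
    return [obj if obj is not None else message.get(i)
            for i, obj in enumerate(noisy)]

def loss(original, drawn):
    """Count mismatches -- the 'punishment' signal during training."""
    return sum(a != b for a, b in zip(original, drawn))

rng = random.Random(0)
scene = ["red ball", "blue dog", "green tree"]
noisy = corrupt(scene, drop_prob=0.5, rng=rng)

# A structured message ("ball on the left, dog on the right, ...")
# pins each object to its place, so the gaps can be filled exactly.
structured = {i: obj for i, obj in enumerate(scene)}
print(loss(scene, reconstruct(noisy, structured)))  # → 0
```

An unstructured message (objects without positions) would leave some gaps filled wrongly and score a nonzero loss, which is exactly the pressure that pushes the tokens toward structure.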

5. Why This Matters: "Object-Centric" Thinking

The biggest win for COMiT is that it naturally learns to group things by objects, not just pixels.

  • Old AI: "Here are 100 pixels that look like fur."
  • COMiT: "Here is a token for 'Dog' and a token for 'Ball'."

Because the AI builds the message step-by-step, it learns that "Dog" is a distinct entity that can be moved around or related to other things. This makes the AI much better at reasoning. If you ask, "Is the dog chasing the ball?", a traditional AI might get confused because it just sees a blur of fur and red. COMiT knows exactly where the dog is and where the ball is, so it can answer the question correctly.

Summary

Think of COMiT as teaching an AI to write a story about an image instead of just taking a compressed photo.

  • It looks at the image in chunks.
  • It builds a mental list of objects.
  • It uses that list to redraw the image.
  • Because it practices this "storytelling" loop, it ends up with a much smarter, more organized understanding of the world, making it better at complex tasks like understanding relationships between objects or generalizing to new scenes.

In short: Old AI tries to shrink the image. COMiT tries to understand the story inside the image.
