UniWeTok: A Unified Binary Tokenizer with Codebook Size $2^{128}$ for Unified Multimodal Large Language Models

UniWeTok is a unified binary tokenizer featuring a massive $2^{128}$ codebook, a convolution-attention hybrid architecture with SigLu activation, and a novel three-stage training framework that achieves state-of-the-art performance in image generation and multimodal understanding with significantly lower computational costs than existing models.

Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang

Published 2026-03-12

Imagine you are trying to teach a super-smart robot (a Large Language Model) to see, understand, and create art. The biggest problem is that computers "see" images as millions of tiny colored dots (pixels). Trying to teach a robot to predict every single dot one by one is like trying to describe a painting by listing the color of every single grain of sand on a beach. It's too slow, too expensive, and the robot gets confused.

To fix this, researchers use a Tokenizer. Think of a tokenizer as a translator that turns a complex image into a short, simple sentence of "code words" (tokens) that the robot can easily understand and predict.

However, existing translators have a problem: they are either great at describing the meaning of the image (good for understanding) but bad at recreating the picture (bad for drawing), or they are great at recreating the picture but lose all the meaning. It's like having a translator who can either write a beautiful poem about a sunset but can't draw the sun, or one who can draw a perfect sun but can't tell you what it feels like to watch it.

UniWeTok is a new translator designed to do both well at once. Here is how it works, using some fun analogies:

1. The Massive Dictionary (The $2^{128}$ Codebook)

Imagine you have a dictionary. Most dictionaries have a few thousand words. UniWeTok has a dictionary so huge it has $2^{128}$ words.

  • The Analogy: If a normal dictionary is a small library, UniWeTok's dictionary is a library the size of the entire internet.
  • Why it matters: Because the dictionary is so big, each "word" (token) can hold a massive amount of information. One single word in UniWeTok can describe a complex texture, a specific face, or a whole scene, whereas other models need hundreds of words to say the same thing. This makes the robot much faster and more efficient.
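The trick behind a $2^{128}$ codebook is that it is never stored as an actual table: a binary tokenizer turns each 128-dimensional latent into a 128-bit sign pattern, so every one of the $2^{128}$ patterns is a valid "word" for free. Here is a minimal sketch of that idea; the function names and the sign-based binarization rule are illustrative assumptions, not UniWeTok's exact procedure:

```python
import numpy as np

def binary_quantize(latent):
    """Map a continuous 128-dim latent to a binary code in {-1, +1}^128.

    The "codebook" is implicit: every one of the 2**128 sign patterns is a
    valid token, so no lookup table of that size is ever stored.
    (Generic binary-quantizer sketch; UniWeTok's details may differ.)
    """
    return np.where(latent >= 0, 1.0, -1.0)

def code_to_int(code):
    """Pack the +/-1 code into an arbitrary-precision integer token id."""
    bits = (code > 0).astype(np.uint8)
    return int("".join(map(str, bits)), 2)

rng = np.random.default_rng(0)
latent = rng.standard_normal(128)   # one continuous latent vector per patch
code = binary_quantize(latent)
token_id = code_to_int(code)
```

Because the token id is just the bit pattern itself, decoding a token back into a vector is equally trivial, which is part of why such a huge vocabulary stays cheap.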

2. The "Pre-Post" Study Session (Pre-Post Distillation)

To make sure the robot understands what it is seeing, the researchers used a "teacher-student" method.

  • The Analogy: Imagine a student (UniWeTok) trying to learn about a painting.
    • Pre-Distillation: Before the student looks at the painting, they listen to an expert art critic describe the vibe and meaning of the piece.
    • Post-Distillation: After the student tries to recreate the painting from memory, the expert checks their work and says, "You got the colors right, but you missed the emotion."
  • The Result: By listening to the expert before and after the task, the student learns to capture both the visual details and the deep meaning simultaneously.
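The "listen before, check after" idea can be sketched as two feature-matching losses against a frozen semantic teacher: one on the encoder features before quantization (pre), one on features re-extracted from the reconstruction (post). The cosine-similarity form, the teacher choice, and all shapes below are assumptions for illustration:

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """1 - mean cosine similarity between student and frozen teacher features."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(s * t, axis=-1))

# Hypothetical setup: 64 patch features of dim 128 from a frozen teacher
# (e.g. a CLIP-style encoder); student features simulated as noisy copies.
rng = np.random.default_rng(1)
teacher = rng.standard_normal((64, 128))
encoder_feats = teacher + 0.1 * rng.standard_normal((64, 128))  # pre-quantization
decoded_feats = teacher + 0.3 * rng.standard_normal((64, 128))  # after reconstruction

pre_loss = cosine_distill_loss(encoder_feats, teacher)    # "listen before"
post_loss = cosine_distill_loss(decoded_feats, teacher)   # "expert checks after"
total_distill = pre_loss + post_loss
```

Training on the sum pressures the tokenizer to keep semantics intact on the way in *and* on the way out, matching the student-and-critic analogy above.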

3. The "Generative" Coach (Generative-Aware Prior)

Usually, models learn to understand images, but they forget how to create them.

  • The Analogy: Imagine a chef who is great at tasting food (understanding) but has never cooked a meal (generating). UniWeTok hires a "Generative Coach" who whispers to the chef during practice: "Hey, remember, you're going to have to cook this later, so keep the ingredients fresh and organized."
  • The Result: The model learns to organize the image data in a way that makes it easy to generate new images later, without sacrificing its ability to understand the current image.
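One common way to implement such a "coach" is an auxiliary loss from a small autoregressive prior that tries to predict each token from the previous ones; if the prior's loss stays low, the codes are easy to generate later. The toy prior below (a fixed linear predictor with binary cross-entropy) is purely an illustrative stand-in, not UniWeTok's actual mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generative_prior_loss(codes, W):
    """Binary cross-entropy of a tiny autoregressive prior that predicts each
    token's bits from the previous token. A low loss means the token stream
    is "organized" enough to be modeled generatively later.
    (Illustrative stand-in; the real coach would be a learned model.)
    """
    prev, nxt = codes[:-1], codes[1:]        # shapes (T-1, 128)
    p = sigmoid(prev @ W)                    # predicted next-bit probabilities
    targets = (nxt > 0).astype(float)
    eps = 1e-9
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

rng = np.random.default_rng(2)
codes = np.where(rng.standard_normal((16, 128)) >= 0, 1.0, -1.0)  # a token sequence
W = 0.01 * rng.standard_normal((128, 128))                        # frozen toy prior
prior_loss = generative_prior_loss(codes, W)
```

Adding a term like `prior_loss` to the tokenizer objective is what "keep the ingredients fresh and organized" looks like in practice: the encoder is rewarded for producing token streams a generator can later predict.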

4. The Hybrid Engine (Convolution-Attention + SigLu)

The brain of UniWeTok is built differently.

  • The Analogy: Think of looking at a city.
    • Convolution is like looking at the bricks and mortar of individual buildings (local details).
    • Attention is like looking at the skyline to see how the whole city fits together (global context).
    • UniWeTok uses both at the same time.
  • The SigLu Activation: This is a special "brake" system. In previous models, trying to learn details and meaning at the same time caused the brain to get confused and crash (optimization conflict). SigLu acts like a smart governor that keeps the engine running smoothly, ensuring the model doesn't get overwhelmed by the massive dictionary.
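The bricks-versus-skyline split can be made concrete with a toy hybrid block: a local branch that mixes each patch with its neighbors (the convolution role) and a global branch where every patch attends to every other, combined with a residual. This is a structural sketch only; the real block uses learned weights and UniWeTok's SigLu activation, which is not reproduced here:

```python
import numpy as np

def local_conv(x, k=3):
    """Local mixing: average each patch with its neighbors ("bricks and mortar")."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])

def global_attention(x):
    """Single-head self-attention: every patch looks at every other ("skyline")."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

def hybrid_block(x):
    """Toy hybrid block: local branch + global branch + residual, in parallel."""
    return x + local_conv(x) + global_attention(x)

rng = np.random.default_rng(3)
patches = rng.standard_normal((64, 32))   # 64 patch features of dim 32
out = hybrid_block(patches)
```

Running both branches on the same features, rather than stacking them in separate stages, is what "uses both at the same time" means above.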

5. The Three-Stage Training (Curriculum Learning)

You don't teach a child to read by starting with a PhD thesis. You start with picture books, then chapter books, then complex novels.

  • Stage 1: The model learns on small, simple images (256x256 pixels).
  • Stage 2: It learns to handle images of different sizes and shapes.
  • Stage 3: It gets "specialized training" on tricky things like human faces and text, which are hard to get right.
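A curriculum like this is often just a staged config the training loop walks through. The schedule below mirrors the three stages described above, but the resolutions, step ordering, and data labels are illustrative placeholders, not the paper's exact numbers:

```python
import numpy as np

# Hypothetical three-stage curriculum (values are illustrative assumptions).
CURRICULUM = [
    {"stage": 1, "resolutions": [(256, 256)],                         "data": "general"},
    {"stage": 2, "resolutions": [(256, 256), (384, 256), (512, 512)], "data": "general"},
    {"stage": 3, "resolutions": [(256, 256), (512, 512)],             "data": "faces+text"},
]

def sample_resolution(stage_cfg, rng):
    """Pick a training resolution from the current stage's allowed set."""
    choices = stage_cfg["resolutions"]
    return choices[rng.integers(len(choices))]

rng = np.random.default_rng(4)
res = sample_resolution(CURRICULUM[1], rng)   # stage 2: mixed sizes and shapes
```

Stage 1 fixes one small resolution, stage 2 samples from several aspect ratios, and stage 3 swaps in the harder specialized data, matching the picture-books-to-novels progression.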

The Big Win

The result is a model that is faster, cheaper, and smarter.

  • Efficiency: It uses 75% fewer tokens (words) to describe an image than other top models.
  • Quality: It can generate images that look almost perfect (beating the previous best models) and understand complex questions about images.
  • Versatility: It can chat about an image, draw a new one from scratch, or edit an existing one (like changing a cat's color to pink) all using the same brain.
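The efficiency claim is worth a quick back-of-envelope check. Assuming a hypothetical baseline of 1024 tokens per image (the source only states the 75% reduction, not the baseline count), shorter sequences compound because autoregressive attention cost grows roughly quadratically with sequence length:

```python
# Back-of-envelope token arithmetic; the baseline count is an assumption.
baseline_tokens = 1024                            # e.g. a 32x32 token grid
uniwetok_tokens = baseline_tokens * (1 - 0.75)    # "75% fewer tokens"

# Attention cost scales ~quadratically with sequence length, so a 4x
# shorter sequence gives roughly a 16x cheaper attention pass.
attention_speedup = (baseline_tokens / uniwetok_tokens) ** 2
```

So even before any model-quality comparison, the token budget alone explains much of the "faster and cheaper" claim.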

In short: UniWeTok is the "Swiss Army Knife" of AI vision. It doesn't just see; it understands the story behind the image and can paint a new one just as well, all while using a fraction of the computer power required by its competitors.