UniWeTok: A Unified Binary Tokenizer with Codebook Size $2^{128}$ for Unified Multimodal Large Language Models

UniWeTok is a unified binary tokenizer featuring a massive $2^{128}$ codebook, a convolution-attention hybrid architecture with SigLu activation, and a novel three-stage training framework that achieves state-of-the-art performance in image generation and multimodal understanding with significantly lower computational costs than existing models.

Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang

Published 2026-03-12

Imagine you are trying to teach a super-smart robot (a Large Language Model) to see, understand, and create art. The biggest problem is that computers "see" images as millions of tiny colored dots (pixels). Trying to teach a robot to predict every single dot one by one is like trying to describe a painting by listing the color of every single grain of sand on a beach. It's too slow, too expensive, and the robot gets confused.

To fix this, researchers use a Tokenizer. Think of a tokenizer as a translator that turns a complex image into a short, simple sentence of "code words" (tokens) that the robot can easily understand and predict.

However, existing translators have a problem: they are either great at describing the meaning of the image (good for understanding) but bad at recreating the picture (bad for drawing), or they are great at recreating the picture but lose all the meaning. It's like having a translator who can either write a beautiful poem about a sunset but can't draw the sun, or one who can draw a perfect sun but can't tell you what it feels like to watch it.

UniWeTok is a new translator designed to do both well at once. Here is how it works, using some fun analogies:

1. The Massive Dictionary (The $2^{128}$ Codebook)

Imagine you have a dictionary. Most dictionaries have a few thousand words. UniWeTok has a dictionary so huge it has $2^{128}$ words.

  • The Analogy: If a normal dictionary is a small library, UniWeTok's dictionary is a library the size of the entire internet.
  • Why it matters: Because the dictionary is so big, each "word" (token) can hold a massive amount of information. One single word in UniWeTok can describe a complex texture, a specific face, or a whole scene, whereas other models need hundreds of words to say the same thing. This makes the robot much faster and more efficient.
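The trick behind a $2^{128}$ codebook is that it is never stored as an actual table: a binary tokenizer turns each 128-dimensional latent into a 128-bit sign pattern, so every one of the $2^{128}$ patterns is a valid "word" for free. Here is a minimal sketch of that idea; the function names and the sign-based binarization rule are illustrative assumptions, not UniWeTok's exact procedure:

```python
import numpy as np

def binary_quantize(latent):
    """Map a continuous 128-dim latent to a binary code in {-1, +1}^128.

    The "codebook" is implicit: every one of the 2**128 sign patterns is a
    valid token, so no lookup table of that size is ever stored.
    (Generic binary-quantizer sketch; UniWeTok's details may differ.)
    """
    return np.where(latent >= 0, 1.0, -1.0)

def code_to_int(code):
    """Pack the +/-1 code into an arbitrary-precision integer token id."""
    bits = (code > 0).astype(np.uint8)
    return int("".join(map(str, bits)), 2)

rng = np.random.default_rng(0)
latent = rng.standard_normal(128)   # one continuous latent vector per patch
code = binary_quantize(latent)
token_id = code_to_int(code)
```

Because the token id is just the bit pattern itself, decoding a token back into a vector is equally trivial, which is part of why such a huge vocabulary stays cheap.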

2. The "Pre-Post" Study Session (Pre-Post Distillation)

To make sure the robot understands what it is seeing, the researchers used a "teacher-student" method.

  • The Analogy: Imagine a student (UniWeTok) trying to learn about a painting.
    • Pre-Distillation: Before the student looks at the painting, they listen to an expert art critic describe the vibe and meaning of the piece.
    • Post-Distillation: After the student tries to recreate the painting from memory, the expert checks their work and says, "You got the colors right, but you missed the emotion."
  • The Result: By listening to the expert before and after the task, the student learns to capture both the visual details and the deep meaning simultaneously.
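The "listen before, check after" idea can be sketched as two feature-matching losses against a frozen semantic teacher: one on the encoder features before quantization (pre), one on features re-extracted from the reconstruction (post). The cosine-similarity form, the teacher choice, and all shapes below are assumptions for illustration:

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """1 - mean cosine similarity between student and frozen teacher features."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(s * t, axis=-1))

# Hypothetical setup: 64 patch features of dim 128 from a frozen teacher
# (e.g. a CLIP-style encoder); student features simulated as noisy copies.
rng = np.random.default_rng(1)
teacher = rng.standard_normal((64, 128))
encoder_feats = teacher + 0.1 * rng.standard_normal((64, 128))  # pre-quantization
decoded_feats = teacher + 0.3 * rng.standard_normal((64, 128))  # after reconstruction

pre_loss = cosine_distill_loss(encoder_feats, teacher)    # "listen before"
post_loss = cosine_distill_loss(decoded_feats, teacher)   # "expert checks after"
total_distill = pre_loss + post_loss
```

Training on the sum pressures the tokenizer to keep semantics intact on the way in *and* on the way out, matching the student-and-critic analogy above.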

3. The "Generative" Coach (Generative-Aware Prior)

Usually, models learn to understand images, but they forget how to create them.

  • The Analogy: Imagine a chef who is great at tasting food (understanding) but has never cooked a meal (generating). UniWeTok hires a "Generative Coach" who whispers to the chef during practice: "Hey, remember, you're going to have to cook this later, so keep the ingredients fresh and organized."
  • The Result: The model learns to organize the image data in a way that makes it easy to generate new images later, without sacrificing its ability to understand the current image.
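One common way to implement such a "coach" is an auxiliary loss from a small autoregressive prior that tries to predict each token from the previous ones; if the prior's loss stays low, the codes are easy to generate later. The toy prior below (a fixed linear predictor with binary cross-entropy) is purely an illustrative stand-in, not UniWeTok's actual mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generative_prior_loss(codes, W):
    """Binary cross-entropy of a tiny autoregressive prior that predicts each
    token's bits from the previous token. A low loss means the token stream
    is "organized" enough to be modeled generatively later.
    (Illustrative stand-in; the real coach would be a learned model.)
    """
    prev, nxt = codes[:-1], codes[1:]        # shapes (T-1, 128)
    p = sigmoid(prev @ W)                    # predicted next-bit probabilities
    targets = (nxt > 0).astype(float)
    eps = 1e-9
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

rng = np.random.default_rng(2)
codes = np.where(rng.standard_normal((16, 128)) >= 0, 1.0, -1.0)  # a token sequence
W = 0.01 * rng.standard_normal((128, 128))                        # frozen toy prior
prior_loss = generative_prior_loss(codes, W)
```

Adding a term like `prior_loss` to the tokenizer objective is what "keep the ingredients fresh and organized" looks like in practice: the encoder is rewarded for producing token streams a generator can later predict.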

4. The Hybrid Engine (Convolution-Attention + SigLu)

The brain of UniWeTok is built differently.

  • The Analogy: Think of looking at a city.
    • Convolution is like looking at the bricks and mortar of individual buildings (local details).
    • Attention is like looking at the skyline to see how the whole city fits together (global context).
    • UniWeTok uses both at the same time.
  • The SigLu Activation: This is a special "brake" system. In previous models, trying to learn details and meaning at the same time caused the brain to get confused and crash (optimization conflict). SigLu acts like a smart governor that keeps the engine running smoothly, ensuring the model doesn't get overwhelmed by the massive dictionary.
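The bricks-versus-skyline split can be made concrete with a toy hybrid block: a local branch that mixes each patch with its neighbors (the convolution role) and a global branch where every patch attends to every other, combined with a residual. This is a structural sketch only; the real block uses learned weights and UniWeTok's SigLu activation, which is not reproduced here:

```python
import numpy as np

def local_conv(x, k=3):
    """Local mixing: average each patch with its neighbors ("bricks and mortar")."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])

def global_attention(x):
    """Single-head self-attention: every patch looks at every other ("skyline")."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

def hybrid_block(x):
    """Toy hybrid block: local branch + global branch + residual, in parallel."""
    return x + local_conv(x) + global_attention(x)

rng = np.random.default_rng(3)
patches = rng.standard_normal((64, 32))   # 64 patch features of dim 32
out = hybrid_block(patches)
```

Running both branches on the same features, rather than stacking them in separate stages, is what "uses both at the same time" means above.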

5. The Three-Stage Training (Curriculum Learning)

You don't teach a child to read by starting with a PhD thesis. You start with picture books, then chapter books, then complex novels.

  • Stage 1: The model learns on small, simple images (256x256 pixels).
  • Stage 2: It learns to handle images of different sizes and shapes.
  • Stage 3: It gets "specialized training" on tricky things like human faces and text, which are hard to get right.
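A curriculum like this is often just a staged config the training loop walks through. The schedule below mirrors the three stages described above, but the resolutions, step ordering, and data labels are illustrative placeholders, not the paper's exact numbers:

```python
import numpy as np

# Hypothetical three-stage curriculum (values are illustrative assumptions).
CURRICULUM = [
    {"stage": 1, "resolutions": [(256, 256)],                         "data": "general"},
    {"stage": 2, "resolutions": [(256, 256), (384, 256), (512, 512)], "data": "general"},
    {"stage": 3, "resolutions": [(256, 256), (512, 512)],             "data": "faces+text"},
]

def sample_resolution(stage_cfg, rng):
    """Pick a training resolution from the current stage's allowed set."""
    choices = stage_cfg["resolutions"]
    return choices[rng.integers(len(choices))]

rng = np.random.default_rng(4)
res = sample_resolution(CURRICULUM[1], rng)   # stage 2: mixed sizes and shapes
```

Stage 1 fixes one small resolution, stage 2 samples from several aspect ratios, and stage 3 swaps in the harder specialized data, matching the picture-books-to-novels progression.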

The Big Win

The result is a model that is faster, cheaper, and smarter.

  • Efficiency: It uses 75% fewer tokens (words) to describe an image than other top models.
  • Quality: It can generate images that look almost perfect (beating the previous best models) and understand complex questions about images.
  • Versatility: It can chat about an image, draw a new one from scratch, or edit an existing one (like changing a cat's color to pink) all using the same brain.
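The efficiency claim is worth a quick back-of-envelope check. Assuming a hypothetical baseline of 1024 tokens per image (the source only states the 75% reduction, not the baseline count), shorter sequences compound because autoregressive attention cost grows roughly quadratically with sequence length:

```python
# Back-of-envelope token arithmetic; the baseline count is an assumption.
baseline_tokens = 1024                            # e.g. a 32x32 token grid
uniwetok_tokens = baseline_tokens * (1 - 0.75)    # "75% fewer tokens"

# Attention cost scales ~quadratically with sequence length, so a 4x
# shorter sequence gives roughly a 16x cheaper attention pass.
attention_speedup = (baseline_tokens / uniwetok_tokens) ** 2
```

So even before any model-quality comparison, the token budget alone explains much of the "faster and cheaper" claim.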

In short: UniWeTok is the "Swiss Army Knife" of AI vision. It doesn't just see; it understands the story behind the image and can paint a new one just as well, all while using a fraction of the computer power required by its competitors.