Imagine you are trying to teach a robot to both understand a picture (like recognizing a cat) and draw a picture from scratch (like sketching a cat with perfect fur details).
For a long time, AI researchers faced a dilemma:
- The "Understanding" Robot: Good at knowing what things are (a cat, a tree, a car) but terrible at drawing them. It sees the "big picture" but misses the tiny details like fur texture or lighting.
- The "Drawing" Robot: Amazing at drawing realistic textures and colors, but it doesn't really "know" what it's drawing. It's like an artist who can paint a perfect face but doesn't know the difference between a human and a monkey.
Trying to force one robot to do both jobs usually results in a compromise where it's mediocre at both.
Enter SemHiTok: The "Bilingual Translator" for Images.
This new paper introduces SemHiTok, a clever system that acts as a universal translator for images, allowing a single AI to both understand and generate images without compromising either skill. Here is how it works, using a few simple analogies:
1. The Problem: The "Blurry Photo" vs. The "Abstract Sketch"
Think of a standard image tokenizer (the tool that turns pictures into code for the AI) as a camera.
- If you set the camera to Semantic Mode (for understanding), it takes a photo where the subjects are clear, but the background is blurry. You know it's a "dog," but you can't see the individual hairs.
- If you set it to Pixel Mode (for drawing), it takes a photo with razor-sharp details, but the AI gets confused about what the object actually is.
Previous attempts to fix this were like taping a high-definition lens onto a blurry one: the two lenses fought each other, and neither job got done well.
2. The Solution: The "Library of Books" Analogy
SemHiTok solves this with a Semantic-Guided Hierarchical Codebook. Let's break that fancy name down:
Imagine a massive library.
- The Main Catalog (Semantic Codebook): This is the top level. It organizes books by broad categories like "Animals," "Vehicles," or "Landscapes." When the AI looks at a picture, it first checks this catalog to say, "Ah, this is a Rooster."
- The Sub-Shelves (Pixel Sub-Codebooks): Here is the magic. Instead of just having one shelf for "Roosters," SemHiTok creates a special, tiny shelf specifically for the "Rooster" category.
- On this specific shelf, it stores only the details relevant to roosters: red combs, specific feather patterns, and yellow beaks.
- If the AI sees a "Car," it goes to the "Car" shelf, which is stocked with details about wheels, metal, and windshields.
Why is this better?
In the old way, the AI had to guess the details of a rooster from a generic "Animal" shelf. With SemHiTok, once the AI knows it's looking at a rooster, it instantly switches to the "Rooster Detail Shelf." It gets the meaning (it's a rooster) and the texture (feathers) without the two ideas getting in each other's way.
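The two-level lookup above can be sketched in plain Python. This is a toy illustration, not the paper's actual implementation: the function names (`hierarchical_tokenize`, `nearest_code`) and the tiny 2-D codebooks are invented for clarity, and real systems use learned high-dimensional codebooks.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))

def hierarchical_tokenize(feature, semantic_codebook, pixel_subcodebooks):
    """Two-level lookup: the semantic code decides WHICH pixel sub-codebook
    is searched for the fine-detail code (hypothetical sketch)."""
    sem_id = nearest_code(feature, semantic_codebook)           # "which shelf?"
    pix_id = nearest_code(feature, pixel_subcodebooks[sem_id])  # "which detail on that shelf?"
    return sem_id, pix_id

# Toy example: two semantic categories, each with its own detail sub-codebook.
semantic_codebook = [[0.0, 0.0], [10.0, 10.0]]   # e.g. "rooster" vs. "car"
pixel_subcodebooks = [
    [[0.0, 0.5], [0.5, 0.0]],                    # rooster details
    [[10.0, 10.5], [10.5, 10.0]],                # car details
]
print(hierarchical_tokenize([0.1, 0.4], semantic_codebook, pixel_subcodebooks))
# → (0, 0): semantic code 0 first, then detail code 0 within that sub-codebook
```

The key point is visible in the routing step: the detail search never touches the other category's sub-codebook, so "rooster texture" and "car texture" codes can't interfere.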
3. The Training: Learning in Stages
The paper also introduces a smart way to teach this system, called Phased Training.
- Step 1: Teach the AI to read the Main Catalog (understand the concepts) perfectly.
- Step 2: Once the concepts are solid, teach the AI to fill in the details on the specific Sub-Shelves (reconstruct the pixels).
This is like teaching a student to write an essay. First, you teach them the outline and main arguments (Semantics). Once they have the structure down, you teach them how to add descriptive adjectives and vivid details (Pixels). If you try to teach them both at the exact same time, they get confused. Doing it in stages makes them a master writer.
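The two stages can be sketched as a toy training loop. This is a heavily simplified, hypothetical version (simple nearest-neighbor pulls in plain Python, invented names like `train_phased`), but it shows the essential rule: in Stage 2 the semantic codebook is frozen and only routes features to the right sub-codebook.

```python
def nearest(vec, codebook):
    """Index of the closest codebook entry (squared Euclidean distance)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d(vec, codebook[i]))

def move_toward(entry, vec, lr):
    """Nudge a codebook entry toward a feature vector."""
    return [e + lr * (v - e) for e, v in zip(entry, vec)]

def train_phased(features, sem_cb, pix_cbs, steps=20, lr=0.3):
    # Stage 1: fit only the semantic codebook (the "main catalog").
    for _ in range(steps):
        for f in features:
            i = nearest(f, sem_cb)
            sem_cb[i] = move_toward(sem_cb[i], f, lr)
    # Stage 2: the semantic codebook is FROZEN; it only routes each
    # feature to its category's pixel sub-codebook (the "sub-shelf"),
    # and only that sub-codebook is updated.
    for _ in range(steps):
        for f in features:
            i = nearest(f, sem_cb)           # frozen routing decision
            j = nearest(f, pix_cbs[i])
            pix_cbs[i][j] = move_toward(pix_cbs[i][j], f, lr)
    return sem_cb, pix_cbs

# Toy data: one cluster of "rooster-like" features, one of "car-like".
features = [[0.0, 0.0], [0.2, 0.2], [10.0, 10.0], [9.8, 9.8]]
sem_cb = [[1.0, 1.0], [9.0, 9.0]]
pix_cbs = [[[0.0, 0.0], [1.0, 1.0]], [[9.0, 9.0], [10.0, 10.0]]]
sem_cb, pix_cbs = train_phased(features, sem_cb, pix_cbs)
```

Because Stage 2 never moves the semantic entries, learning pixel details cannot scramble the concepts learned in Stage 1, which is exactly the "outline first, adjectives second" idea.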
4. The Result: The "Swiss Army Knife" AI
Because of this design, the researchers built a "Unified MLLM" (Multimodal Large Language Model, essentially one giant brain) that uses SemHiTok.
- It can look at a photo and answer complex questions about it (e.g., "Is the dog wearing a red collar?").
- It can read a text description and draw a brand-new image that looks photorealistic.
- It does both without needing two separate brains or doubling the memory.
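One common way a single brain can consume and produce such two-part tokens is to pack each (semantic, pixel) pair into one integer id, so one vocabulary serves both understanding and generation. The packing scheme below is an illustrative assumption, not necessarily how the paper encodes its tokens.

```python
def flatten_token(sem_id, pix_id, sub_size):
    """Pack a (semantic, pixel-detail) pair into a single integer id
    that one language-model head can predict (illustrative scheme)."""
    return sem_id * sub_size + pix_id

def unflatten_token(token, sub_size):
    """Recover the (semantic, pixel-detail) pair from a flat id."""
    return divmod(token, sub_size)

# With sub-codebooks of size 256: category 3, detail 5 → one token id.
print(flatten_token(3, 5, 256))      # → 773
print(unflatten_token(773, 256))     # → (3, 5)
```

The coarse category survives inside every flat id (it's the quotient), so the same token stream carries the "what" needed for answering questions and the "how" needed for drawing.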
In a Nutshell
SemHiTok is like giving the AI a smart index card system.
- It first reads the Title of the card to know what the object is (Semantic).
- Then, it flips to the Back of the card to see the high-definition blueprint of that specific object (Pixel).
By separating the "What" from the "How," but keeping them in the same organized system, SemHiTok allows AI to finally be both a brilliant observer and a master artist at the same time.