LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization

This paper introduces Learnable Geometric Quantization (LGQ), a scalable and stable discrete image tokenizer that learns its discretization geometry end-to-end through differentiable soft assignments and specialized regularizers, achieving superior reconstruction fidelity and balanced codebook utilization compared to existing methods such as FSQ and SimVQ.

Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban

Published 2026-02-23

Imagine you are trying to teach a computer to draw pictures, but instead of giving it a blank canvas and a full box of every color in the universe, you have to give it a limited set of "stamps" or "tokens." The computer has to figure out which stamps to use to recreate the image.

This is the challenge of Image Tokenization. The paper introduces a new method called LGQ (Learnable Geometric Quantization) that solves a major headache in this process: how to use these stamps efficiently without the computer getting confused or lazy.

Here is the breakdown using simple analogies:

The Problem: The "Lazy Librarian" vs. The "Rigid Filing Cabinet"

To understand why LGQ is special, we need to look at the two previous ways computers tried to do this:

  1. The Old Way (Vector Quantization / VQ):

    • The Analogy: Imagine a librarian with a massive shelf of 16,000 unique books (the codebook). When a student asks for a book, the librarian looks at the request and picks the single closest book on the shelf.
    • The Flaw: Over time, the librarian gets lazy. They keep picking the same 50 popular books because they are easy to find. The other 15,950 books gather dust and are never used. This is called "Collapse." The system stops learning new things because it's stuck using the same few tools.
  2. The Rigid Way (FSQ / Scalar Quantization):

    • The Analogy: To fix the laziness, someone built a giant filing cabinet with fixed drawers. Every time a request comes in, the librarian must put a file in a specific drawer, no matter what.
    • The Flaw: This ensures every drawer gets used (no laziness), but the drawers are fixed in a rigid grid. If the "files" (the image data) are shaped like a circle, but the drawers are square, the librarian wastes a lot of space trying to fit round files into square boxes. It's efficient in usage but inefficient in shape.
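The two failure modes above can be sketched in a few lines of plain Python. This is an illustrative toy, not either method's actual implementation; the codebook, the example vectors, and the level count are made-up assumptions:

```python
import math

def vq_assign(z, codebook):
    """Classic VQ: pick the index of the single nearest code vector.
    Nothing updates the codes a query was merely *close* to, which is
    how the 'lazy librarian' collapse starts."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(z, codebook[i]))

def fsq_assign(z, levels=5):
    """FSQ-style scalar quantization: round each dimension to one of
    `levels` fixed, evenly spaced values in [-1, 1]. Every drawer gets
    used, but the grid itself never moves to fit the data."""
    step = 2 / (levels - 1)
    return tuple(round((x + 1) / step) * step - 1 for x in z)

codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
print(vq_assign((0.9, 0.8), codebook))  # nearest code wins: index 1
print(fsq_assign((0.31, -0.77)))        # each coordinate snapped to the fixed grid
```

Note the asymmetry: VQ's codebook is learnable but only one entry at a time gets feedback, while FSQ's grid touches everything but is frozen by construction. LGQ's soft assignments (next section) are an attempt to get the best of both.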

The Solution: LGQ (The "Smart, Adaptable Map")

LGQ is like a librarian who doesn't just pick one book or force a file into a fixed drawer. Instead, they learn to draw a custom map of the library as they go.

Here is how it works:

  • Soft Assignments (The "Warm" Selection):
    Instead of immediately grabbing the one closest book, the librarian initially says, "This request is 60% like Book A, 30% like Book B, and 10% like Book C."

    • Why this helps: This allows the computer to update all those books (A, B, and C) at the same time. It prevents the "lazy librarian" problem because every book gets a little bit of attention during training.
  • Learning the Geometry (The "Shape-Shifting"):
    As the librarian practices, they realize, "Hey, these requests actually look like a circle, not a square!" So, they slowly move the books around on the shelf to match the shape of the requests. They are learning the geometry of the data.

    • The Result: The library layout adapts perfectly to the books people actually want, rather than forcing them into a pre-made grid.
  • The "Straight-Through" Trick:
    During the learning phase, the librarian is flexible (soft). But when it's time to actually send the final order (inference), they snap to a decision and pick the one best book. The magic is that the computer learned how to make that decision by practicing with the flexible, soft method first.

  • The "Popularity" Check (Regularizers):
    The system has two rules to keep things fair:

    1. Be Confident: Don't be too indecisive (don't say 1% for every book). Pick a clear winner.
    2. Be Balanced: Don't let just 50 books get all the work. Spread usage across the whole shelf, while still letting books that genuinely aren't needed fade out.
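Putting the pieces together, here is a minimal pure-Python sketch of the ideas above: soft assignment, the snap-to-one-winner step, and the two fairness rules. The temperature, the squared-distance logits, and the exact loss formulas are illustrative assumptions, not the paper's actual equations:

```python
import math

def soft_assign(z, codebook, temperature=1.0):
    """The 'warm' selection: weight every code by similarity, via a
    softmax over negative squared distances. All codes get a share of
    the gradient, so none of them gather dust."""
    d2 = [sum((x - c) ** 2 for x, c in zip(z, code)) for code in codebook]
    logits = [-d / temperature for d in d2]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_loss(probs):
    """'Be Confident': the entropy of one assignment. High when the
    model is indecisive, zero when it picks a clear winner."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def balance_loss(batch_probs):
    """'Be Balanced': negative entropy of the *average* usage across a
    batch, shifted so a perfectly even shelf scores 0 and a shelf where
    a few codes hog all the work scores higher."""
    k = len(batch_probs[0])
    avg = [sum(col) / len(batch_probs) for col in zip(*batch_probs)]
    return sum(p * math.log(p + 1e-12) for p in avg) + math.log(k)

codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
probs = soft_assign((0.9, 0.8), codebook)
hard_pick = probs.index(max(probs))  # the straight-through snap at inference
```

During training the soft `probs` carry gradients back to every code (letting the geometry shift toward the data), while `hard_pick` is what gets emitted as the final token, which is the straight-through trick in miniature.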

Why Does This Matter?

The paper shows that LGQ is a "Goldilocks" solution:

  • It's not lazy (it uses the codebook efficiently).
  • It's not rigid (it adapts to the shape of the data).
  • The Result: It creates better pictures (lower error rates) while using fewer active stamps than the other methods.

The Big Takeaway:
Previous methods either wasted space (by using too many stamps that didn't fit well) or got stuck using too few stamps (collapsing). LGQ learns the perfect "shape" of the stamp collection for the specific data it's working on. It's like having a set of Lego bricks that can magically reshape themselves to fit the building you are trying to construct, rather than forcing a square peg into a round hole.

In short: LGQ teaches the computer to organize its own vocabulary in the most efficient way possible, leading to sharper images and smarter AI.
