Imagine you are a food critic trying to judge a new restaurant's cooking.
The Old Way (The "FID" Metric):
Currently, most AI image generators are judged by a method called FID (Fréchet Inception Distance). Think of this like a critic who only tastes the main ingredients of a dish but ignores the texture, the plating, and the seasoning.
- How it works: It takes a photo and asks a super-smart computer (trained to recognize cats, dogs, and cars) to describe the "gist" of the image.
- The Problem: Because this computer is trained to ignore small details (like whether a cat's fur is fluffy or spiky) to recognize the animal quickly, it misses the "art" of the image. It might think a blurry, weirdly textured photo of a cat is just as good as a crisp, beautiful one, because to the computer, they both just say "Cat." It's like judging a painting only by its subject matter, ignoring whether the brushstrokes are messy or beautiful.
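Under the hood, the "gist" comparison works like this: embed many real and many generated images with a pretrained classifier, fit a Gaussian to each set of embeddings, and measure the Fréchet distance between the two Gaussians. Here is a minimal sketch of that distance (the standard FID formula, though the embedding model and pipeline details are assumed, not taken from the paper):

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """Frechet distance between Gaussian fits of two feature sets.

    real_feats, fake_feats: (N, D) arrays of "gist" embeddings,
    e.g. from a classifier's penultimate layer.
    """
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # matrix square root of the covariance product; keep the real part
    # to drop tiny imaginary noise from numerical error
    covmean = linalg.sqrtm(cov_r @ cov_f).real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```

Notice that nothing in this formula sees texture or sharpness directly: if the embedding model maps a blurry cat and a crisp cat to nearly the same feature vector, the distance barely moves.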
The New Idea (The "Token" Approach):
The authors of this paper say, "Let's stop looking at the 'gist' and start looking at the 'building blocks'."
- The Analogy: Imagine every image is a sentence written in a secret language. Instead of looking at the meaning of the sentence (the semantic feature), we count the letters and words used.
- The Shift: They use a special tool (a tokenizer) that breaks an image down into a sequence of tiny, discrete "codes" or "tokens." Think of these as LEGO bricks. A perfect image is built with a very specific, predictable pattern of bricks. A bad AI image might have the right types of bricks (a blue one for sky, a green one for grass) but they are glued together in a nonsensical way, or there are too many random, weird bricks mixed in.
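The "LEGO brick" step can be sketched concretely. In a VQ-style tokenizer, each image patch is embedded as a vector, and that vector snaps to the nearest entry in a learned codebook; the index of that entry is the token. The codebook below is random, purely to show the mechanics (a real tokenizer learns it from data):

```python
import numpy as np

def tokenize(patch_embeddings, codebook):
    """Map each patch embedding to the index of its nearest code.

    patch_embeddings: (N, D) array, one row per image patch
    codebook: (K, D) array of code vectors ("brick" types)
    returns: (N,) array of integer token ids in [0, K)
    """
    # squared distance from every patch to every codebook entry
    dists = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 "LEGO brick" types
patches = rng.normal(size=(64, 4))    # an 8x8 grid of image patches
tokens = tokenize(patches, codebook)  # 64 discrete token ids
```

The payoff is that an image becomes a grid of small integers, so "how the bricks are arranged" becomes something you can count and compare directly.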
The Two New Tools:
CHD (The "Grammar Check"):
- What it does: This tool checks if the AI is using the right "vocabulary" and "grammar."
- The Metaphor: Imagine the AI is writing a story. CHD checks two things:
- Vocabulary (1D): Did the AI use the right words? (e.g., did it use "sky" and "tree" instead of "toaster" and "shoe"?)
- Grammar (2D): Did it put the words in the right order? (e.g., "The tree is in the sky" is grammatically wrong, just like a tree floating in the air is visually wrong).
- Why it's cool: It doesn't need to be taught what "good" looks like. It just knows that natural images follow certain statistical patterns, and if the AI breaks those patterns, the score goes down.
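The two checks above can be sketched in toy form. This is NOT the paper's exact CHD formula; it is an illustration under the assumption that "vocabulary" means comparing how often each token id is used (1D histogram) and "grammar" means comparing how often tokens sit next to each other (2D pair histogram), with a simple L1 gap for both:

```python
import numpy as np

def vocab_distance(real_tokens, fake_tokens, num_codes):
    """1D check: do the two sets use codebook entries at similar rates?"""
    h_real = np.bincount(real_tokens.ravel(), minlength=num_codes)
    h_fake = np.bincount(fake_tokens.ravel(), minlength=num_codes)
    h_real = h_real / h_real.sum()
    h_fake = h_fake / h_fake.sum()
    return float(np.abs(h_real - h_fake).sum())

def grammar_distance(real_grid, fake_grid, num_codes):
    """2D check: do neighbouring tokens pair up the same way?

    real_grid, fake_grid: (H, W) integer token grids.
    """
    def pair_hist(grid):
        # encode each left-right neighbour pair as a single integer
        pairs = grid[:, :-1] * num_codes + grid[:, 1:]
        h = np.bincount(pairs.ravel(), minlength=num_codes * num_codes)
        return h / h.sum()
    return float(np.abs(pair_hist(real_grid) - pair_hist(fake_grid)).sum())
```

The toy version already captures the key distinction: shuffling a token grid leaves the vocabulary distance at zero (same bricks) but sends the grammar distance up (nonsensical arrangement), which is exactly the "tree floating in the sky" failure mode.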
CMMS (The "Stress Test"):
- What it does: This tool judges a single image to see how "broken" it is, without needing a perfect original to compare it to.
- The Metaphor: Imagine a teacher who wants to test a student's ability to spot errors. Instead of showing the student a perfect essay, the teacher takes a perfect essay and intentionally messes it up (scrambles words, adds typos, blurs sentences). The teacher then trains an AI to look at these messed-up versions and say, "This one is 80% bad, this one is 20% bad."
- The Result: Once trained, this AI can look at a brand new image generated by a robot and instantly say, "This looks 90% human-made," or "This looks like a glitchy mess," just by spotting the "typos" in the visual code.
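The self-supervised recipe behind this can be sketched end to end, with loudly assumed details (token-level corruption, a rarity statistic, a one-feature least-squares fit; the paper's actual CMMS model is surely richer): corrupt clean token sequences by a known amount, then fit a predictor that recovers that amount from the corrupted sequence alone.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CODES = 32

def corrupt(tokens, severity):
    """Replace a `severity` fraction of tokens with random codes."""
    out = tokens.copy()
    n_bad = int(len(out) * severity)
    idx = rng.choice(len(out), size=n_bad, replace=False)
    out[idx] = rng.integers(0, NUM_CODES, size=n_bad)
    return out

def rarity(tokens, clean_hist):
    """Feature: average 'surprise' of each token under clean statistics."""
    probs = clean_hist[tokens]
    return float(-np.log(probs + 1e-9).mean())

# Build a labelled training set from clean data alone: the "teacher"
# messes up perfect essays by a known amount. Clean sequences here
# only use codes 0-7, so injected codes 8-31 look like visual typos.
clean = rng.integers(0, 8, size=(200, 64))
clean_hist = np.bincount(clean.ravel(), minlength=NUM_CODES) / clean.size

X, y = [], []
for seq in clean:
    sev = rng.uniform(0, 0.9)
    X.append(rarity(corrupt(seq, sev), clean_hist))
    y.append(sev)

# severity ~ a * rarity + b, fit by least squares
a, b = np.polyfit(X, y, 1)
```

Once fitted, the predictor scores any new token sequence with no reference image in sight: rare, out-of-place tokens push the estimated "brokenness" up, which mirrors the "spot the typos" idea above.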
The Big Test (VisForm):
To prove their tools work, the authors didn't just test on photos of dogs and cats. They built a massive library called VisForm.
- The Analogy: Instead of just testing the critic on Italian food, they fed the critic sushi, pizza, tacos, and abstract art.
- The Scale: They tested 210,000 images across 62 different styles (from medical diagrams to anime to oil paintings) and 12 different AI models.
- The Result: Their new "LEGO counting" method matched human opinions much better than the old "ingredient tasting" method, even for weird or artistic images where the old methods failed.
In Summary:
The old way of judging AI art was like judging a book by its title alone. This new paper suggests we should read the sentences, check the grammar, and count the letters. By looking at the tiny, discrete building blocks of an image rather than its high-level "meaning," they created a ruler that actually measures what humans care about: quality, texture, and coherence.