TIQA: Human-Aligned Text Quality Assessment in Generated Images

Imagine you've asked a magical artist (an AI) to paint a picture of a "No Parking" sign. The artist does a fantastic job: the sky is blue, the grass is green, and the pole looks sturdy. But when you look closely at the sign itself, the letters are melting, the "P" looks like a "D," and the lines are wobbly.

If you asked a standard robot to check the picture, it might say, "Great job! I can read 'No Parking'!" because it only cares if the meaning is correct. But if you asked a human, they'd say, "That looks terrible! The letters are broken."

This paper introduces a new tool called TIQA (Text-in-Image Quality Assessment) to solve exactly this problem. Here is the breakdown in simple terms:

1. The Problem: The "Magic Spell" vs. The "Human Eye"

Current AI image generators are getting amazing at creating realistic photos. However, they still struggle with writing text. They often produce "glyphs" (letters) that look like they were drawn by a toddler with a shaky hand.

The Old Way: To check if the text is good, people used OCR (Optical Character Recognition). Think of OCR as a strict librarian who only cares if the book title is spelled correctly. If the librarian can read "Hello," they give it a passing grade, even if the letters are dripping with paint.
The New Way (VLMs): People also tried using giant AI chatbots (like GPT-4) to look at the picture and grade it. But these chatbots are like a picky art critic who gets confused if you ask them slightly different questions. Their answers change depending on how you phrase your request, making them unreliable for consistent grading.

2. The Solution: The "Text Quality Judge" (TIQA)

The authors created a new job description for an AI: TIQA.
Instead of asking, "Can you read this?" TIQA asks, "Does this text look natural and well-drawn?"

The Analogy: Imagine a Food Critic vs. a Nutritionist.
- The Nutritionist (OCR) checks if the burger has a bun, meat, and lettuce. If yes, it's a "pass."
- The Food Critic (TIQA) looks at the burger and says, "The bun is burnt, the meat is raw, and the lettuce is wilted. Even though it's technically a burger, it's a bad one."
- TIQA is the Food Critic for text. It ignores what the words say (semantics) and focuses entirely on how they look (artifacts, broken strokes, weird spacing).

3. The Training: Teaching the Judge

To teach this new AI (called ANTIQA), the researchers didn't just use robots. They used humans.

They created a massive library of 120,000 tiny text snippets from AI images.
They hired thousands of real people to look at 10,000 of these snippets and rate each one on a scale of 0 to 5 (0 = gibberish, 5 = perfect).
They taught the ANTIQA model to mimic these human ratings.

4. The Secret Sauce: How ANTIQA Works

The ANTIQA model is special because it was built specifically to look at text, not just general pictures.

The Metaphor: Imagine a standard image checker is like a General Practitioner who checks your whole body. ANTIQA is like a Dermatologist who only looks at your skin.
The model uses a special "lens" (called strip convolutions) that is very good at seeing long, thin lines (like the strokes of a letter "I" or "l"). It knows that text has a specific structure, and it looks for breaks in that structure that a normal image checker would miss.

5. Why This Matters: The "Best of 5" Filter

The most practical use of this tool is in the "production line" of AI image generation.

The Scenario: You ask an AI to generate 5 images of a "Coffee Shop Menu."
Without TIQA: You might pick the one with the best lighting, but the text on the menu is unreadable gibberish.
With TIQA: The system automatically scans all 5 images, checks the text quality, and picks the one where the menu text looks the most real.
The Result: The paper shows that using this filter improves human-rated text quality by 14% compared with picking one of the five images at random. It's like having a quality control inspector who lets only the best-looking text through.

Summary

The Issue: AI is great at drawing pictures but bad at drawing readable text.
The Gap: Old tools only checked if the text was readable, not if it looked good.
The Fix: A new AI (ANTIQA) trained to spot "ugly text" (broken lines, weird shapes) just like a human would.
The Benefit: It helps filter out bad AI images automatically, ensuring that when we see text in AI art, it actually looks like text and not a glitchy mess.

In short, TIQA is like a spell-checker for how text looks rather than what it says: it checks that when AI writes, the letters themselves are cleanly drawn, whatever the words happen to be.

Here is a detailed technical summary of the paper "TIQA: Human-Aligned Text Quality Assessment in Generated Images."

1. Problem Statement

Despite rapid advancements in Text-to-Image (T2I) models, text rendering remains a persistent failure mode. Generated images often contain semantic errors (misspellings) and, more critically, perceptual artifacts such as malformed glyphs, broken strokes, inconsistent thickness, unstable kerning, and baseline jitter.

Existing evaluation methods fail to adequately address these issues:

OCR-based metrics: Focus on semantic correctness (decodability) and require ground truth. They often fail to penalize visual defects (e.g., a readable word with broken strokes) that humans find unacceptable.
Vision-Language Models (VLMs): While capable of reasoning, they suffer from instability due to prompt sensitivity, version drift, and high computational costs. They often struggle to isolate fine-grained text artifacts from the rest of the image content.
Generic Image Quality Assessment (IQA): Standard no-reference IQA metrics (e.g., BRISQUE, NIMA) are designed for natural image distortions (blur, noise) and do not capture typographic-specific failure modes.

The Gap: No dedicated, efficient, human-aligned metric exists for assessing the perceptual fidelity of rendered text independently of its semantic meaning.

2. Methodology

A. Task Definition: TIQA

The authors introduce Text-in-Image Quality Assessment (TIQA), a specialized no-reference task.

Input: A cropped image region containing rendered text (a "text crop").
Output: A scalar quality score (0–5) predicting the Mean Opinion Score (MOS) based on human judgments of visual artifacts.
Key Constraint: The score must reflect how the text is rendered (glyph integrity, stroke continuity, spacing), not what the text says. A misspelled but visually perfect word should receive a high score; a correctly spelled word with broken strokes should receive a low score.

B. Datasets

To train and benchmark TIQA, the authors released two datasets:

TIQA-Crops:
- Scale: 120,000 text crops extracted from 36,000 images generated by 12 T2I models.
- Labels: 10,000 crops have human MOS labels (0–5 scale). 110,000 crops are labeled with OCR confidence scores (used for proxy pretraining).
- Annotation: Human raters were instructed to ignore semantics and focus solely on visual artifacts (glyph topology, stroke breaks, etc.).
TIQA-Images:
- Scale: 1,500 full-frame, text-heavy images from 10 modern generators (including proprietary models like GPT Image 1.5 and Nano Banana Pro).
- Labels: Paired MOS annotations for Overall Quality (OQ-MOS) and Text-Only Quality (TQ-MOS).
- Unique Feature: Each image includes a "text-only" view where non-text content is masked out to isolate text quality perception.

C. Proposed Model: ANTIQA

The authors propose ANTIQA, a lightweight, text-biased neural network designed to predict MOS.

Architecture: A compact multi-scale CNN regressor (3.8M parameters).
Input: A 2-channel image: Grayscale + Sobel edge map (to emphasize stroke boundaries).
Key Design Choices:
- Strip Convolutions: Uses directional $1\times k$ and $k\times 1$ residual blocks to capture the anisotropic nature of text strokes and baselines.
- SE Blocks: Squeeze-and-Excitation modules for channel recalibration to handle font/style variations.
- Multi-scale Pooling: Adaptive Average and Max Pooling (APB) at three resolutions to capture both local glyph details and global word structure.
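
The input format and the idea behind strip filters can be illustrated with a small sketch. It is not the authors' implementation: the function names (`sobel_magnitude`, `two_channel_input`, `strip_mean`) are hypothetical, the zero padding is an assumption, and a fixed averaging window stands in for the learned $1\times k$ and $k\times 1$ kernels purely to show why directional windows respond differently to a stroke.

```python
# Illustrative sketch only: pure-Python stand-ins for the 2-channel input and
# for a single directional (strip) filter. ANTIQA's real kernels are learned.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray):
    """Sobel gradient magnitude of a 2-D list of floats (zero padding assumed)."""
    h, w = len(gray), len(gray[0])

    def px(r, c):
        return gray[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = sum(SOBEL_X[i][j] * px(r + i - 1, c + j - 1)
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * px(r + i - 1, c + j - 1)
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


def two_channel_input(gray):
    """Stack grayscale and edge map as [channel][row][col]."""
    return [gray, sobel_magnitude(gray)]


def strip_mean(gray, k, horizontal):
    """Average over a 1 x k (horizontal) or k x 1 (vertical) window; k is odd."""
    h, w = len(gray), len(gray[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for d in range(-half, half + 1):
                rr, cc = (r, c + d) if horizontal else (r + d, c)
                if 0 <= rr < h and 0 <= cc < w:
                    total += gray[rr][cc]
            out[r][c] = total / k
    return out


# Toy crop: a single vertical stroke (like the stem of an "l") in column 2.
stroke = [[1.0 if c == 2 else 0.0 for c in range(5)] for _ in range(5)]
x = two_channel_input(stroke)
along = strip_mean(stroke, k=3, horizontal=False)   # window runs along the stroke
across = strip_mean(stroke, k=3, horizontal=True)   # window runs across the stroke
```

On this toy stroke the vertical window sees the stroke along its whole length, while the horizontal window mostly sees background, which is the kind of direction-dependent response that strip kernels are meant to exploit.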
Training Strategy:
1. Pretraining: Trained on 110k crops using OCR confidence scores mapped to the MOS scale via Neural Optimal Transport. This exploits the large pool of crops that have no human MOS labels.
2. Fine-tuning: Trained on the 10k MOS-labeled crops using a mixed loss function: Mean Squared Error (MSE) + Pairwise Ranking Loss (to ensure correct relative ordering).
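
The fine-tuning objective can be sketched as follows. The summary only states that MSE is combined with a pairwise ranking loss, so the hinge form of the ranking term, the `margin` and `rank_weight` parameters, and all function names here are assumptions made for illustration.

```python
from itertools import combinations


def mse_loss(pred, target):
    """Mean squared error between predicted scores and MOS labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred, target, margin=0.0):
    """Hinge penalty for pairs whose predicted order contradicts the MOS order.

    Assumed form: max(0, margin - sign * (pred_i - pred_j)), averaged over
    all pairs with different MOS labels.
    """
    losses = []
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if target[i] > target[j] else -1.0
        losses.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0


def mixed_loss(pred, target, rank_weight=1.0, margin=0.0):
    """MSE plus a weighted ranking term (the weighting is hypothetical)."""
    return mse_loss(pred, target) + rank_weight * pairwise_ranking_loss(pred, target, margin)
```

The MSE term pulls each prediction toward its MOS value, while the ranking term adds a penalty only when two crops are ordered the wrong way round, which is what matters for selection and reranking.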

3. Key Contributions

New Task Formulation: Defined TIQA as a distinct task focusing on perceptual text artifacts rather than semantic correctness.
Benchmarks: Released TIQA-Crops and TIQA-Images, the first large-scale datasets specifically for human-aligned text quality assessment in AI generation.
Method (ANTIQA): Developed a specialized model that significantly outperforms OCR confidence, VLM judges, and generic IQA metrics.
Downstream Utility: Demonstrated that TIQA scores are effective for Best-of-K selection, filtering, and reranking in generation pipelines.

4. Experimental Results

Performance on TIQA-Crops (Crop-level)

ANTIQA achieved a PLCC of 0.942 and SROCC of 0.935, significantly outperforming:
- OCR Confidence (PaddleOCR): PLCC 0.778.
- VLM Judges (Qwen3): PLCC 0.891.
- Generic IQA (TOPIQ): PLCC 0.401.
Insight: Recognizing text (OCR) is insufficient; the model must capture visual degradations specific to glyphs.
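
For readers unfamiliar with the two correlation measures reported above, here is a minimal standard-library sketch. PLCC is the Pearson correlation between predicted scores and MOS; SROCC is the Pearson correlation of their ranks. Whether the paper applies a nonlinear fitting step before computing PLCC is not stated in this summary, so none is applied here, and the function names are illustrative.

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: predictions that are monotone in MOS but not linear in it.
mos = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 4.0, 9.0, 16.0]
plcc = pearson(pred, mos)    # below 1: the relationship is not linear
srocc = spearman(pred, mos)  # equals 1 up to rounding: the ordering is perfect
```

The toy example shows why both numbers are reported: a metric can rank crops perfectly (high SROCC) while still being miscalibrated on the 0-5 scale (lower PLCC).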

Performance on TIQA-Images (Image-level)

ANTIQA achieved the highest correlation with human scores for both Text-Only MOS (PLCC 0.842) and Overall MOS (PLCC 0.810).
Generalization: The model generalized well to SOTA generators (e.g., GPT Image 1.5) that were not seen during training.
VLM Limitations: VLMs performed poorly on full images (PLCC ~0.47) because text occupies a small fraction of pixels, diluting their sensitivity to glyph-level artifacts. Their performance improved only when restricted to text-only crops.

Downstream Application: Best-of-K Selection

The authors tested the ability to select the best image from 5 generations (Best-of-5) for a fixed prompt and generator.

ANTIQA improved human-rated text quality by +14% (MOS +0.36) and overall quality by +9.9% (MOS +0.30) compared to random selection.
It closed 72% of the gap to an Oracle selector for text quality, outperforming OCR and VLMs significantly.
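
The selection procedure and the "gap closed" figure can be made concrete with a small sketch. The data below is invented for illustration, the helper names are hypothetical, and the gap definition (gain over random divided by the oracle's gain over random) is an assumption about how such a number is typically computed, not a formula quoted from the paper.

```python
from statistics import mean


def best_of_k(candidates, score):
    """Return the candidate with the highest metric score."""
    return max(candidates, key=score)


def gap_closed(selected_mos, random_mos, oracle_mos):
    """Fraction of the random-to-oracle MOS gap recovered by the selector."""
    return (selected_mos - random_mos) / (oracle_mos - random_mos)


# Hypothetical Best-of-5 group: made-up metric scores and human MOS values.
group = [
    {"id": "a", "metric": 1.9, "mos": 2.0},
    {"id": "b", "metric": 3.0, "mos": 3.0},
    {"id": "c", "metric": 3.2, "mos": 4.0},
    {"id": "d", "metric": 2.2, "mos": 2.5},
    {"id": "e", "metric": 3.6, "mos": 3.5},
]

picked = best_of_k(group, score=lambda c: c["metric"])
random_mos = mean(c["mos"] for c in group)  # expected MOS of a uniform random pick
oracle_mos = max(c["mos"] for c in group)   # best achievable if human MOS were known
print(picked["id"], gap_closed(picked["mos"], random_mos, oracle_mos))  # prints: e 0.5
```

In this toy group the metric picks a good but not optimal image, recovering half of the available gain; a reported figure would average this kind of quantity over many prompts and generators.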

Analysis of Modern T2I Models

Using TIQA-Images, the authors found that while newer models (e.g., Flux 2, Nano Banana Pro) show clear improvements over older baselines (SDXL), text rendering remains a bottleneck.
There is a strong coupling between overall image quality and text quality, but visual plausibility often overstates text fidelity (images look good, but text is degraded).

5. Significance and Impact

Practical Value: TIQA provides a scalable, low-latency signal for production pipelines. It enables automated filtering of low-quality generations and reranking without relying on expensive VLMs or ground-truth text.
Research Direction: Highlights that current T2I models still struggle with the "tail" of text rendering failures (rare but severe artifacts), suggesting a need for closed-loop training guidance using TIQA signals.
Human Alignment: By decoupling visual quality from semantic correctness, TIQA offers a more accurate reflection of human perception in text-heavy AI outputs (e.g., posters, UI mockups, documents).

In conclusion, TIQA and ANTIQA represent a critical step forward in evaluating and improving the text rendering capabilities of generative AI, moving beyond simple decodability to assess the visual fidelity of rendered typography.