TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Imagine you are a master chef trying to bake the perfect loaf of bread with the words "Fresh Bread" written in icing on top. You want the letters to be perfectly shaped, clear, and readable.

For a long time, the tools we used to judge if the bread was good were like blind taste-testers. They would look at the icing, guess what it probably says based on context, and say, "Ah, yes, that looks like 'Fresh Bread'!" even if the "R" was missing a leg or the "e" was squashed into a blob. They cared about the meaning but ignored the shape.

This paper, TextPecker, introduces a new kind of judge: a structural detective.

Here is the story of how TextPecker fixes the problem, explained simply:

1. The Problem: The "Hallucinating" Judges

In the world of AI image generation, creating images with text is incredibly hard. AI often writes words that look like gibberish, have missing parts, or are distorted.

To fix this, researchers use a "teacher" (a reward system) to tell the AI when it does a good job. Until now, these teachers were OCR models (software that reads text) or Large Language Models (AI chatbots).

The Flaw: These teachers are too smart for their own good. If they see a blurry, distorted letter "A," they don't say, "Hey, that's a broken A!" Instead, they use their brain to guess, "Oh, the user probably meant 'A', so I'll just read it as 'A'."

The Result: The AI gets a "Good Job!" sticker for a terrible, broken letter because the teacher ignored the mistake. The AI never learns to fix the shape of the letters.

2. The Solution: TextPecker (The "Peanut Picker")

The authors created TextPecker. Think of it as a specialized inspector who doesn't care about the meaning of the sentence, but cares deeply about the integrity of every single stroke of the letters.

The Analogy: Imagine a peanut picker on an assembly line. Their job isn't to taste the peanut; their job is to spot the ones that are cracked, crushed, or missing a shell. TextPecker does exactly this for text. It looks at every pixel and asks: "Is this stroke connected? Is this line straight? Is this letter missing a piece?"

3. How They Taught the Inspector

To train this new inspector, the researchers had to build a massive library of "broken text."

Real Breakage: They took thousands of images from different AI generators and had humans mark exactly where the letters were broken.
Fake Breakage (The Secret Sauce): They built a "Lego engine" for Chinese characters. Since Chinese characters are made of many small strokes (like building blocks), they programmed a robot to randomly delete, swap, or add strokes to create thousands of unique, broken characters. This taught the inspector to recognize any kind of breakage, not just the ones it had seen before.

4. The Magic Reward System

TextPecker combines two scores to give the AI a "Report Card":

The Meaning Score: Does the text say what we asked for? (e.g., "Does it say 'Bread'?")
The Structure Score: Is the text physically perfect? (e.g., "Is the 'B' not squished? Is the 'd' not missing a loop?")

If the AI writes "Bread" but the 'e' is a blob, the old teacher gave it a 10/10. TextPecker gives it a 6/10 because of the broken 'e'. This forces the AI to stop guessing and start drawing the letters correctly.

5. The Results: From "Good Enough" to "Perfect"

When they used TextPecker to train top-tier AI models (like Qwen-Image and Flux):

Before: The AI wrote text that looked okay from a distance but was a mess up close.
After: The AI started writing text that was crisp, aligned, and structurally perfect, even for complex Chinese characters.

The Big Picture

TextPecker is like giving the AI a pair of glasses that lets it see the structure of things, not just the idea of them. It solved a major bottleneck where AI was "hallucinating" perfect text from broken images. Now, we can finally generate images with text that is not just readable, but beautifully constructed.

In short: TextPecker stopped the AI from cheating by guessing the answers and forced it to actually learn how to write.

1. Problem Statement

Visual Text Rendering (VTR) in text-to-image generation remains a significant challenge. Even state-of-the-art models (e.g., Flux, Qwen-Image, Seedream) frequently produce text with structural anomalies, including:

Distortion: Warped or stretched characters.
Blurriness: Indistinct strokes.
Misalignment: Characters not fitting the intended layout.
Missing/Extra Strokes: Glyph-level defects that render characters unrecognizable.

The Critical Bottleneck:
Current evaluation and optimization methods rely on OCR models (e.g., PPOCR) or Multimodal Large Language Models (MLLMs) (e.g., GPT-5, Qwen-VL) to recognize generated text and compute rewards (typically based on edit distance). The authors identify a fundamental flaw in this paradigm:

Semantic Over-reliance: These models prioritize semantic recovery over glyph integrity. They often "hallucinate" corrections for structurally flawed text or ignore low-confidence distorted regions to maintain semantic coherence.
Reward Noise: Consequently, they fail to penalize structural errors, providing misleadingly high reward scores to flawed generations. This prevents Reinforcement Learning (RL) from effectively optimizing for structural fidelity.

2. Methodology: TextPecker

The authors propose TextPecker, a plug-and-play RL strategy that quantifies structural anomalies in rendered text and rewards generations for structural fidelity.

A. Structure-Aware Reward Function

Instead of relying on raw OCR accuracy, TextPecker introduces a composite reward ($R$) that jointly optimizes Semantic Alignment and Structural Fidelity:

Structural Quality Score ($SQ$):
- Derived from the proportion of characters flagged as structurally anomalous (e.g., missing strokes, extra artifacts); a higher score means fewer anomalies.
- Uses a scaling factor ($\omega > 1$) so that even a small fraction of critical structural failures is penalized heavily.
- Formula: $SQ = \text{clip}\left(1 - \omega \frac{N_a}{N_P},\ 0,\ 1\right)$, where $N_a$ is the count of anomalous characters and $N_P$ is the total character count.
Semantic Alignment Score ($SE$):
- Uses Pairwise Normalized Edit Distance (PNED) with Hungarian matching to handle word order mismatches.
- Penalizes unmatched words (extraneous or missing) to ensure semantic consistency.
Composite Reward: $R = w_E \cdot SE + w_Q \cdot SQ$, where $w_E$ and $w_Q$ weight the semantic and structural terms.
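The three quantities above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the value `omega=2.0`, the equal weights, the choice to score an empty prediction as `SQ = 1`, and the exact form of PNED (Levenshtein distance divided by the longer word length, with each unmatched word costing 1) are all assumptions, since the summary does not specify them. The optimal word matching is found here by brute force over permutations as a stand-in for Hungarian matching; both return the minimum-cost assignment, but brute force is only practical for a handful of words.

```python
from itertools import permutations


def structural_quality(n_anomalous: int, n_total: int, omega: float = 2.0) -> float:
    """SQ = clip(1 - omega * N_a / N_P, 0, 1). omega=2.0 is an assumed value."""
    if n_total == 0:
        return 1.0  # assumption: nothing recognized, so nothing to penalize
    return min(1.0, max(0.0, 1.0 - omega * n_anomalous / n_total))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer word (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def semantic_alignment(target_words, predicted_words) -> float:
    """SE as 1 minus the mean cost of the best one-to-one word matching.

    Missing or extraneous words are padded with None and cost 1.0 each.
    Brute force stands in for Hungarian matching (same optimum, small inputs only).
    """
    n = max(len(target_words), len(predicted_words))
    if n == 0:
        return 1.0
    targets = list(target_words) + [None] * (n - len(target_words))
    preds = list(predicted_words) + [None] * (n - len(predicted_words))

    def cost(t, p):
        if t is None or p is None:
            return 1.0  # unmatched word
        return normalized_edit_distance(t, p)

    best = min(
        sum(cost(targets[i], preds[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return 1.0 - best / n


def composite_reward(se: float, sq: float, w_e: float = 0.5, w_q: float = 0.5) -> float:
    """R = w_E * SE + w_Q * SQ. The equal weights are assumed for illustration."""
    return w_e * se + w_q * sq
```

With these assumed settings, a five-character word rendered with one anomalous character gets `structural_quality(1, 5)` equal to 0.6, so a generation that is semantically perfect (`SE = 1`) still loses reward for the single broken glyph. For longer texts, a dedicated assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation search.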

B. Data Construction Pipeline

To train the "structure-aware recognizer" (the assessor used to calculate rewards), the authors constructed a large-scale dataset addressing the scarcity of fine-grained structural annotations:

Text-Rich Image Generation: Generated images using diverse models (Flux, SD3.5, Qwen-Image, etc.) covering English and Chinese.
Human Annotation: Annotators manually identified and marked fine-grained structural flaws at the character level using special markers (e.g., <#> for flawed characters).
Synthetic Data Augmentation:
- To overcome the combinatorial explosion of Chinese character anomalies, they developed a stroke-editing synthesis engine.
- This engine programmatically applies stroke-level operations (Deletion, Swapping, Insertion) to canonical characters to generate diverse structural errors, ensuring robust training for the recognizer.
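The stroke-level operations can be illustrated with a purely symbolic sketch. Everything here is hypothetical: the stroke vocabulary `STROKE_TYPES`, the representation of a character as a list of stroke names, and the reading of "Swapping" as replacing one stroke with a different stroke type are assumptions made for illustration. The engine described in the paper operates on actual glyph geometry and renders images, which this sketch does not attempt.

```python
import random

# Hypothetical stroke vocabulary (pinyin names of basic stroke types).
STROKE_TYPES = ["heng", "shu", "pie", "na", "dian", "zhe"]


def delete_stroke(strokes, rng):
    """Remove one randomly chosen stroke (keeps at least one stroke)."""
    if len(strokes) <= 1:
        return list(strokes)
    i = rng.randrange(len(strokes))
    return strokes[:i] + strokes[i + 1:]


def swap_stroke(strokes, rng):
    """Replace one randomly chosen stroke with a different stroke type."""
    if not strokes:
        return []
    i = rng.randrange(len(strokes))
    choices = [s for s in STROKE_TYPES if s != strokes[i]]
    return strokes[:i] + [rng.choice(choices)] + strokes[i + 1:]


def insert_stroke(strokes, rng):
    """Insert one random stroke at a random position."""
    i = rng.randrange(len(strokes) + 1)
    return strokes[:i] + [rng.choice(STROKE_TYPES)] + strokes[i:]


def corrupt(strokes, rng):
    """Apply one randomly chosen operation; return (operation name, new strokes)."""
    op = rng.choice([delete_stroke, swap_stroke, insert_stroke])
    return op.__name__, op(list(strokes), rng)


if __name__ == "__main__":
    rng = random.Random(0)            # seeded for reproducibility
    canonical = ["heng", "shu", "heng"]  # hypothetical stroke sequence
    for _ in range(3):
        print(corrupt(canonical, rng))
```

Because each operation returns a new list, the canonical stroke sequence is never mutated, so many distinct corrupted variants can be drawn from the same character.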

C. RL Optimization Framework

Algorithm: The method utilizes Flow-GRPO (Group Relative Policy Optimization adapted for Flow Matching models).
Process:
1. Sample $G$ candidate images from the policy model.
2. Pass images through the TextPecker Recognizer to extract text with structural anomaly markers.
3. Compute the composite reward ($R$) for each sample.
4. Normalize rewards within the group to calculate advantages and update the policy.
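Step 4 can be sketched as follows, assuming the standard GRPO-style normalization in which each reward is centered on the group mean and divided by the group standard deviation. The use of the population standard deviation and the stabilizer `eps` are assumptions; the policy update itself (the Flow-GRPO objective) is omitted.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    eps (assumed value) avoids division by zero when all rewards are equal.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Composite rewards R for G = 4 candidate images from the same prompt.
    print(group_advantages([0.2, 0.5, 0.8, 0.5]))
```

Samples scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. If every candidate in a group receives the same reward, all advantages are zero and the group contributes no learning signal.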

3. Key Contributions

Identification of a Bottleneck: Demonstrated that leading OCR models and MLLMs are "structure-blind," failing to perceive fine-grained glyph defects, which hinders VTR optimization.
TextPecker Framework: A plug-and-play RL strategy that replaces noisy OCR rewards with a perception-guided composite reward, enabling joint optimization of semantics and structure.
Large-Scale Anomaly Dataset: Created a dataset of 1.4M samples with character-level structural anomaly annotations, augmented by a novel stroke-editing synthesis engine to cover diverse error types (especially for Chinese).
New State-of-the-Art: Set a new state of the art for high-fidelity VTR, significantly outperforming existing methods even on highly optimized models like Qwen-Image.

4. Experimental Results

The method was evaluated on multiple benchmarks (OneIG-Bench, CVTG-2K, LongText-Bench, and the new GenTextEval) across English and Chinese.

Evaluator Performance (TSAP & CTR):
- TextPecker's recognizer achieved F1 scores of roughly 87-92% on the Text Structural Anomaly Perception (TSAP) task, far above baseline MLLMs (e.g., Qwen3-VL, GPT-5) and OCR models, which scored near 0% on detecting structural anomalies.
VTR Optimization Gains:
- Flux.1[dev]: Improved Semantic Alignment by +38.3% and Structural Quality by +31.6% over the base model.
- Qwen-Image (Highly Optimized): Even on this strong baseline, TextPecker yielded significant gains:
  - Chinese Rendering: +8.7% in Semantic Alignment and +4.0% in Structural Fidelity.
  - English Rendering: Consistent improvements across all benchmarks.
Qualitative Improvements: Visual comparisons show TextPecker-optimized models produce crisp, aligned text with fewer off-target strings and reduced distortion compared to OCR-rewarded baselines.

5. Significance

Foundational Step: TextPecker addresses a critical gap in the VTR pipeline, shifting the focus from mere semantic correctness to structural faithfulness.
Generalizability: As a plug-and-play reward strategy, it can be integrated into RL training of text-to-image generators (here via Flow-GRPO) without architectural changes.
Future Impact: By providing a reliable mechanism to quantify structural quality, this work enables the development of truly reliable visual text generation tools for applications requiring high precision (e.g., document generation, signage, and multilingual publishing). It also opens new avenues for downstream tasks like local text editing and translation, which require precise structural understanding.