E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought

The Problem: The "AI Poster Factory" is Making Typos

Imagine a massive, high-tech factory that uses AI to print millions of online shopping posters every day. These posters need to look beautiful, show the product clearly, and have perfect text to sell the item.

For a long time, we had AI that could make the pictures look great. But when it came to Chinese text, the AI was like a brilliant artist who couldn't read. It would draw a beautiful picture of a shoe, but the text saying "Buy Now" might have a missing stroke, a weird line break, or a character that looks slightly wrong.

To a human expert, this is a disaster. To a standard AI quality checker, it might look "fine" because the colors are nice and the layout is balanced. We needed a way to teach computers to spot these tiny, critical mistakes just like a human expert would.

The Solution: E-comIQ-ZH

The researchers from Alibaba created a new system called E-comIQ-ZH. Think of it as a super-intelligent, human-like quality inspector specifically trained for Chinese e-commerce posters.

Here is how they built it, broken down into three simple steps:

1. The Training Manual (The Dataset: E-comIQ-18k)

You can't teach a student without a textbook. The researchers created a massive textbook called E-comIQ-18k.

The Content: It contains 18,000 real-world shopping posters.
The Grading: Instead of just giving a poster a "Pass" or "Fail," human experts graded them on four specific areas:
- Background: Is the scene right?
- Object: Is the product clear and undamaged?
- Text: Are the Chinese characters perfect? (This is the hardest part).
- Layout: Does everything look balanced?
The Secret Sauce (Chain-of-Thought): This is the most important part. The experts didn't just give a score; they wrote a detailed explanation (like a teacher's comment on a test). They explained why a score was low (e.g., "The character 'Happy' is missing a stroke on the left"). The AI learned from these explanations, not just the numbers.

2. The Student (The Model: E-comIQ-M)

Once the "textbook" was ready, they trained a special AI model called E-comIQ-M.

How it learns: First, it studied the textbook (Supervised Fine-Tuning) to learn the rules. Then, it practiced on the hardest examples using a technique called GRPO (Group Relative Policy Optimization).
The Analogy: Imagine a student taking a practice test. If they get a question wrong, the teacher doesn't just say "Wrong." The teacher says, "You got the answer wrong because you missed this specific detail." The student then learns to catch that detail next time.
The Result: This model learned to spot subtle errors (like a crooked character or a weird line break) that other powerful AI models (like GPT-4o or Gemini) often missed.
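For readers who want to see the "group relative" idea in GRPO, here is a minimal sketch. In the standard formulation of GRPO, the model writes several answers for the same input, each answer receives a reward, and each reward is compared with the average of its own group. The function name and the reward values are illustrative assumptions, not taken from the authors' code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled answers.

    Answers that beat the group average get a positive advantage,
    answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled evaluations of the same poster.
advantages = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
```

The training step then pushes the model toward the answers with positive advantages and away from those with negative ones.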

3. The Final Exam (The Benchmark: E-comIQ-Bench)

To prove their new inspector was the best, they created a final exam called E-comIQ-Bench.

They took 500 new products and asked top AI image generators (like Flux, GPT-4o, and Gemini) to create posters for them.
Then, they ran the posters through their new inspector (E-comIQ-M) and compared the scores to what human experts said.
The Outcome: E-comIQ-M was much closer to human judgment than any other existing tool. It successfully identified that a poster with a beautiful background but a typo in the text was actually a "bad" poster.

Why Does This Matter?

In the world of online shopping, trust is everything. If a poster has a typo or a weird glitch in the text, customers might think the brand is unprofessional or the product is fake.

Before: Companies had to hire armies of humans to look at every single AI-generated poster to catch mistakes. It was slow and expensive.
Now: With E-comIQ-ZH, companies can automatically check thousands of posters in seconds, catching the tiny text errors that humans would miss if they were tired, and ensuring the final ads look professional.

Summary Analogy

Think of the old AI quality checkers as art critics who only care if the painting is colorful and pretty. They would give a 5-star rating to a painting of a cat that says "C4T" instead of "CAT" because the colors were nice.

E-comIQ-ZH is like a strict editor. It looks at the painting, sees the "C4T," and says, "This is a 1-star poster because the text is broken, even if the cat looks cute." It aligns the computer's judgment with what a human actually cares about when buying things online.

1. Problem Statement

Generative AI is increasingly used to create commercial e-commerce posters, particularly in the Chinese market. However, the rapid advancement of image generation models has outpaced the development of reliable Automated Image Quality Assessment (IQA) tools.

Limitations of Existing Methods: Current general-purpose IQA models and Multimodal Large Language Models (MLLMs) focus on low-level distortions (blur, noise) or generic aesthetics. They fail to capture domain-specific functional criteria essential for e-commerce, such as:
- Text Accuracy: Subtle but critical errors in Chinese character strokes, line breaks, and spelling that render a poster commercially unusable.
- Functional Layout: Whether the text and product interact correctly (e.g., occlusion issues) and if the layout effectively highlights selling points.
The Gap: There is a lack of large-scale, human-aligned datasets with multi-dimensional scores and expert reasoning (Chain-of-Thought) to train evaluators that can replace slow, manual human review.

2. Methodology

The authors propose a comprehensive framework consisting of a dataset, a specialized evaluation model, and a benchmark.

A. E-comIQ-18k (Dataset)

Composition: A large-scale dataset containing 18,000 Chinese e-commerce posters sourced from six categories: Merchant HQ and Merchant LQ (high- and low-quality merchant posters), Open-Source, AI-Generated, AI-Edited, and Professional Designs.
Annotation Schema: Each image is annotated by senior e-commerce art directors across four functional dimensions:
1. Object: Visual integrity, clarity, and absence of distortion.
2. Background: Scene relevance and visual appeal.
3. Text: Legibility, correctness (no stroke errors), and copy quality.
4. Layout: Composition, hierarchy, and spatial arrangement.
Chain-of-Thought (CoT): To ensure the model learns why a score is given, the dataset includes expert-verified CoT rationales. These are generated via a Human-AI collaborative pipeline: an LLM (Qwen-2.5-VL-Max) generates a draft rationale based on scores and tags, which is then rigorously edited by human experts to remove hallucinations and correct reasoning errors.
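One way to picture a single annotated sample is as a small record holding the four dimension scores and the rationale. The sketch below is an assumption about shape only: the field names, the category label string, the file path, and the score values are hypothetical, and the source does not state the score scale.

```python
from dataclasses import dataclass

# The four functional dimensions named in the annotation schema.
DIMENSIONS = ("object", "background", "text", "layout")

@dataclass
class PosterAnnotation:
    image_path: str   # hypothetical path to the poster image
    source: str       # one of the six source categories (label string assumed)
    scores: dict      # dimension name -> numeric score (scale not given in the source)
    rationale: str    # expert-verified chain-of-thought text

    def __post_init__(self):
        # Reject records that lack a dimension score or a rationale.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimension scores: {missing}")
        if not self.rationale.strip():
            raise ValueError("rationale must not be empty")

# Hypothetical example record.
example = PosterAnnotation(
    image_path="posters/000001.jpg",
    source="AI-Generated",
    scores={"object": 4.0, "background": 4.5, "text": 1.5, "layout": 3.5},
    rationale="The product is intact, but one headline character is missing a stroke.",
)
```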
Reliability: The dataset achieves high inter-annotator agreement (Krippendorff's $\alpha \approx 0.86$).
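For reference, Krippendorff's alpha is defined as one minus the ratio of observed to expected disagreement. The sketch below computes it with the interval (squared-difference) distance. The choice of the interval metric and the input layout are assumptions, since the source does not say which variant was used.

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    units: list of lists; each inner list holds the ratings one item received.
    Items with fewer than two ratings are not pairable and are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: ordered rating pairs within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - d_o / d_e
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.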

B. E-comIQ-M (Evaluation Model)

Architecture: Based on Qwen2.5-VL-7B, fine-tuned to act as a specialized e-commerce evaluator.
Training Strategy (Two-Stage):
1. Supervised Fine-Tuning (SFT): Trained on 15k samples to learn the domain-specific scoring format, CoT reasoning patterns, and the relationship between visual features and scores.
2. Group Relative Policy Optimization (GRPO): A reinforcement learning stage applied to a "hard subset" (3k samples where the SFT model performed poorly). The reward function combines:
  - Accuracy Reward: Penalizes deviations from ground-truth scores, especially if the prediction crosses quality tiers (e.g., predicting "Good" when the truth is "Poor").
  - Distribution Reward: Encourages geometric consistency between the predicted sub-score vector and the ground truth vector.
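The two reward terms can be sketched as follows. This is an illustration under stated assumptions, not the authors' formula: the 1 to 5 score scale, the tier boundaries, the penalty size, the weight, and the use of cosine similarity for "geometric consistency" are all hypothetical choices.

```python
import math

# Assumed tier boundaries on an assumed 1-5 scale; the source specifies neither.
TIER_EDGES = (2.5, 3.5)  # below 2.5 = poor, 2.5 up to 3.5 = fair, 3.5 and above = good

def tier(score):
    """Return 0, 1, or 2 depending on how many tier edges the score reaches."""
    return sum(score >= edge for edge in TIER_EDGES)

def accuracy_reward(pred, truth, max_err=4.0, tier_penalty=0.5):
    """1.0 for an exact match, decreasing with absolute error,
    with an extra deduction when the prediction lands in a different tier."""
    reward = 1.0 - min(abs(pred - truth), max_err) / max_err
    if tier(pred) != tier(truth):
        reward -= tier_penalty
    return reward

def distribution_reward(pred_vec, truth_vec):
    """Cosine similarity between predicted and ground-truth sub-score vectors."""
    dot = sum(p * t for p, t in zip(pred_vec, truth_vec))
    norm = math.hypot(*pred_vec) * math.hypot(*truth_vec)
    return dot / norm if norm else 0.0

def total_reward(pred_vec, truth_vec, pred_overall, truth_overall, w_dist=0.5):
    """Weighted sum of the two terms; the weight is an assumption."""
    return (accuracy_reward(pred_overall, truth_overall)
            + w_dist * distribution_reward(pred_vec, truth_vec))
```

Under these assumptions, predicting 4.0 when the truth is 4.5 earns an accuracy reward of 0.875, while predicting 4.0 when the truth is 2.0 earns 0.0 because the error is larger and the prediction crosses tiers. Note that cosine similarity ignores overall scale, which is why it is paired with the accuracy term here.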
Output: The model outputs a structured JSON object containing scores for the four dimensions and an overall score, preceded by a natural language CoT rationale.
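A downstream tool would need to separate the rationale from the score object. The sketch below shows one way to do that with the standard library. The key names and the sample output string are assumptions, since the source does not give the exact JSON schema.

```python
import json

# Assumed key names for the four dimensions plus the overall score.
REQUIRED_KEYS = ("object", "background", "text", "layout", "overall")

def split_rationale_and_scores(output):
    """Return (rationale_text, scores_dict) from a model response in which
    free-text reasoning is followed by a JSON object of scores."""
    decoder = json.JSONDecoder()
    start = output.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(output, start)
        except json.JSONDecodeError:
            obj = None  # this brace did not start valid JSON; keep scanning
        if isinstance(obj, dict) and all(k in obj for k in REQUIRED_KEYS):
            return output[:start].strip(), obj
        start = output.find("{", start + 1)
    raise ValueError("no score object found in model output")

# Hypothetical model response.
sample = (
    "The headline character is missing a stroke, so the text score is low.\n"
    '{"object": 4.0, "background": 4.5, "text": 1.5, "layout": 3.5, "overall": 2.5}'
)
rationale, scores = split_rationale_and_scores(sample)
```

Scanning every opening brace, instead of only the first, keeps the parser from failing when the rationale itself happens to contain a brace.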

C. E-comIQ-Bench (Benchmark)

Protocol: A benchmark containing 500 test cases. Each case includes a product cutout, a Chinese prompt derived from selling points, and an original merchant poster as a reference.
Evaluation: Leading text-to-image models (e.g., GPT-4o, Gemini, Flux, Seedream) generate posters, which are then scored by both human experts and E-comIQ-M to measure alignment and model performance.
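To compare generators, the per-poster scores can be averaged by generator and dimension. The sketch below assumes a simple tuple layout for the score records; the generator names and values are placeholders, not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean

def mean_scores_by_generator(records):
    """records: iterable of (generator_name, dimension, score) tuples.
    Returns {generator_name: {dimension: mean_score}}."""
    buckets = defaultdict(list)
    for generator, dimension, score in records:
        buckets[(generator, dimension)].append(score)
    table = defaultdict(dict)
    for (generator, dimension), scores in buckets.items():
        table[generator][dimension] = mean(scores)
    return dict(table)

# Placeholder records for two hypothetical generators.
records = [
    ("generator_a", "text", 2.0), ("generator_a", "text", 3.0),
    ("generator_a", "layout", 4.0),
    ("generator_b", "text", 4.0), ("generator_b", "layout", 3.0),
]
summary = mean_scores_by_generator(records)
```

Running the same aggregation once on human scores and once on evaluator scores gives two tables whose agreement can then be checked.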

3. Key Contributions

E-comIQ-18k: The first large-scale dataset explicitly targeting Chinese e-commerce poster assessment, featuring multi-dimensional functional scores and expert-verified CoT rationales.
E-comIQ-M: A domain-specific evaluation model that significantly outperforms general-purpose MLLMs and existing IQA tools in aligning with human expert judgment, particularly in detecting subtle Chinese text rendering errors.
E-comIQ-Bench: The first automated, scalable benchmark for evaluating how well generative models produce Chinese e-commerce posters, enabling rigorous comparison of state-of-the-art models.
Human-AI Collaboration Pipeline: A novel methodology for generating high-quality CoT rationales at scale by combining LLM generation with expert editing.

4. Experimental Results

Model Performance: E-comIQ-M achieves the highest correlation with human experts on the test set.
- Correlation: It reaches an overall Spearman Rank Correlation (SRCC) of 0.433 and Pearson (PLCC) of 0.425, significantly outperforming the best baseline (Qwen2.5-VL-7B+SFT at 0.346 SRCC) and general MLLMs like GPT-4o (0.219 SRCC).
- Accuracy: It achieves 55.6% accuracy (Acc@0.5) on the overall score, a substantial improvement over baselines.
- Dimension Specifics: The model shows particular strength in the Text and Layout dimensions, areas where general models typically fail due to a lack of domain knowledge.
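The three headline metrics can be computed with the standard library alone, as sketched below. SRCC and PLCC follow their usual definitions. Acc@0.5 is assumed here to mean the share of predictions within 0.5 of the human score, which the source does not spell out. The score lists in the usage note are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation (PLCC). Undefined for constant inputs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def acc_at(pred, truth, tol=0.5):
    """Share of predictions within tol of the reference score (assumed definition)."""
    return sum(abs(p - t) <= tol for p, t in zip(pred, truth)) / len(truth)
```

For example, with made-up human scores `[1, 2, 3, 4, 5]` and model scores `[1.2, 1.9, 3.6, 3.9, 4.4]`, the ordering is preserved, so the rank correlation is at its maximum, while three of the five predictions fall within 0.5 of the human score.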
Ablation Studies: The two-stage training (SFT + GRPO) is crucial. SFT alone improves performance over the base model, but GRPO further refines score calibration and distribution alignment. The "Complex" reward (Accuracy + Distribution) outperforms "Simple" (Accuracy only).
Benchmark Insights: On E-comIQ-Bench, current generative models (e.g., Flux, Gemini) still lag behind human-designed posters, with Text and Object fidelity being the primary bottlenecks. Notably, standard OCR-based metrics often fail to detect subtle Chinese character rendering errors that E-comIQ-M correctly identifies.

5. Significance

Bridging the Gap: This work addresses a critical bottleneck in commercial AI-generated content (AIGC): the lack of automated tools that understand the specific functional requirements of e-commerce (especially for Chinese content).
Scalability: By providing a model that aligns with human experts, the framework enables scalable, automated quality control for e-commerce platforms, reducing reliance on slow manual reviews.
Fine-Grained Diagnosis: Unlike holistic scoring, E-comIQ-M provides diagnostic feedback (via CoT) on specific defects (e.g., "stroke rendering error in character '感'"), which is vital for iterating and improving generative models.
Open Science: The release of the dataset, model, and benchmark tools fosters future research in domain-specific visual quality assessment and human-aligned AI evaluation.