E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought

This paper introduces E-comIQ-ZH, a comprehensive framework comprising the E-comIQ-18k dataset with expert-calibrated Chain-of-Thought rationales, the E-comIQ-M evaluation model, and the E-comIQ-Bench benchmark, designed to provide the first automated, human-aligned, and fine-grained assessment of Chinese e-commerce posters.

Meiqi Sun, Mingyu Li, Junxiong Zhu

Published 2026-02-26

The Problem: The "AI Poster Factory" is Making Typos

Imagine a massive, high-tech factory that uses AI to print millions of online shopping posters every day. These posters need to look beautiful, show the product clearly, and have perfect text to sell the item.

For a long time, we had AI that could make the pictures look great. But when it came to Chinese text, the AI was like a brilliant artist who couldn't read. It would draw a beautiful picture of a shoe, but the text saying "Buy Now" might have a missing stroke, a weird line break, or a character that looks slightly wrong.

To a human expert, this is a disaster. To a standard AI quality checker, it might look "fine" because the colors are nice and the layout is balanced. We needed a way to teach computers to spot these tiny, critical mistakes just like a human expert would.

The Solution: E-comIQ-ZH

The researchers from Alibaba created a new system called E-comIQ-ZH. Think of it as a quality inspector trained specifically for Chinese e-commerce posters, one that judges them the way a human expert would.

Here is how they built it, broken down into three simple steps:

1. The Training Manual (The Dataset: E-comIQ-18k)

You can't teach a student without a textbook. The researchers created a massive textbook called E-comIQ-18k.

  • The Content: It contains 18,000 real-world shopping posters.
  • The Grading: Instead of just giving a poster a "Pass" or "Fail," human experts graded them on four specific areas:
    • Background: Is the scene right?
    • Object: Is the product clear and undamaged?
    • Text: Are the Chinese characters perfect? (This is the hardest part.)
    • Layout: Does everything look balanced?
  • The Secret Sauce (Chain-of-Thought): This is the most important part. The experts didn't just give a score; they wrote a detailed explanation (like a teacher's comment on a test). They explained why a score was low (e.g., "The character 'Happy' is missing a stroke on the left"). The AI learned from these explanations, not just the numbers.
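
To make the "textbook" idea concrete, here is a minimal sketch of what one graded poster might look like as a data record. The paper's actual file format is not described in this summary, so the class name, field names, and the 1-5 score scale below are all assumptions chosen for illustration. Only the four graded areas and the idea of a written explanation per score come from the text above.

```python
from dataclasses import dataclass
from typing import Dict

# The four graded areas named in the article.
DIMENSIONS = ("background", "object", "text", "layout")


@dataclass
class PosterAnnotation:
    """Hypothetical record layout; field names and the 1-5 scale are assumptions."""
    poster_id: str
    scores: Dict[str, int]      # one expert score per dimension
    rationales: Dict[str, str]  # the written explanation behind each score

    def validate(self, low: int = 1, high: int = 5) -> None:
        """Check that every dimension has an in-range score and an explanation."""
        for dim in DIMENSIONS:
            if dim not in self.scores or dim not in self.rationales:
                raise ValueError(f"missing score or rationale for {dim!r}")
            if not low <= self.scores[dim] <= high:
                raise ValueError(f"{dim!r} score outside {low}-{high}")

    def weakest_dimension(self) -> str:
        """Return the dimension with the lowest score."""
        return min(DIMENSIONS, key=lambda dim: self.scores[dim])


example = PosterAnnotation(
    poster_id="poster-0001",  # made-up identifier
    scores={"background": 5, "object": 4, "text": 1, "layout": 4},
    rationales={
        "background": "Scene matches the product category.",
        "object": "Product is sharp, with a slight shadow artifact.",
        "text": "One character is missing a stroke on the left.",
        "layout": "Elements are balanced, but the text sits close to the edge.",
    },
)
example.validate()
print(example.weakest_dimension())  # text
```

Keeping the explanation next to each score is what lets a model learn the reason behind a low grade rather than only the number.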

2. The Student (The Model: E-comIQ-M)

Once the "textbook" was ready, they trained a special AI model called E-comIQ-M.

  • How it learns: First, it studied the textbook (Supervised Fine-Tuning) to learn the rules. Then, it practiced on the hardest examples using a technique called GRPO (Group Relative Policy Optimization).
  • The Analogy: Imagine a student writing several different answers to the same practice question. The teacher scores every answer, and the student learns to favor whatever made the above-average answers better than the rest of the group.
  • The Result: This model learned to spot subtle errors (like a crooked character or a weird line break) that other powerful AI models (like GPT-4o or Gemini) completely missed.
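
The "group relative" part of GRPO can be sketched in a few lines: the model produces several answers for the same poster, each answer gets a reward, and each reward is then compared with the average of its own group. The normalization below follows the general GRPO recipe. The reward function is purely hypothetical (this summary does not say how the paper scores an answer), and `score_reward`, `max_gap`, and the sample numbers are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def score_reward(predicted: int, expert: int, max_gap: int = 4) -> float:
    """Hypothetical reward: 1.0 for matching the expert score, lower as the gap grows."""
    return 1.0 - abs(predicted - expert) / max_gap


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Answers that beat the group average get a positive advantage and are
    reinforced; below-average answers get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one poster whose expert text score is 2 (made-up numbers).
predicted_scores = [2, 5, 4, 2]
rewards = [score_reward(p, expert=2) for p in predicted_scores]
advantages = group_relative_advantages(rewards)

for p, a in zip(predicted_scores, advantages):
    print(f"predicted={p} advantage={a:+.2f}")
```

Because the comparison is made inside the group, no separate "judge" model is needed to estimate how good an average answer should be; the group's own mean plays that role.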

3. The Final Exam (The Benchmark: E-comIQ-Bench)

To prove their new inspector was the best, they created a final exam called E-comIQ-Bench.

  • They took 500 new products and asked top AI image generators (like Flux, GPT-4o, and Gemini) to create posters for them.
  • Then, they ran the posters through their new inspector (E-comIQ-M) and compared the scores to what human experts said.
  • The Outcome: E-comIQ-M was much closer to human judgment than any other existing tool. It successfully identified that a poster with a beautiful background but a typo in the text was actually a "bad" poster.
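
One common way to measure "closeness to human judgment" is rank correlation: do the checker and the experts put the posters in the same order? The sketch below uses Spearman correlation as an assumed metric, since this summary does not name the one the paper reports, and every score in it is invented for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # 1-based average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either input has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for five posters.
human = [1, 4, 3, 5, 2]
careful_checker = [1.2, 3.8, 3.1, 4.9, 2.5]  # orders posters the way the experts do
looks_only_checker = [4.5, 4.0, 4.2, 4.8, 4.4]  # rates everything "pretty"

print(spearman(human, careful_checker))
print(spearman(human, looks_only_checker))
```

A checker that rewards a typo-ridden poster for its nice background ends up ordering posters differently from the experts, and the correlation drops accordingly.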

Why Does This Matter?

In the world of online shopping, trust is everything. If a poster has a typo or a weird glitch in the text, customers might think the brand is unprofessional or the product is fake.

  • Before: Companies had to hire armies of humans to look at every single AI-generated poster to catch mistakes. It was slow and expensive.
  • Now: With E-comIQ-ZH, companies can check thousands of posters automatically, catching the tiny text errors that a tired human reviewer might miss and ensuring the final ads look professional.

Summary Analogy

Think of the old AI quality checkers as art critics who only care if the painting is colorful and pretty. They would give a 5-star rating to a painting of a cat that says "C4T" instead of "CAT" because the colors were nice.

E-comIQ-ZH is like a strict editor. It looks at the painting, sees the "C4T," and says, "This is a 1-star poster because the text is broken, even if the cat looks cute." It aligns the computer's judgment with what a human actually cares about when buying things online.
