LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

This paper introduces LookBench, a live and holistic open benchmark for fashion image retrieval that features time-stamped real-world and AI-generated images, a fine-grained attribute taxonomy, and periodic updates to provide a durable, contamination-aware evaluation of retrieval models in real e-commerce settings.

Gensmo.ai, Chao Gao, Siqiao Xue, Yimin Peng, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou

Published 2026-02-24

Imagine you are shopping for a very specific outfit. You see a photo of a stranger on the street wearing a perfect "high-neck, cream-colored, oversized knit sweater" paired with "wide-leg denim jeans." You want to find that exact sweater, or at least one that looks and feels exactly the same, to buy online.

This is the daily challenge of Fashion Image Retrieval. But the "tests" used to check whether computer programs are good at this job have long been outdated. It is like using a map from 1990 to navigate a city that has completely changed since then.

Here is a simple breakdown of the paper "LookBench" and what the authors achieved, using some everyday analogies.

1. The Problem: The "Old Map" vs. The "Live City"

For years, researchers tested fashion search engines using static datasets (like DeepFashion or Fashion200K).

  • The Analogy: Imagine training a taxi driver using a map of a city from 2010. The driver might memorize the streets perfectly, but if you take them to the city today, they will get lost because new buildings, one-way streets, and traffic patterns have changed.
  • The Reality: Modern AI models (like CLIP or DINO) are trained on massive amounts of internet data. Because the old fashion test datasets are so old, these AI models have likely already "seen" the test pictures during their training. It's like giving a student a practice exam that they already have the answers to. They get a perfect score, but they aren't actually smart; they just memorized the test.

2. The Solution: LookBench (The "Live" Test)

The authors created LookBench, a new, "live" benchmark.

  • The Analogy: Instead of a static map, LookBench is like a live traffic camera feed. It constantly updates with new photos taken today from real e-commerce websites and even AI-generated fashion images.
  • Key Feature: Every photo has a "timestamp." If a computer model was trained on data from 2023, but LookBench uses photos from 2025, the model can't cheat by memorizing the answers. It has to actually understand the clothes.
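
To make the timestamp idea concrete, here is a minimal Python sketch of how a time-based filter could work. It is only an illustration: the `BenchmarkImage` class, its field names, and the `contamination_free` helper are hypothetical and are not taken from the paper or its released code. The assumption is simply that each benchmark image carries a date and that we know the cutoff date of a model's training data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkImage:
    image_id: str      # hypothetical identifier, for illustration only
    captured_on: date  # the timestamp attached to the photo


def contamination_free(images, training_cutoff):
    """Keep only images dated strictly after the model's training cutoff."""
    return [img for img in images if img.captured_on > training_cutoff]


images = [
    BenchmarkImage("a", date(2022, 11, 3)),
    BenchmarkImage("b", date(2025, 1, 15)),
    BenchmarkImage("c", date(2025, 6, 30)),
]

# A model trained on data up to the end of 2023 could have seen image "a",
# so only the 2025 images are a fair test for it.
fresh = contamination_free(images, training_cutoff=date(2023, 12, 31))
fresh_ids = [img.image_id for img in fresh]
print(fresh_ids)  # ['b', 'c']
```

Any image dated on or before the cutoff is treated as "possibly seen" and dropped, which is the conservative choice when the goal is to rule out memorization.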

3. The Four Levels of Difficulty

LookBench isn't just one test; it's like a video game with four levels of difficulty, ranging from "Easy" to "Nightmare Mode."

  • Level 1: RealStudioFlat (Easy)
    • What it is: Clean, white-background photos of a single shirt or dress, just like you see on a store website.
    • The Goal: Find the exact same shirt.
  • Level 2: AIGen-Studio (Medium)
    • What it is: AI-generated photos of clothes in a studio setting.
    • The Goal: Find the item even though the lighting and texture look slightly different, because the image was made by a computer.
  • Level 3: AIGen-StreetLook (Hard)
    • What it is: AI-generated photos of people wearing full outfits on busy streets.
    • The Goal: Find the specific jacket or shoes in a messy, complex scene.
  • Level 4: RealStreetLook (Nightmare Mode)
    • What it is: Real photos of people on the street. The clothes are wrinkled, partially hidden by bags, the lighting is weird, and the person is walking.
    • The Goal: This is the hardest test. The computer has to ignore the background, the pose, and the wrinkles to find the exact item.

4. The Secret Sauce: "Attribute" Training

Most AI just looks at the "big picture." LookBench forces the AI to look at the details.

  • The Analogy: Imagine a detective. A bad detective says, "It's a red car." A good detective says, "It's a red 2024 sedan with a scratch on the left door and a specific license plate."
  • How they did it: The authors taught their AI to recognize over 100 specific details (attributes) like "V-neck," "sleeveless," "linen," or "pleated." They used a large AI model (Qwen) to label these details on thousands of photos. Now, when the computer searches, it doesn't just look for "red"; it looks for "red, V-neck, linen."
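
Here is a small Python sketch of what "searching by details" can look like. It is a toy illustration and not the authors' method: it assumes each item has already been labeled with a set of attributes, and it ranks catalog items by how much their attribute sets overlap with the query (Jaccard similarity). The item names and the `rank_by_attributes` helper are made up for this example.

```python
def jaccard(a, b):
    """Overlap between two attribute sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_attributes(query_attrs, catalog):
    """Return catalog item ids sorted from best to worst attribute match."""
    return sorted(
        catalog,
        key=lambda item_id: jaccard(query_attrs, catalog[item_id]),
        reverse=True,
    )


query = {"red", "v-neck", "linen"}
catalog = {
    "dress_1": {"red", "v-neck", "linen"},     # all three details match
    "dress_2": {"red", "crew-neck", "cotton"},  # only the color matches
    "dress_3": {"blue", "v-neck", "linen"},     # cut and fabric match
}

ranking = rank_by_attributes(query, catalog)
print(ranking)  # ['dress_1', 'dress_3', 'dress_2']
```

Notice that the blue dress outranks the red one: it shares two details with the query instead of one. A search that only looked at color would get this backwards.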

5. The Results: Who Won the Race?

The authors tested many famous AI models against LookBench.

  • The Generic Models (CLIP, DINO): These are like general-purpose detectives. They are good at finding "a car," but they failed miserably at finding "a red 2024 sedan with a scratch." On the hardest test (RealStreetLook), many got less than 40% right.
  • The Fashion Specialists (Marqo): These are better, like detectives who specialize in cars. They got around 60-66% right.
  • The Winners (GR-Pro & GR-Lite): The authors built their own models specifically trained on their "attribute" system.
    • GR-Pro: The "Pro" version is a super-advanced model (kept secret for business reasons) that got the highest scores.
    • GR-Lite: The "Lite" version is open-source (free for everyone to use). It performed almost as well as the Pro version and crushed all other public models.
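
What does "got X% right" mean for a search engine? This summary does not name the exact metric, so the Python sketch below assumes a common one for retrieval, Recall@k: the fraction of queries for which the correct item shows up among the top k results. The query ids, item ids, and rankings are invented for illustration and are not results from the paper.

```python
def recall_at_k(rankings, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top-k results."""
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if ground_truth[query_id] in ranked[:k]
    )
    return hits / len(rankings)


# Each query maps to the items a model returned, best guess first.
rankings = {
    "q1": ["item_7", "item_2", "item_9"],
    "q2": ["item_4", "item_1", "item_3"],
    "q3": ["item_5", "item_8", "item_6"],
    "q4": ["item_2", "item_0", "item_1"],
}
# The item each query was really looking for.
ground_truth = {"q1": "item_7", "q2": "item_1", "q3": "item_6", "q4": "item_9"}

r1 = recall_at_k(rankings, ground_truth, k=1)
r3 = recall_at_k(rankings, ground_truth, k=3)
print(r1, r3)  # 0.25 0.75
```

With k=1 the model only gets credit when its very first guess is correct, so the score is stricter. Allowing the top three guesses raises the score because the right item only needs to appear somewhere in the short list.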

6. Why This Matters

This paper changes the game in two ways:

  1. It stops the cheating: By using fresh, timestamped data, we can finally tell which AI models are actually smart and which ones just memorized old test questions.
  2. It raises the bar: It shows that to truly understand fashion, AI needs to pay attention to tiny details (like the type of fabric or the cut of a sleeve), not just the general shape.

In summary: LookBench is a new, constantly updating "driving test" for fashion AI. It shows that to find the perfect outfit in a crowded digital world, you need a detective that notices the tiny details, not just a robot that guesses the general color. The authors have released their open-source "detective" (GR-Lite) to the public so everyone can build better fashion search engines.
