Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

This paper addresses the scarcity of expert textual-relevance labels in large-scale app store search by fine-tuning a specialized LLM to generate millions of high-quality labels. When these labels are used to augment the production ranker, they significantly improve both offline metrics and real-world conversion rates, particularly for tail queries that lack reliable behavioral data.

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad, Sean Suchter, Venkat Sundaranatha

Published Tue, 10 Ma

Imagine the App Store is a massive, bustling library with millions of books (apps). Your job is to be the librarian who helps visitors find exactly what they want.

To do this job well, you need to answer two questions for every book on the shelf:

  1. The "Click" Test (Behavioral Relevance): Do people actually pick this book up and buy it when they see it? (This is easy to track; you just count the sales.)
  2. The "Description" Test (Textual Relevance): Do the book's title and summary actually match what the visitor asked for? (This is hard to track; you need a human expert to read the book and the request and decide whether they match.)
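The two tests can be sketched in code. This is an illustrative toy, not the paper's implementation: the function names and the word-overlap formula are hypothetical stand-ins for the real signals.

```python
# Toy sketch of the two relevance signals for a (query, app) pair.
# Names and formulas are hypothetical, not from the paper.

def behavioral_relevance(impressions: int, downloads: int) -> float:
    """The 'Click' test: of the people who saw the app, how many got it?"""
    if impressions == 0:
        return 0.0  # no behavioral data at all -- the tail-query problem
    return downloads / impressions

def textual_relevance(query: str, app_text: str) -> float:
    """The 'Description' test, crudely approximated here by word overlap.
    In the paper, this judgment comes from human experts (and later the LLM)."""
    query_words = set(query.lower().split())
    app_words = set(app_text.lower().split())
    return len(query_words & app_words) / len(query_words) if query_words else 0.0

print(behavioral_relevance(1000, 120))   # plenty of click data to count
print(textual_relevance("cat knitting patterns",
                        "Knitting patterns and tutorials for cat lovers"))
```

The asymmetry is visible right away: `behavioral_relevance` is just counting, while a real `textual_relevance` requires someone (or something) to actually read and judge.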

The Problem: The "Expert" Bottleneck

The library is so huge that you have millions of "Click" records, but you only have a tiny handful of "Description" records because hiring human experts to read and judge every book is too slow and expensive.

Because of this, your librarian system is great at showing popular books, but it sometimes fails at showing the right book for specific, weird, or rare requests (like "an app for knitting cats in space"). Without enough expert judges, the system gets lost on these rare requests.

The Solution: The "Super-Intern" (The LLM)

The authors of this paper decided to hire a Super-Intern (a Large Language Model, or LLM) to help the human experts.

Instead of asking the intern to just guess, they trained it by showing it thousands of examples of how the real human experts judged books. They asked the intern: "Here is a request and a book. Based on what you learned from the experts, does this book match the request?"
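In practice, "showing it examples" means packaging each expert judgment as a training pair. The sketch below is a plausible shape for such a pair; the paper's actual prompt format and label scale may differ.

```python
# Hypothetical sketch of turning an expert judgment into a fine-tuning
# example for the "Specialized Intern". Field names and prompt wording
# are assumptions, not the paper's actual format.

def to_training_example(query: str, app_title: str, app_desc: str,
                        expert_label: str) -> dict:
    """Bundle one (request, book, expert verdict) triple into a
    prompt/completion pair the model can learn from."""
    prompt = (
        f"Query: {query}\n"
        f"App title: {app_title}\n"
        f"App description: {app_desc}\n"
        "Question: how relevant is this app to the query?"
    )
    return {"prompt": prompt, "completion": expert_label}

example = to_training_example(
    "sleep tracker",
    "DreamLog",
    "Track your sleep cycles and wake up refreshed.",
    "highly relevant",
)
print(example["completion"])
```

Thousands of pairs like this teach the intern the library's specific grading rubric, rather than asking it to guess from general world knowledge.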

The Big Discovery:
They tried two types of interns:

  1. A Giant Intern (a huge, pre-trained AI model) who knows everything about the world but hasn't been trained on your specific library rules.
  2. A Specialized Intern (a smaller model that was specifically trained on your library's rules).

The Surprise: The Specialized Intern was far better than the Giant Intern. Even though the Giant Intern was ten times bigger, the Specialized Intern, having studied your specific rules, gave much more accurate answers. It was like a junior employee who knows your company's specific culture beating a genius who just walked in off the street.

The Result: A "Force Multiplier"

Once they found the best Specialized Intern, they let it loose. In a single night, it generated millions of "Description" labels, effectively doing the work of thousands of human experts.

They then fed all this new data back into the librarian system.

The Payoff: The "Pareto Frontier" Shift

In the paper, they talk about something called the "Pareto Frontier." Let's translate that:
Imagine a graph where the X-axis is "How much people click" and the Y-axis is "How accurate the description is." Usually, if you try to improve one, the other gets worse. You have to make a trade-off.

What happened here?
By adding the millions of new labels from the Super-Intern, the librarian system didn't just move up or to the right; it moved diagonally outward.

  • It got better at matching descriptions (Textual Relevance).
  • AND it also got better at getting people to click (Behavioral Relevance).

It's as if the librarian suddenly became smarter at reading, which made them so much more helpful that people started buying more books, too.
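The "diagonally outward" claim has a precise meaning: the new system Pareto-dominates the old ones. Here is a minimal sketch of that idea, with made-up metric values (the paper does not publish these exact numbers).

```python
# Toy illustration of a Pareto-frontier shift. Each point is
# (behavioral_metric, textual_metric); the values are hypothetical.

def dominates(a, b):
    """Point a dominates b if it is at least as good on both axes
    and strictly better on at least one (i.e., not identical)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

old_rankers = [(0.70, 0.60), (0.75, 0.55), (0.65, 0.65)]  # the old trade-off curve
new_ranker = (0.76, 0.66)  # better on BOTH axes: the diagonal-outward move

print(all(dominates(new_ranker, p) for p in old_rankers))  # True
print(pareto_frontier(old_rankers + [new_ranker]))
```

Normally you slide *along* the frontier, trading one metric for the other; the paper's result is that the whole frontier moved, so no trade-off was needed.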

The Real-World Win: Saving the "Tail" Queries

The most exciting part happened with the "Tail Queries."

  • Head Queries: "iPhone 15 case." (Everyone searches this; we have tons of data on what people click).
  • Tail Queries: "App to translate ancient Sumerian poetry." (Very few people search this; we have almost no data on what people click).
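The head/tail distinction is just a frequency split over the search log. A minimal sketch, assuming a made-up cutoff (the paper's actual threshold and definition may differ):

```python
# Hypothetical head/tail split by raw query frequency.
from collections import Counter

search_log = (["iphone 15 case"] * 5000
              + ["sleep tracker"] * 800
              + ["translate ancient sumerian poetry"] * 2)

counts = Counter(search_log)
HEAD_THRESHOLD = 100  # assumed cutoff for this sketch

head = {q for q, n in counts.items() if n >= HEAD_THRESHOLD}
tail = {q for q, n in counts.items() if n < HEAD_THRESHOLD}

print(sorted(tail))  # the queries with too little click data to learn from
```

For everything in `tail`, the "Click" test is nearly useless (two data points tell you nothing), which is exactly where the LLM-generated "Description" labels take over.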

For the rare, weird searches, the old system was guessing in the dark. The new system, powered by the Super-Intern's millions of "Description" labels, suddenly knew exactly what to show.

The Bottom Line:
In a live test run across the globe, this new system increased the number of people downloading apps by 0.24%.

  • Is 0.24% small? For a single person, maybe.
  • For the App Store? That's millions of extra downloads and happy users.
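A quick back-of-the-envelope check shows why a fraction of a percent matters at this scale. The daily volume below is an assumed round number for illustration, not a figure from the paper:

```python
# Back-of-the-envelope scale check. daily_downloads is a hypothetical
# round number; only the 0.24% lift comes from the paper.
daily_downloads = 100_000_000      # assumed store-wide daily volume
lift = 0.0024                      # the reported +0.24% download lift

extra_per_day = daily_downloads * lift
print(f"{extra_per_day:,.0f} extra downloads per day")
```

At a hundred million downloads a day, 0.24% is on the order of a quarter-million extra downloads every single day.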

In short: They used a smart, specialized AI to act as a "force multiplier" for human experts, generating millions of new clues to help the App Store find the perfect app for even the weirdest searches, making the whole system better for everyone.