Imagine you are the editor-in-chief of a massive newspaper with hundreds of millions of potential stories (candidates) to choose from every morning. Your goal is to pick the top 100 stories to show your readers.
You can't read every single story yourself; it would take forever. So, you hire a team of assistants (the recommendation system) to help you.
The Problem: The "One-Size-Fits-All" Mistake
In the old way of doing things, you had a single, very smart, but very slow assistant. You threw all the stories at them at once.
The paper argues this is inefficient for two main reasons:
The "Noisy Classroom" Problem (Gradient Conflicts):
Imagine your assistant is trying to learn what stories people like.
- Easy Stories: A story about "How to boil water" is obviously boring. Your assistant knows this immediately.
- Hard Stories: A story about "A hidden gem restaurant in your neighborhood" is tricky. It looks appealing, but it might turn out to be a dud.
- The Conflict: When you mix these together, the assistant gets confused. The "Hard" stories scream so loudly ("Look at me! I'm tricky!") that they drown out the "Easy" ones. The assistant spends all its energy trying to solve the hard puzzles and ignores the easy ones, or worse, gets frustrated and learns the wrong lessons. It's like a teacher trying to teach a class where the smartest students are shouting so loud that the quiet students can't learn anything.
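The "shouting" intuition can be made concrete with a tiny numeric sketch (the numbers are illustrative, not from the paper): with logistic loss, a sample's gradient magnitude is |p − y|, so samples the model gets confidently wrong contribute far larger gradients than samples it already handles well.

```python
import numpy as np

# Illustrative only: three "easy" samples the model already scores
# near-correctly, and two "hard" samples it gets confidently wrong.
# All five have true label y = 0.
p_easy = np.array([0.05, 0.08, 0.10])  # predicted probabilities, near 0: easy
p_hard = np.array([0.90, 0.85])        # predicted near 1, but label is 0: hard
y = 0.0

# For logistic loss, per-sample gradient magnitude is |p - y|
g_easy = np.abs(p_easy - y)
g_hard = np.abs(p_hard - y)

# In a mixed batch, the hard samples dominate the total update signal
hard_share = g_hard.sum() / (g_easy.sum() + g_hard.sum())
print(f"hard samples' share of total gradient: {hard_share:.0%}")
```

Two hard samples out of five end up contributing the large majority of the batch's gradient, which is exactly the "loud students drown out the quiet ones" effect the authors describe.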
The "Overkill" Problem (Computational Waste):
Your super-smart assistant takes 10 seconds to read a story.
- If you ask them to read the "How to boil water" story, they waste 10 seconds on something that takes 1 second to understand.
- If you ask them to read the "Hidden Gem" story, they need those 10 seconds.
- The Waste: You are paying for expensive brainpower on simple tasks. It's like hiring a Nobel Prize-winning physicist to count the number of apples in a basket. They can do it, but it's a waste of money and time.
The Solution: HAP (The Smart Team)
The authors propose a new system called HAP (Heterogeneity-Aware Adaptive Pre-ranking). Think of HAP not as one person, but as a two-stage assembly line with a smart manager.
Step 1: The "Quick Scan" (Lightweight Model)
First, the stories go to a fast, cheap intern.
- This intern is good at spotting the obvious junk.
- They quickly scan the "How to boil water" stories and the random noise.
- They say, "This is boring, throw it away."
- Result: 90% of the stories are filtered out instantly with very little effort.
Step 2: The "Deep Dive" (Expressive Model)
The remaining 10% of stories are the tricky ones—the "Hard" stories that look interesting but might be bad.
- These are passed to the Nobel Prize-winning physicist (the heavy-duty model).
- Because the intern already filtered out the easy stuff, the expert only has to focus on the difficult puzzles.
- The expert uses their full brainpower to decide which of these tricky stories are actually great.
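The two-stage assembly line can be sketched in a few lines of Python. Everything here is an illustration: the dummy `light_score` and `heavy_score` functions stand in for the paper's learned lightweight and expressive models, and the 10% keep ratio mirrors the example above.

```python
import random

# Dummy scorers: in reality both would be learned neural rankers.
# Candidates are plain numbers, where a larger number means a better story.
def light_score(x):
    return x + random.gauss(0, 5)   # cheap but noisy estimate of quality

def heavy_score(x):
    return x                        # expensive but accurate estimate

def cascade_rank(candidates, keep_ratio=0.1, top_k=10):
    # Stage 1: the "intern" scans everything, keeps roughly the top 10%
    survivors = sorted(candidates, key=light_score, reverse=True)
    survivors = survivors[:max(top_k, int(len(candidates) * keep_ratio))]
    # Stage 2: the "expert" scores only the survivors
    return sorted(survivors, key=heavy_score, reverse=True)[:top_k]

random.seed(0)
pool = list(range(1000))            # 1000 candidate stories
picks = cascade_rank(pool)          # heavy model ran on ~100 items, not 1000
```

Even though the intern's scores are noisy, the genuinely best stories almost always survive the cheap first pass, so the expert's expensive attention is spent only where it matters.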
The Secret Sauce: "Harmonizing the Noise"
The paper also introduces a special training technique called GHCL (Gradient-Harmonized Contrastive Learning).
- The Metaphor: Imagine the intern and the expert are in a room together learning. Usually, the expert's loud voice (strong gradients from hard samples) drowns out the intern's quiet observations.
- The Fix: HAP puts a "soundproof glass" between the two groups. It teaches the intern to learn from the easy stories and the expert to learn from the hard stories, separately. Then, it combines their lessons so they don't fight each other. This way, the system learns from everything without getting confused.
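The "soundproof glass" idea can be sketched generically. Note the assumptions: this is a generic gradient-harmonization illustration, not the paper's exact GHCL formulation, and the median split and per-group rescaling are choices made for the sketch.

```python
import numpy as np

# Tiny linear model with logistic loss on a random batch (illustrative data)
rng = np.random.default_rng(0)
w = rng.normal(size=4)                  # model weights
X = rng.normal(size=(32, 4))            # batch of 32 samples, 4 features
y = rng.integers(0, 2, size=32).astype(float)

p = 1 / (1 + np.exp(-X @ w))            # predicted probabilities
residual = np.abs(p - y)                # big residual = "hard" sample
per_sample_grad = (p - y)[:, None] * X  # logistic-loss gradient per sample

# Split the batch into hard and easy halves at the median residual
hard = residual > np.median(residual)
g_hard = per_sample_grad[hard].mean(axis=0)
g_easy = per_sample_grad[~hard].mean(axis=0)

def unit(g):
    """Rescale a gradient to unit norm so no group dominates."""
    n = np.linalg.norm(g)
    return g / n if n > 0 else g

# Harmonized update: each group contributes an equally-sized gradient,
# instead of the hard samples drowning out the easy ones.
w -= 0.1 * (unit(g_easy) + unit(g_hard))
```

The key move is that the easy and hard groups are given separate, equally-weighted voices before their lessons are combined, which is the spirit of the "learn separately, then merge" fix described above.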
The Results: Why It Matters
When the authors deployed this system in the real world (on the Toutiao news app, which has hundreds of millions of users):
- Better Recommendations: People stayed in the app longer and opened it more often because the stories were actually better.
- Cheaper & Faster: Even though they added a "super-expert" model, the system actually became cheaper to run. Why? Because the super-expert only looked at the top 10% of stories, while the cheap intern did the heavy lifting for the rest.
- No Added Lag: The time it took to serve a story to a user (latency) didn't increase; it stayed the same or improved.
Summary
HAP is like realizing that not all problems require a PhD to solve.
- It uses a fast, cheap filter to handle the easy stuff.
- It uses a smart, powerful brain only for the hard stuff.
- It trains them in a way so they don't argue with each other.
The result? A smarter, faster, and cheaper recommendation system that makes users happier.