Beyond Surrogates: A Quantitative Analysis for Inter-Metric Relationships

This paper proposes a unified theoretical framework that quantifies the relationships between different evaluation metrics. Using Bayes-Optimal Sets and Regret Transfer, it bridges the gap between offline validation and online performance by characterizing the structural asymmetry behind metric mismatch.

Yuanhao Pu, Defu Lian, Enhong Chen

Published 2026-03-10

Imagine you are a chef trying to cook the perfect meal for a restaurant.

The Problem: The "Taste Test" vs. The "Customer Review"
In the world of machine learning, the "chef" is the AI model, and the "meal" is the prediction it makes (like recommending a movie or a product).

Usually, chefs have two ways to judge their cooking:

  1. The Surrogate Loss (The Practice Run): This is a quick, easy-to-measure test the chef does in the kitchen. It's like checking if the salt is evenly distributed or if the meat is cooked to the right temperature. It's mathematically smooth and easy to optimize.
  2. The Evaluation Metric (The Customer Review): This is the real goal. It's what the customer actually cares about. Did they enjoy the dish? Did they come back? In AI, this is things like "Click-Through Rate" (did people click the ad?) or "NDCG" (did we show the best items at the very top?).
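A tiny sketch (all numbers are made up for illustration) makes the two kinds of judgment concrete: a smooth surrogate loss versus a discrete metric. Notice that the model with the better (lower) surrogate loss can still fail the metric the customer actually feels:

```python
import math

labels = [1, 0, 1, 0]                 # 1 = relevant item
scores_a = [0.55, 0.45, 0.5, 0.5]     # model A: timid, but the top item is right
scores_b = [0.7, 0.72, 0.69, 0.2]     # model B: confident, but the top item is wrong

def log_loss(labels, scores):
    """Surrogate: smooth, differentiable, easy to optimize in the kitchen."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, scores)) / len(labels)

def precision_at_1(labels, scores):
    """Metric: did the single top-ranked item turn out to be relevant?"""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return labels[top]

# Model B gets the better surrogate score (≈0.556 vs ≈0.645)
# yet scores 0 on the metric, while model A scores 1.
for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, round(log_loss(labels, s), 3), precision_at_1(labels, s))
```

The surrogate rewards confident probabilities everywhere on the list; the metric only cares about the one item shown first.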

The "Metric Mismatch" Crisis
For years, AI researchers assumed: "If I get my practice run score (Surrogate) perfect, the customer review (Metric) will automatically be perfect too."

But in the real world, this often fails. A chef might perfectly distribute the salt (perfect practice score) but serve the dish on a burnt plate (terrible customer experience). In AI, an algorithm might get a high "AUC" score (a common practice metric) but fail miserably at showing the most relevant items at the very top of a list, causing users to leave the app.

This paper is like a detective story that explains why this mismatch happens and gives us a map to fix it.


The Three Types of Chefs (Metric Categories)

The authors realized that not all "customer reviews" are the same. They grouped them into three distinct families:

  1. The Pointwise Chef (The "Yes/No" Judge):

    • Analogy: This chef looks at each ingredient individually. "Is this tomato ripe? Yes. Is this onion fresh? Yes."
    • AI Example: Accuracy. Did the model guess "Yes" or "No" correctly for each item?
    • The Flaw: It doesn't care about the order. If you have 100 perfect tomatoes, it doesn't matter if you put the best one first or last; the score is the same.
  2. The Pairwise Chef (The "Versus" Judge):

    • Analogy: This chef compares two ingredients at a time. "Is this tomato better than that onion?"
    • AI Example: AUC. It checks if the model generally knows which items are better than others.
    • The Flaw: It cares about the relationship between items, but it treats every pair equally. It doesn't care if the "best" item is at position #1 or position #100, as long as it's ranked higher than the "worst" item.
  3. The Listwise Chef (The "Top-Heavy" Judge):

    • Analogy: This chef is obsessed with the top of the plate. "The first bite is the most important! If the best item isn't right at the front, the whole meal is a failure."
    • AI Example: NDCG, MRR. These metrics scream if the best item isn't in the top 3.
    • The Reality: This is usually what businesses actually want (users only look at the top results).
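The three judges can be put side by side on a toy five-item list (a pure-Python sketch with illustrative numbers). Here the ranking order is perfect, so the pairwise and listwise judges give full marks, while the pointwise judge, which looks at each score in isolation against a threshold, docks the low-confidence relevant item:

```python
import math

labels = [1, 0, 1, 0, 0]               # 1 = relevant
scores = [0.3, 0.2, 0.9, 0.1, 0.05]    # order is perfect, confidence is not

def accuracy(labels, scores, thresh=0.5):
    # Pointwise: judge each item alone against a threshold.
    return sum((s >= thresh) == bool(y) for y, s in zip(labels, scores)) / len(labels)

def auc(labels, scores):
    # Pairwise: fraction of (relevant, irrelevant) pairs ranked correctly.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

def ndcg(labels, scores):
    # Listwise: position-discounted gain, normalized by the ideal ordering.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(y / math.log2(rank + 2)
                for rank, y in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

# Accuracy = 0.8 (the 0.3-score relevant item "fails" the threshold),
# while AUC = 1.0 and NDCG = 1.0 because the order is flawless.
print(accuracy(labels, scores), auc(labels, scores), ndcg(labels, scores))
```

The same list, three verdicts: only the pointwise judge complains, and only about something the customer never sees.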

The Big Discovery: The "One-Way Street"

The paper uses a concept called Regret Transfer to explain what happens when you optimize for one type of chef but try to please another. Think of "Regret" as "how much the customer is unhappy."

1. The "Pointwise Trap" (Why Accuracy Fails)

The authors proved that if you train your AI to be a perfect Pointwise Chef (just getting Yes/No right), you cannot guarantee a good Listwise Chef (Top-Heavy) result.

  • The Metaphor: Imagine you have a bag of red and blue marbles. If you just sort them into "Red Pile" and "Blue Pile" (Pointwise), you might accidentally put the shiniest red marble at the very bottom of the pile.
  • The Result: You get a perfect "Red/Blue" score, but the customer (who only looks at the top) sees a dull blue marble first and leaves. The paper calls this "Pointwise Transfer Failure."
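The trap can be sketched with hypothetical graded relevance, where grade 3 is the "shiniest marble." Two models classify every item correctly, so their pointwise score is identical, but the one that buries the best item pays a steep listwise price (NDCG here uses the common 2^grade − 1 gain):

```python
import math

grades = [3, 1, 0, 0]                       # graded relevance; 3 = shiniest
binary = [1 if g > 0 else 0 for g in grades]

scores_a = [0.9, 0.8, 0.2, 0.1]  # shiniest item ranked first
scores_b = [0.8, 0.9, 0.2, 0.1]  # shiniest item buried below a weaker positive

def accuracy(labels, scores, thresh=0.5):
    # Pointwise: every item on the correct side of the threshold.
    return sum((s >= thresh) == bool(y) for y, s in zip(labels, scores)) / len(labels)

def ndcg(grades, scores):
    # Listwise: exponential gain, position-discounted, normalized by the ideal list.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum((2 ** grades[i] - 1) / math.log2(r + 2) for r, i in enumerate(order))
    ideal = sum((2 ** g - 1) / math.log2(r + 2)
                for r, g in enumerate(sorted(grades, reverse=True)))
    return dcg / ideal

# Both models: accuracy = 1.0 (a perfect "Red/Blue" score).
# Model A: NDCG = 1.0.  Model B: NDCG ≈ 0.71 — same pointwise score, worse plate.
for s in (scores_a, scores_b):
    print(accuracy(binary, s), round(ndcg(grades, s), 2))
```

The pointwise judge literally cannot see the difference between the two models; the listwise judge sees a 29% drop.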

2. The "Listwise Safety Net"

Conversely, if you train your AI to be a perfect Listwise Chef (getting the top items right), you automatically become a decent Pointwise Chef.

  • The Metaphor: If you arrange your marbles so the shiniest ones are at the very top, you have inherently sorted the reds and blues correctly enough to pass the simple "Red/Blue" test.
  • The Result: Optimizing for the "Top" is a safer bet than optimizing for the "Average."

3. The "Pairwise vs. Listwise" Asymmetry

Here is the most surprising finding. Even though Pairwise (AUC) and Listwise (NDCG) seem similar, they are not equal partners.

  • The Metaphor: Imagine a race.
    • Pairwise (AUC) is like checking if the winner finished before the loser. It doesn't matter if the winner finished 1 second ahead or 1 hour ahead.
    • Listwise (NDCG) is like a Formula 1 race where the prize money drops drastically after 1st place.
  • The Finding: If you optimize for the "Race" (Listwise), you usually get a good "Finish Order" (Pairwise). But if you only optimize for the "Finish Order" (Pairwise), a near-perfect pairwise score can still leave the winner in 10th place, which is a disaster for the Listwise goal.
  • The Scale Effect: The paper shows that as your list of items gets huge (like millions of products), the gap between these two gets massive. A tiny improvement in AUC might mean zero improvement in the top results, while a tiny improvement in NDCG guarantees a huge win.
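The scale effect can be sketched with a deliberately simplified setup (an assumption for illustration, not the paper's exact construction): one relevant item stuck at position k = 10 among n total items. As the catalog grows, AUC creeps toward a perfect 1.0 while NDCG stays stuck, because the listwise judge only looks at the top of the list:

```python
import math

def auc_single_positive(n, k):
    # Pairwise: fraction of the n-1 negatives ranked below the single positive.
    return (n - k) / (n - 1)

def ndcg_single_positive(k):
    # Listwise: only the positive contributes gain; the ideal DCG is 1 (rank 1).
    return 1 / math.log2(k + 1)

k = 10  # the single relevant item sits at position 10
for n in [100, 10_000, 1_000_000]:
    # AUC climbs 0.909 → 0.999 → 0.99999 as n grows;
    # NDCG stays ≈ 0.289 no matter how big the catalog gets.
    print(n, round(auc_single_positive(n, k), 5), round(ndcg_single_positive(k), 3))
```

On a million-item catalog, the pairwise judge calls this ranking essentially perfect; the listwise judge calls it a failure, and no amount of catalog growth changes its verdict.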

The Takeaway for the Real World

This paper provides a theoretical map for AI engineers and business leaders.

  • Stop relying on "Practice Scores": Just because your model gets a high "AUC" or "Accuracy" score in the lab doesn't mean it will work in the real world.
  • Optimize for the "Top": If your business goal is to get users to click the first thing they see (like a search engine or a feed), you must train your model using Listwise metrics (like NDCG), not just simple classification metrics.
  • Expect the Gap: If you see your offline metrics (AUC) going up but your online clicks staying flat, the paper explains why: you are likely optimizing for the wrong "Chef."

In short: Don't just train your AI to be "generally correct." Train it to be "perfectly right at the very top," because that's the only place your customers are looking.