Beyond Surrogates: A Quantitative Analysis for Inter-Metric Relationships

This paper proposes a unified theoretical framework that quantifies the relationships between different evaluation metrics. Using Bayes-Optimal Sets and Regret Transfer, it bridges the gap between offline validation and online performance by characterizing the structural asymmetry behind metric mismatch.

Yuanhao Pu, Defu Lian, Enhong Chen

Published 2026-03-10

Imagine you are a chef trying to cook the perfect meal for a restaurant.

The Problem: The "Taste Test" vs. The "Customer Review"
In the world of machine learning, the "chef" is the AI model, and the "meal" is the prediction it makes (like recommending a movie or a product).

Usually, chefs have two ways to judge their cooking:

  1. The Surrogate Loss (The Practice Run): This is a quick, easy-to-measure test the chef does in the kitchen. It's like checking if the salt is evenly distributed or if the meat is cooked to the right temperature. It's mathematically smooth and easy to optimize.
  2. The Evaluation Metric (The Customer Review): This is the real goal. It's what the customer actually cares about. Did they enjoy the dish? Did they come back? In AI, this is things like "Click-Through Rate" (did people click the ad?) or "NDCG" (did we show the best items at the very top?).
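A tiny sketch (all numbers are made up for illustration) makes the two kinds of judgment concrete: a smooth surrogate loss versus a discrete metric. Notice that the model with the better (lower) surrogate loss can still fail the metric the customer actually feels:

```python
import math

labels = [1, 0, 1, 0]                 # 1 = relevant item
scores_a = [0.55, 0.45, 0.5, 0.5]     # model A: timid, but the top item is right
scores_b = [0.7, 0.72, 0.69, 0.2]     # model B: confident, but the top item is wrong

def log_loss(labels, scores):
    """Surrogate: smooth, differentiable, easy to optimize in the kitchen."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, scores)) / len(labels)

def precision_at_1(labels, scores):
    """Metric: did the single top-ranked item turn out to be relevant?"""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return labels[top]

# Model B gets the better surrogate score (≈0.556 vs ≈0.645)
# yet scores 0 on the metric, while model A scores 1.
for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, round(log_loss(labels, s), 3), precision_at_1(labels, s))
```

The surrogate rewards confident probabilities everywhere on the list; the metric only cares about the one item shown first.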

The "Metric Mismatch" Crisis
For years, AI researchers assumed: "If I get my practice run score (Surrogate) perfect, the customer review (Metric) will automatically be perfect too."

But in the real world, this often fails. A chef might perfectly distribute the salt (perfect practice score) but serve the dish on a burnt plate (terrible customer experience). In AI, an algorithm might get a high "AUC" score (a common practice metric) but fail miserably at showing the most relevant items at the very top of a list, causing users to leave the app.

This paper is like a detective story that explains why this mismatch happens and gives us a map to fix it.


The Three Types of Chefs (Metric Categories)

The authors realized that not all "customer reviews" are the same. They grouped them into three distinct families:

  1. The Pointwise Chef (The "Yes/No" Judge):

    • Analogy: This chef looks at each ingredient individually. "Is this tomato ripe? Yes. Is this onion fresh? Yes."
    • AI Example: Accuracy. Did the model guess "Yes" or "No" correctly for each item?
    • The Flaw: It doesn't care about the order. If you have 100 perfect tomatoes, it doesn't matter if you put the best one first or last; the score is the same.
  2. The Pairwise Chef (The "Versus" Judge):

    • Analogy: This chef compares two ingredients at a time. "Is this tomato better than that onion?"
    • AI Example: AUC. It checks if the model generally knows which items are better than others.
    • The Flaw: It cares about the relationship between items, but it treats every pair equally. It doesn't care if the "best" item is at position #1 or position #100, as long as it's ranked higher than the "worst" item.
  3. The Listwise Chef (The "Top-Heavy" Judge):

    • Analogy: This chef is obsessed with the top of the plate. "The first bite is the most important! If the best item isn't right at the front, the whole meal is a failure."
    • AI Example: NDCG, MRR. These metrics scream if the best item isn't in the top 3.
    • The Reality: This is usually what businesses actually want (users only look at the top results).
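The three judges can be put side by side on a toy five-item list (a pure-Python sketch with illustrative numbers). Here the ranking order is perfect, so the pairwise and listwise judges give full marks, while the pointwise judge, which looks at each score in isolation against a threshold, docks the low-confidence relevant item:

```python
import math

labels = [1, 0, 1, 0, 0]               # 1 = relevant
scores = [0.3, 0.2, 0.9, 0.1, 0.05]    # order is perfect, confidence is not

def accuracy(labels, scores, thresh=0.5):
    # Pointwise: judge each item alone against a threshold.
    return sum((s >= thresh) == bool(y) for y, s in zip(labels, scores)) / len(labels)

def auc(labels, scores):
    # Pairwise: fraction of (relevant, irrelevant) pairs ranked correctly.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

def ndcg(labels, scores):
    # Listwise: position-discounted gain, normalized by the ideal ordering.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(y / math.log2(rank + 2)
                for rank, y in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

# Accuracy = 0.8 (the 0.3-score relevant item "fails" the threshold),
# while AUC = 1.0 and NDCG = 1.0 because the order is flawless.
print(accuracy(labels, scores), auc(labels, scores), ndcg(labels, scores))
```

The same list, three verdicts: only the pointwise judge complains, and only about something the customer never sees.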

The Big Discovery: The "One-Way Street"

The paper uses a concept called Regret Transfer to explain what happens when you optimize for one type of chef but try to please another. Think of "Regret" as "how much the customer is unhappy."

1. The "Pointwise Trap" (Why Accuracy Fails)

The authors proved that if you train your AI to be a perfect Pointwise Chef (just getting Yes/No right), you cannot guarantee a good Listwise Chef (Top-Heavy) result.

  • The Metaphor: Imagine you have a bag of red and blue marbles. If you just sort them into "Red Pile" and "Blue Pile" (Pointwise), you might accidentally put the shiniest red marble at the very bottom of the pile.
  • The Result: You get a perfect "Red/Blue" score, but the customer (who only looks at the top) sees a dull blue marble first and leaves. The paper calls this "Pointwise Transfer Failure."
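The trap can be sketched with hypothetical graded relevance, where grade 3 is the "shiniest marble." Two models classify every item correctly, so their pointwise score is identical, but the one that buries the best item pays a steep listwise price (NDCG here uses the common 2^grade − 1 gain):

```python
import math

grades = [3, 1, 0, 0]                       # graded relevance; 3 = shiniest
binary = [1 if g > 0 else 0 for g in grades]

scores_a = [0.9, 0.8, 0.2, 0.1]  # shiniest item ranked first
scores_b = [0.8, 0.9, 0.2, 0.1]  # shiniest item buried below a weaker positive

def accuracy(labels, scores, thresh=0.5):
    # Pointwise: every item on the correct side of the threshold.
    return sum((s >= thresh) == bool(y) for y, s in zip(labels, scores)) / len(labels)

def ndcg(grades, scores):
    # Listwise: exponential gain, position-discounted, normalized by the ideal list.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum((2 ** grades[i] - 1) / math.log2(r + 2) for r, i in enumerate(order))
    ideal = sum((2 ** g - 1) / math.log2(r + 2)
                for r, g in enumerate(sorted(grades, reverse=True)))
    return dcg / ideal

# Both models: accuracy = 1.0 (a perfect "Red/Blue" score).
# Model A: NDCG = 1.0.  Model B: NDCG ≈ 0.71 — same pointwise score, worse plate.
for s in (scores_a, scores_b):
    print(accuracy(binary, s), round(ndcg(grades, s), 2))
```

The pointwise judge literally cannot see the difference between the two models; the listwise judge sees a 29% drop.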

2. The "Listwise Safety Net"

Conversely, if you train your AI to be a perfect Listwise Chef (getting the top items right), you automatically become a decent Pointwise Chef.

  • The Metaphor: If you arrange your marbles so the shiniest ones are at the very top, you have inherently sorted the reds and blues correctly enough to pass the simple "Red/Blue" test.
  • The Result: Optimizing for the "Top" is a safer bet than optimizing for the "Average."

3. The "Pairwise vs. Listwise" Asymmetry

Here is the most surprising finding. Even though Pairwise (AUC) and Listwise (NDCG) seem similar, they are not equal partners.

  • The Metaphor: Imagine a race.
    • Pairwise (AUC) is like checking if the winner finished before the loser. It doesn't matter if the winner finished 1 second ahead or 1 hour ahead.
    • Listwise (NDCG) is like a Formula 1 race where the prize money drops drastically after 1st place.
  • The Finding: If you optimize for the "Race" (Listwise), you usually get a good "Finish Order" (Pairwise). But if you only optimize for the "Finish Order" (Pairwise), a near-perfect pairwise score can still leave the winner in 10th place, which is a disaster for the Listwise goal.
  • The Scale Effect: The paper shows that as your list of items gets huge (like millions of products), the gap between these two gets massive. A tiny improvement in AUC might mean zero improvement in the top results, while a tiny improvement in NDCG guarantees a huge win.
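The scale effect can be sketched with a deliberately simplified setup (an assumption for illustration, not the paper's exact construction): one relevant item stuck at position k = 10 among n total items. As the catalog grows, AUC creeps toward a perfect 1.0 while NDCG stays stuck, because the listwise judge only looks at the top of the list:

```python
import math

def auc_single_positive(n, k):
    # Pairwise: fraction of the n-1 negatives ranked below the single positive.
    return (n - k) / (n - 1)

def ndcg_single_positive(k):
    # Listwise: only the positive contributes gain; the ideal DCG is 1 (rank 1).
    return 1 / math.log2(k + 1)

k = 10  # the single relevant item sits at position 10
for n in [100, 10_000, 1_000_000]:
    # AUC climbs 0.909 → 0.999 → 0.99999 as n grows;
    # NDCG stays ≈ 0.289 no matter how big the catalog gets.
    print(n, round(auc_single_positive(n, k), 5), round(ndcg_single_positive(k), 3))
```

On a million-item catalog, the pairwise judge calls this ranking essentially perfect; the listwise judge calls it a failure, and no amount of catalog growth changes its verdict.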

The Takeaway for the Real World

This paper provides a theoretical map for AI engineers and business leaders.

  • Stop relying on "Practice Scores": Just because your model gets a high "AUC" or "Accuracy" score in the lab doesn't mean it will work in the real world.
  • Optimize for the "Top": If your business goal is to get users to click the first thing they see (like a search engine or a feed), you must train your model using Listwise metrics (like NDCG), not just simple classification metrics.
  • Expect the Gap: If you see your offline metrics (AUC) going up but your online clicks staying flat, the paper explains why: you are likely optimizing for the wrong "Chef."

In short: Don't just train your AI to be "generally correct." Train it to be "perfectly right at the very top," because that's the only place your customers are looking.