MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis

Here is an explanation of the paper "MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis" using simple language and creative analogies.

The Big Picture: The "Judge" Problem

Imagine you have a massive library of stories (answers) written by different authors (answerers) in response to different questions. You want to know how good these stories are.

Traditionally, you'd hire a human expert to read every single story and give it a grade. But that takes forever and costs a fortune.

So, instead, you hire a Robot Judge (an AI, or LLM). You ask the Robot Judge to read the stories and give them a score from 1 to 4. This is called "LLM-as-a-Judge."

The Problem:

It's expensive: If you have 50 questions, 50 authors, and 50 different Robot Judges, you have to run the Robot Judge $50 \times 50 \times 50 = 125{,}000$ times. That's a lot of computer power!
The Judges are biased: Robot Judges aren't perfect. Some might be grumpy, some might be too nice, and some might just really like stories written by robots that sound like them. This is called "bias."

The Solution: Finding the "Typical" Examples

The authors of this paper asked: Can we look at the scores we already have, find patterns, and predict the rest without running the Robot Judge 125,000 times?

They realized the scores aren't random. They form a 3D grid (a Tensor):

Layer 1: The Questions.
Layer 2: The Authors (Answerers).
Layer 3: The Judges (Evaluators).

If you look at the grid, you might see that "Questions about math" always get high scores from "Math-loving Judges," while "Questions about poetry" get low scores from those same judges. The data has hidden blocks or clusters.
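To make the "3D grid with hidden blocks" idea concrete, here is a minimal sketch that simulates such a grid. Everything in it (the function name `make_block_tensor`, the noise level, the random block means) is a made-up illustration, not data or code from the paper.

```python
import numpy as np

def make_block_tensor(dims=(50, 50, 50), n_clusters=(5, 5, 5), noise=0.3, seed=0):
    """Simulate a question x author x judge score grid with hidden blocks."""
    rng = np.random.default_rng(seed)
    # Hidden group label for every question, author, and judge.
    labels = [rng.integers(0, c, size=d) for d, c in zip(dims, n_clusters)]
    # One underlying score level per (question group, author group, judge group).
    block_means = rng.uniform(1.0, 4.0, size=n_clusters)
    # Expand the block levels to the full grid, add noise, keep the 1-4 scale.
    clean = block_means[np.ix_(*labels)]
    scores = np.clip(clean + rng.normal(0.0, noise, size=dims), 1.0, 4.0)
    return scores, labels

Y, labels = make_block_tensor()
print(Y.shape)  # (50, 50, 50)
```

If you sorted each axis of `Y` by its hidden labels, the grid would look like a stack of roughly uniform bricks, which is the pattern a tensor clustering method tries to recover.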

The New Tool: MultiwayPAM

The authors built a new tool called MultiwayPAM. To understand it, let's use an analogy.

The Analogy: The "Best Representative" Party

Imagine you are organizing a massive party with 150 guests (50 questions, 50 authors, 50 judges). Within each of the three categories, you want to sort the guests into 5 groups based on how they interact.

Old Method (Standard Clustering):
You might say, "Group 1 is the average of everyone in this group." But an "average" person doesn't exist. You can't invite "Average Person" to the party to represent the group.

MultiwayPAM Method (Medoids):
Instead of an average, MultiwayPAM picks a Medoid.

A Medoid is a real person from the group who is the most "typical" or "central" example.
If Group 1 is "People who love spicy food," the Medoid isn't a theoretical average; it's Bob, a real guest who eats the hottest wings and represents the group perfectly.
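The difference between an average and a Medoid can be shown with a tiny sketch. The guest ratings below are invented for illustration, and `medoid_index` is a hypothetical helper name.

```python
import numpy as np

def medoid_index(points):
    """Index of the row with the smallest summed Euclidean distance to all rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    return int(np.argmin(dist.sum(axis=1)))

# Hypothetical ratings that four guests gave to three spicy dishes.
guests = np.array([[4.0, 4.0, 3.0],
                   [4.0, 3.0, 3.0],
                   [3.0, 4.0, 4.0],
                   [1.0, 4.0, 4.0]])

mean_guest = guests.mean(axis=0)   # an "average person" who is not on the guest list
m = medoid_index(guests)           # a real guest who sits closest to everyone else
print(mean_guest, m, guests[m])
```

The mean row matches none of the guests, while the medoid is always one of the rows you started with, which is why you can look it up and inspect it.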

How MultiwayPAM Works:

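Pick a few "Captains" (Medoids): It starts by picking a handful of Questions, Authors, and Judges (one per group) to be the "Captains" of their respective groups. These first picks are made greedily, not at random: at each step it chooses the candidate with the smallest total dissimilarity to the others.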
Assign the Crew: It asks, "Who is most like Captain Bob?" and assigns them to his group.
Swap and Improve: It tries swapping captains. "What if we swap Captain Bob with Captain Alice? Does that make the groups more accurate?"
Repeat: It keeps swapping until it finds the best possible set of Captains that represent the whole party.

Why is this cool?

Once MultiwayPAM finds these "Captains" (Medoids), you get two huge benefits:

Savings (The "Lazy" Benefit): You may not need to ask the Robot Judge to grade every single combination. In principle, you only need to grade the combinations involving the Captains. Because the Captains represent their groups, their scores can stand in as estimates for the rest of the party, which could save a large amount of compute.
Understanding Bias (The "Detective" Benefit): By looking at who the Captains are, you understand the bias.
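- Example from the paper: A group of judges represented by a "Nurse" Robot Judge (the Medoid, a nurse persona worried about military danger) gave low scores to the group of questions represented by one about navigating a physical environment. A plausible reading is that this persona is especially sensitive to safety.
- Example 2: A group of judges represented by a "Sports Fan" Robot Judge (a Trident F.C. fan) gave high scores to a group of questions represented by one about water intake.
- The Insight: Instead of just seeing a number, you can point to a concrete persona and a concrete question and say, "Ah, this kind of Judge tends to score this kind of question this way."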
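The size of the potential saving is simple arithmetic, sketched below. It assumes the Captains are already known and that only Captain-with-Captain combinations get graded; the explanation above does not spell out the exact procedure, so treat the numbers as an upper bound on the saving. The helper name `judge_calls` is made up.

```python
from math import prod

def judge_calls(sizes):
    """Number of Robot Judge runs needed to grade every combination."""
    return prod(sizes)

full_grid = judge_calls((50, 50, 50))    # every question x author x judge
captains_only = judge_calls((5, 5, 5))   # only Captain x Captain x Captain
print(full_grid, captains_only, full_grid // captains_only)  # 125000 125 1000
```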

The Results

The authors tested this on two real datasets (Truthy and Emerton).

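Accuracy: When the scores were rebuilt from the representatives alone, their method (MultiwayPAM) had lower error than the baseline it was compared against, a tensor block model built on averages. The baseline was slightly better when each group was summarized by its average instead.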
Interpretability: They could point to a specific question or answer and say, "This is the perfect example of how this group behaves."

Summary

Think of MultiwayPAM as a smart party planner. Instead of interviewing every single guest to understand the crowd, it finds the 5 most representative people for each category (Questions, Authors, Judges).

By studying just these few "Representatives," it can:

Predict how the whole crowd will react (saving money).
Explain why they react that way (revealing bias).

It turns a messy, expensive wall of data into a clean, understandable map of "Who likes what, and why."

Here is a detailed technical summary of the paper "MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis."

1. Problem Statement

The paper addresses two critical challenges in using LLM-as-a-Judge for text evaluation:

Computational Cost: Evaluating a large number of texts across various combinations of questions, answerers (generators), and evaluators (LLM personas) requires $d_1 \times d_2 \times d_3$ inference steps, which is prohibitively expensive.
Evaluator Bias: LLM evaluators exhibit inherent biases (e.g., self-enhancement bias where an LLM favors its own outputs). Understanding the structure of this bias is crucial for mitigation, but existing methods struggle to interpret complex bias patterns across multiple dimensions.

The authors propose modeling the evaluation scores as a 3-mode tensor (Questions $\times$ Answerers $\times$ Evaluators) and applying tensor clustering to reveal latent block structures. However, standard tensor clustering methods often produce clusters that are difficult to interpret because they lack representative "prototypes."

2. Methodology: MultiwayPAM

The authors propose MultiwayPAM (Multiway Partitioning Around Medoids), a novel tensor clustering algorithm that extends the classic Partitioning Around Medoids (PAM) algorithm from vector data to multi-way tensors.

Core Concept

Unlike methods that cluster based on centroids (means), MultiwayPAM simultaneously estimates:

Cluster Membership: Which indices belong to which cluster for each mode.
Medoids: The actual observed data points (indices) that best represent each cluster. This allows for direct interpretation of the cluster composition by examining the specific questions, answerers, or evaluators selected as medoids.

Algorithm Structure

The method operates in two phases to minimize the dissimilarity between the original score tensor $Y$ and a reconstructed medoid tensor $\hat{Y}$:

BUILD Algorithm (Initialization):
- Greedily selects initial medoids for each mode independently.
- For each mode, it iteratively selects the index that minimizes the sum of dissimilarities (Euclidean distance) to all other slices of the tensor.
- Assigns initial cluster memberships based on the nearest medoid.
SWAP Algorithm (Optimization):
- Iteratively updates the medoid list and membership list to find a local optimum.
- For each mode, it attempts to swap a current medoid with a non-medoid index.
- It calculates the resulting dissimilarity score for the entire tensor if the swap occurs.
- If a swap reduces the global dissimilarity, the change is accepted.
- The process repeats until no further improvement can be made across any mode.
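The two phases above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: BUILD follows the classic greedy PAM recipe on flattened mode slices, memberships are always reassigned to the nearest medoid, and SWAP accepts any single swap that lowers the global Euclidean error. All function names are hypothetical.

```python
import numpy as np

def pairwise(S):
    """Euclidean distances between the rows of S."""
    sq = (S ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (S @ S.T)
    D = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(D, 0.0)
    return D

def mode_slices(Y, mode):
    """Row i is the flattened slice of Y for index i of the given mode."""
    return np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)

def build(D, k):
    """Greedy BUILD (classic PAM style) on one mode's distance matrix."""
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)
        # Total distance to the nearest medoid if candidate j were added.
        cost = np.minimum(nearest[:, None], D).sum(axis=0)
        cost[medoids] = np.inf
        medoids.append(int(np.argmin(cost)))
    return medoids

def assign(D, medoids):
    """Cluster label of each index = position of its nearest medoid."""
    return np.argmin(D[:, medoids], axis=1)

def reconstruct(Y, medoids, members):
    """Medoid tensor: each entry copies the score at its clusters' medoids."""
    idx = [np.asarray(m)[g] for m, g in zip(medoids, members)]
    return Y[np.ix_(*idx)]

def loss(Y, medoids, members):
    return float(np.linalg.norm(Y - reconstruct(Y, medoids, members)))

def multiway_pam(Y, n_clusters, max_sweeps=20):
    dists = [pairwise(mode_slices(Y, m)) for m in range(Y.ndim)]
    medoids = [build(D, k) for D, k in zip(dists, n_clusters)]
    members = [assign(D, m) for D, m in zip(dists, medoids)]
    best = loss(Y, medoids, members)
    for _ in range(max_sweeps):
        improved = False
        for mode, D in enumerate(dists):
            for c in range(n_clusters[mode]):
                for j in range(Y.shape[mode]):
                    if j in medoids[mode]:
                        continue
                    # Try replacing medoid c of this mode with index j.
                    trial_m = [list(m) for m in medoids]
                    trial_m[mode][c] = j
                    trial_g = list(members)
                    trial_g[mode] = assign(D, trial_m[mode])
                    trial = loss(Y, trial_m, trial_g)
                    if trial < best - 1e-12:
                        medoids, members, best = trial_m, trial_g, trial
                        improved = True
        if not improved:
            break
    return medoids, members, best

# Noise-free toy tensor: 12 indices per mode, 3 hidden clusters per mode.
toy_labels = np.arange(12) % 3
toy_blocks = np.arange(27, dtype=float).reshape(3, 3, 3)
Y_demo = toy_blocks[np.ix_(toy_labels, toy_labels, toy_labels)]
medoids, members, best = multiway_pam(Y_demo, (3, 3, 3))
print(medoids, best)
```

Because a swap is accepted only when the global error strictly decreases, the loop must stop after finitely many swaps, but it is only guaranteed to reach a local optimum, as the description above notes.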

Mathematical Objective:
Minimize $D(Y, \hat{Y}) = \| Y - \hat{Y} \|_2$, where $\hat{Y}$ is constructed such that every entry $\hat{y}_{i_1 \dots i_K}$ is the value of the original tensor at the medoid indices corresponding to the cluster memberships of $i_1, \dots, i_K$.
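Written out entry by entry, the objective might look as follows. The symbols $g_k(i)$ (the cluster that index $i$ of mode $k$ belongs to) and $m^{(k)}_c$ (the medoid index of cluster $c$ in mode $k$) are notation assumed here for illustration, and the norm is read as the entrywise Euclidean norm.

$$
\hat{y}_{i_1 \dots i_K} = y_{\,m^{(1)}_{g_1(i_1)} \dots\, m^{(K)}_{g_K(i_K)}}, \qquad
D(Y, \hat{Y}) = \sqrt{\sum_{i_1=1}^{d_1} \cdots \sum_{i_K=1}^{d_K} \left( y_{i_1 \dots i_K} - \hat{y}_{i_1 \dots i_K} \right)^2}
$$

For $K = 3$, this means the reconstructed score for a given question, answerer, and evaluator is simply the observed score at the medoid question, medoid answerer, and medoid evaluator of their respective clusters.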

3. Key Contributions

Novel Algorithm: Introduction of MultiwayPAM, the first tensor clustering method specifically designed to estimate both block structures and representative medoids simultaneously for multi-way data.
Interpretability: By outputting medoids (actual data indices) rather than abstract centroids, the method allows researchers to directly inspect the "typical" questions, answerers, or evaluators driving specific score patterns.
Bias Structure Analysis: The method provides a framework to answer questions like: "Do similar LLM personas (answerers/evaluators) produce similar scores?" or "Do specific types of questions trigger specific evaluator biases?"

4. Experimental Results

The authors evaluated MultiwayPAM on two datasets: Truthy-DPO-v0.1 and Emerton-DPO-Pairs-Judge.

Setup:
- $d_1 = 50$ Questions.
- $d_2 = 50$ Answerer Personas.
- $d_3 = 50$ Evaluator Personas.
- Scores were generated using GPT-4o mini on a 1–4 scale.
- The number of clusters per mode was set to $c = [5, 5, 5]$.
Findings:
- Interpretability: The medoids revealed clear semantic patterns.
  - Example (Truthy): A specific evaluator cluster (Medoid E14: "A nurse worried about military danger") gave low scores to a question cluster (Medoid Q6: "Navigate physical environment"), while another cluster (Medoid E22: "Trident F.C. fan") gave high scores to a different question cluster (Medoid Q11: "Water intake").
  - Example (Emerton): Score variations were primarily driven by question difficulty. One question cluster (Medoid Q11, stream-of-consciousness) received low scores universally, while another (Medoid Q40, logical entailment) received high scores.
- Performance Comparison:
  - Compared against a baseline Tensor Block Model (TBM) which uses k-means and centroids.
  - RMSE-M (Medoid Error): MultiwayPAM achieved lower error (0.714 vs 0.783 for Truthy; 0.523 vs 0.570 for Emerton), indicating better reconstruction of the original data using actual observed points.
  - RMSE-C (Centroid Error): TBM performed slightly better in terms of centroid approximation, which is expected as centroids minimize squared error by definition. However, MultiwayPAM's superior medoid approximation highlights its utility for selecting representative samples.
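The relationship between the two error measures can be illustrated with a short sketch. It assumes RMSE-M compares the tensor with a reconstruction that copies medoid entries, and RMSE-C compares it with a reconstruction that uses block means; the summary above does not give the exact definitions, so this is one plausible reading. The data, memberships, and medoids are arbitrary toy values.

```python
import numpy as np

def rmse(A, B):
    return float(np.sqrt(np.mean((A - B) ** 2)))

def medoid_approx(Y, medoids, members):
    """Each entry copies the score at the medoids of its clusters."""
    idx = [np.asarray(m)[g] for m, g in zip(medoids, members)]
    return Y[np.ix_(*idx)]

def centroid_approx(Y, members, n_clusters):
    """Each entry is replaced by the mean of the block it falls in."""
    sums = np.zeros(n_clusters)
    counts = np.zeros(n_clusters)
    grid = np.ix_(*members)
    np.add.at(sums, grid, Y)       # accumulate scores per block
    np.add.at(counts, grid, 1.0)   # count entries per block
    means = sums / np.maximum(counts, 1.0)
    return means[grid]

rng = np.random.default_rng(0)
Y_toy = rng.uniform(1.0, 4.0, size=(6, 6, 6))
members = [np.arange(6) % 2 for _ in range(3)]   # two clusters per mode
medoids = [[0, 1] for _ in range(3)]             # index 0 -> cluster 0, index 1 -> cluster 1

rmse_m = rmse(Y_toy, medoid_approx(Y_toy, medoids, members))
rmse_c = rmse(Y_toy, centroid_approx(Y_toy, members, (2, 2, 2)))
print(rmse_m, rmse_c)
```

For a fixed partition, the block mean is the constant that minimizes squared error within each block, so `rmse_c` can never exceed `rmse_m` here. In the reported comparison the two methods also produce different partitions, so the sketch explains the direction of the RMSE-C result but not its size.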

5. Significance and Future Work

Significance: MultiwayPAM offers a practical solution to the "black box" nature of LLM evaluation. By identifying medoids, researchers can pinpoint exactly which specific personas or questions cause bias, enabling targeted prompt engineering or model fine-tuning to mitigate these issues. It also offers a potential pathway to reduce inference costs by predicting scores for unobserved combinations based on the identified block structure.
Limitations & Future Directions:
- Cluster Size: The current method requires the number of clusters ($c$) to be predefined. Future work should focus on automatically determining the optimal number of blocks.
- Semantic Similarity: The current dissimilarity metric is purely numerical (Euclidean distance). Future iterations could incorporate semantic similarity (e.g., via embeddings) to ensure that medoids are not just numerically representative but also semantically coherent within their clusters.

In conclusion, MultiwayPAM bridges the gap between high-dimensional tensor analysis and human-interpretable insights, providing a robust tool for diagnosing and understanding the complex biases inherent in LLM-as-a-Judge systems.
