CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

The paper introduces CARE, a confounder-aware aggregation framework that models the shared latent factors behind correlated errors in LLM-as-a-judge ensembles, separating true quality from shared bias and substantially improving evaluation accuracy without requiring ground-truth labels.

Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala

Published 2026-03-03

Imagine you are trying to judge the quality of a new restaurant dish. You don't have a professional food critic, so you ask a panel of 12 different friends to taste it and give you a score.

The Old Way (The Flawed System):
In the past, if you asked 12 friends, you'd just take the average of their scores. If 10 said "8/10" and 2 said "2/10," you'd average them out to a "7" and call it a day.

But here's the problem: Your friends aren't independent. They all went to the same cooking school, they all love spicy food, and they all hate it when plates are messy.

  • Friend A gives high scores to anything spicy.
  • Friend B gives high scores to anything with a lot of sauce.
  • Friend C gives high scores to anything served on a fancy plate.

If the dish is actually terrible but happens to be spicy, saucy, and on a fancy plate, all your friends will give it a high score. If you just average their scores, you get a terrible result: a "9/10" for a burnt, salty mess. The friends are "correlated" in their mistakes because they share the same "hidden biases" (confounders).
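To see this failure in numbers, here is a tiny NumPy simulation (my own toy model, not anything from the paper): twelve simulated friends whose scores mix the dish's true quality with one shared hidden factor. The plain average ends up tracking the hidden factor, not the quality:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_judges = 500, 12

# Toy model (illustrative only): every judge's score blends the item's
# true quality with one shared hidden factor ("spiciness").
quality    = rng.normal(size=n_items)            # what we want to measure
confounder = rng.normal(size=n_items)            # shared hidden bias
q_load = rng.uniform(0.3, 0.5, size=n_judges)    # sensitivity to quality
c_load = rng.uniform(0.8, 1.2, size=n_judges)    # sensitivity to the bias
scores = (quality[:, None] * q_load
          + confounder[:, None] * c_load
          + 0.1 * rng.normal(size=(n_items, n_judges)))

naive = scores.mean(axis=1)  # the "just average everyone" baseline
print("corr(average, quality)   :", round(np.corrcoef(naive, quality)[0, 1], 2))
print("corr(average, confounder):", round(np.corrcoef(naive, confounder)[0, 1], 2))
# The average correlates far more with the shared bias than with quality.
```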

The New Way (CARE):
The paper introduces CARE (Confounder-Aware Aggregation for Reliable LLM Evaluation). Think of CARE as a super-smart detective who doesn't just listen to the friends; it analyzes how they are thinking.

CARE realizes: "Wait a minute. These friends aren't just judging the food; they are all reacting to the same hidden things: the spice level, the sauce, and the plate."

Instead of blindly averaging, CARE does two things:

  1. Separates the Signal from the Noise: It figures out which part of the score is about the actual taste (the "True Quality") and which part is just about the spiciness or the fancy plate (the "Confounders").
  2. Ignores the Shared Bias: It essentially says, "Okay, Friend A loves spice, and Friend B loves sauce. Let's strip that out of their scores so we can see what they actually think about the flavor."

The Two Detective Tools

The paper offers two different "detective kits" depending on the situation:

  • CARE-SVD (The Pattern Spotter): Imagine looking at a giant spreadsheet of all your friends' scores. CARE-SVD looks for the biggest "wave" of agreement. If everyone agrees the food is spicy, that's a wave. If everyone agrees the food is good, that's another wave. This tool separates the "Good Food" wave from the "Spicy Food" wave mathematically, even without knowing the true answer beforehand (a toy sketch of this idea follows the list).
  • CARE-Tensor (The 3D Puzzle Solver): This is for when the data is trickier (like yes/no answers). Imagine your friends are split into three groups. CARE-Tensor looks at how Group A, Group B, and Group C interact together in a 3D puzzle. It's like solving a Rubik's cube where you can see how the colors shift when you turn different sides. This helps it find the "True Quality" piece even if the friends are confused by the same hidden tricks (a simplified sketch of the three-group idea also follows below).
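To make the "waves" concrete, here is a hand-rolled sketch on the same kind of toy data as above. It is not the paper's algorithm: it just takes the SVD of the centered score matrix, peeks at the simulated ground truth (which CARE, of course, does not have) to label the two dominant waves, and then strips the bias wave before averaging. In this toy the strongest wave happens to be the shared bias; identifying the right wave without labels is exactly the hard part CARE addresses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_judges = 500, 12
quality    = rng.normal(size=n_items)              # true taste
confounder = rng.normal(size=n_items)              # shared hidden factor
q_load = rng.uniform(0.3, 0.5, size=n_judges)      # sensitivity to quality
c_load = rng.uniform(0.8, 1.2, size=n_judges)      # sensitivity to the bias
scores = (quality[:, None] * q_load
          + confounder[:, None] * c_load
          + 0.1 * rng.normal(size=(n_items, n_judges)))

# Each left singular vector of the centered matrix is one "wave" of
# agreement across items; the singular value is that wave's strength.
centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
for k in range(2):
    print(f"wave {k}: |corr with quality|    =",
          round(abs(np.corrcoef(U[:, k], quality)[0, 1]), 2))
    print(f"         |corr with confounder| =",
          round(abs(np.corrcoef(U[:, k], confounder)[0, 1]), 2))

# Strip the dominant wave (here, the shared bias) and average the rest.
debiased = centered - S[0] * np.outer(U[:, 0], Vt[0])
print("corr(debiased average, quality):",
      round(np.corrcoef(debiased.mean(axis=1), quality)[0, 1], 2))
```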
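CARE-Tensor itself decomposes a three-way moment tensor, which is too much machinery for a blog post. But a classic "triplet" trick from the weak-supervision literature (a close relative, not the paper's method) shows why three groups interacting is so powerful: with three views that are independent given the truth, you can estimate each judge's reliability without a single label. This simplified sketch assumes away the shared confounder, which is precisely what CARE-Tensor adds back in.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 20_000
truth = np.where(rng.random(n_items) < 0.5, 1.0, -1.0)  # hidden yes/no labels

# Three judge "groups", each a noisy view of the truth. The accuracies
# below are hypothetical, and crucially UNKNOWN to the recovery step.
def judge(acc):
    flip = rng.random(n_items) > acc
    return np.where(flip, -truth, truth)

a, b, c = judge(0.85), judge(0.70), judge(0.75)

# If the three views are independent given the truth, pairwise agreement
# factors as rho_ab = lam_a * lam_b, where lam = 2 * accuracy - 1.
# Solving this triplet system recovers each judge's reliability, label-free.
r_ab, r_ac, r_bc = (a * b).mean(), (a * c).mean(), (b * c).mean()
lam_a = np.sqrt(r_ab * r_ac / r_bc)   # ~0.70  (accuracy ~0.85)
lam_b = np.sqrt(r_ab * r_bc / r_ac)   # ~0.40  (accuracy ~0.70)
lam_c = np.sqrt(r_ac * r_bc / r_ab)   # ~0.50  (accuracy ~0.75)
print(lam_a, lam_b, lam_c)
```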

Why This Matters

In the real world, we use AI (Large Language Models) to judge other AI. But these AI judges often make the same mistakes because they were trained on similar data. They might all love long, wordy answers or hate short ones.

If we just average their scores, we amplify those mistakes. CARE fixes this by realizing, "Oh, these AI judges are all biased toward long answers. Let's subtract that bias so we can see the real quality."
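As a cartoon of that subtraction, suppose for a moment the confounder (answer length) were directly observable. Then debiasing is ordinary regression, as in this toy sketch (mine, not the paper's). CARE's actual job is harder: the confounder is latent, so it has to be estimated from the judges' correlation structure first.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
quality = rng.normal(size=n)          # real answer quality
length  = rng.normal(size=n)          # the shared "wordiness" confounder
avg_score = 0.5 * quality + 1.0 * length + 0.1 * rng.normal(size=n)

# With the confounder observable, just regress it out of the scores.
beta = length @ avg_score / (length @ length)
debiased = avg_score - beta * length
print("corr(raw average, quality):", round(np.corrcoef(avg_score, quality)[0, 1], 2))
print("corr(debiased,   quality):", round(np.corrcoef(debiased, quality)[0, 1], 2))
```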

The Results

The paper tested this on 12 different real-world challenges (like grading essays, checking for toxic comments, or picking the best chatbot response).

  • Without CARE: The AI judges often got it wrong because they were all fooled by the same tricks (like a long, fancy-sounding but wrong answer).
  • With CARE: The system stripped away the tricks. It reduced errors by up to 26.8%.

In short: CARE stops us from being fooled by a crowd of friends who are all biased in the same way. It finds the "truth" hidden beneath the shared noise, making our AI evaluations much more reliable.
