DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

The paper introduces DISCO, a conceptually simple and theoretically grounded method for efficient model evaluation. Instead of clustering for representativeness, it selects the samples on which models disagree most, achieving state-of-the-art performance prediction across multiple benchmarks while significantly reducing computational costs.

Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh

Published 2026-03-03

Imagine you are a food critic trying to review 400 different restaurants in a city. To do a fair job, you'd ideally need to eat at every single restaurant, try every dish, and write a detailed report. But here's the problem: you only have one stomach, one day, and a very small budget. Eating at all 400 places would take years and cost a fortune.

This is exactly the problem facing AI researchers today. They have built hundreds of new "AI chefs" (Large Language Models), but testing them on massive, complex benchmarks is becoming too expensive and slow. It takes thousands of hours of supercomputer time just to grade one model.

Enter DISCO (Diversifying Sample Condensation). Think of DISCO as a brilliant, lazy food critic who figures out a shortcut: "I don't need to eat at every restaurant. I just need to find the specific dishes where the chefs disagree the most."

Here is how it works, broken down into simple concepts:

1. The Old Way: The "Average" Approach

Previous methods tried to pick a small, "representative" sample of questions to test the AI. They used complex math to group questions together (like clustering) and picked one from each group.

  • The Flaw: It's like picking one "average" dish from every cuisine. If you ask 400 chefs to make a "standard" burger, they might all make something very similar. You learn nothing new about who is actually the best chef because they all agree on the easy stuff.

2. The DISCO Way: The "Disagreement" Approach

DISCO changes the strategy. Instead of looking for "average" questions, it looks for chaos.

  • The Analogy: Imagine you ask 400 chefs to solve a tricky riddle.
    • If 399 chefs say "The answer is A" and 1 chef says "The answer is B," that's not very interesting.
    • But if 200 chefs say "A," 100 say "B," and 100 say "C," that is a goldmine.
  • Why? When smart people (or AI models) disagree, it means the question is hard and reveals their true personality and capability. DISCO specifically hunts for these "disagreement zones." It selects the top 100 questions where the AI models are most confused and argue with each other the most.
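The hunt for "disagreement zones" can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it scores each question by the entropy of the answer distribution across models (zero when every model agrees, maximal when answers split evenly) and keeps the 100 most contested questions. All names and the random predictions are made up for the sketch.

```python
import numpy as np

# Hypothetical setup: preds[m, q] is the answer choice model m gives
# to question q. 400 models, 10,000 questions, 4 answer choices.
rng = np.random.default_rng(0)
preds = rng.integers(0, 4, size=(400, 10_000))

def disagreement(answers):
    """Entropy of the answer distribution across models for one question.

    0.0 when all models agree; highest when answers split evenly.
    """
    _, counts = np.unique(answers, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Score every question, then keep the 100 where models argue the most.
scores = np.array([disagreement(preds[:, q]) for q in preds.shape[1] * [0] and range(preds.shape[1])])
top_100 = np.argsort(scores)[-100:]
```

In the chef analogy: a riddle where 200 say "A", 100 say "B", and 100 say "C" gets a higher entropy score than one where 399 agree, so it is exactly the kind of question this selection keeps.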

3. The "Signature" Fingerprint

Once DISCO picks these 100 "chaotic" questions, it doesn't just look at the final score (e.g., "80% correct"). It looks at the fingerprint of the answers.

  • The Analogy: Imagine you don't just ask, "Did you pass the test?" You ask, "Show me your answer sheet."
  • DISCO takes the entire pattern of answers a model gives on those 100 tricky questions and turns it into a unique "signature" (like a DNA strand).
  • It then uses a simple computer program to look at this signature and say, "Ah, this pattern looks exactly like the models that usually get 95% on the full test. This one is probably a 95% too."
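The "signature matching" step can also be sketched simply. Assume (hypothetically) that each known model's signature is its 0/1 correctness pattern on the 100 selected questions, and that we already know each known model's full-benchmark score. A simple predictor, here a nearest-neighbour lookup for illustration (the paper's actual predictor may differ), borrows the full score of the known model whose signature is closest:

```python
import numpy as np

# Hypothetical data: 399 known models, each with a 100-bit signature
# (1 = answered that selected question correctly) and a known score
# on the full benchmark. All values here are synthetic.
rng = np.random.default_rng(1)
signatures = rng.integers(0, 2, size=(399, 100)).astype(float)
full_scores = signatures.mean(axis=1) * 0.2 + 0.75  # toy ground truth

def predict_score(new_signature):
    """Find the known model with the closest answer pattern
    (Hamming distance) and borrow its full-benchmark score."""
    dists = np.abs(signatures - new_signature).sum(axis=1)
    return full_scores[np.argmin(dists)]

# A new model with the same signature as known model 0 is predicted
# to have model 0's full score.
assert predict_score(signatures[0]) == full_scores[0]
```

This is the "this pattern looks like the models that usually get 95%" intuition: the signature, not the raw pass rate, carries the information.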

4. The Result: A Massive Savings

By using this method, DISCO achieves something magical:

  • The Old Way: You test the AI on all 10,000 questions — say, 10 hours and $1,000 of compute.
  • The DISCO Way: You test it on just 100 questions (the ones where models disagree most) — roughly 6 minutes and $1.
  • The Accuracy: Even though DISCO looks at only 1% of the data, it predicts the full-benchmark score with almost perfect accuracy (99%+ correlation).

Why This Matters

  • Speed: Researchers can now test new AI models hundreds of times a day instead of once a week.
  • Money: It saves millions of dollars in electricity and computing power.
  • Environment: Less computing power means a smaller carbon footprint.
  • Fairness: Smaller labs can now evaluate their models without needing a massive budget, making AI development more inclusive.

In a Nutshell

DISCO realizes that to judge a group of experts, you shouldn't ask them easy questions they all agree on. You should ask them the hard, confusing questions where they argue. By focusing on disagreement rather than agreement, DISCO creates a tiny, super-efficient test that tells you everything you need to know about an AI's true intelligence, saving time, money, and energy.
