DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

The paper introduces DISCO, a conceptually simple and theoretically grounded method for efficient model evaluation. Instead of clustering for representativeness, it selects the samples on which models disagree most, achieving state-of-the-art performance prediction across multiple benchmarks while significantly reducing computational costs.

Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh

Published 2026-03-03

Imagine you are a food critic trying to review 400 different restaurants in a city. To do a fair job, you'd ideally need to eat at every single restaurant, try every dish, and write a detailed report. But here's the problem: you only have one stomach, one day, and a very small budget. Eating at all 400 places would take years and cost a fortune.

This is exactly the problem facing AI researchers today. They have built hundreds of new "AI chefs" (Large Language Models), but testing them on massive, complex benchmarks is becoming too expensive and slow. It takes thousands of hours of supercomputer time just to grade one model.

Enter DISCO (Diversifying Sample Condensation). Think of DISCO as a brilliant, lazy food critic who figures out a shortcut: "I don't need to eat at every restaurant. I just need to find the specific dishes where the chefs disagree the most."

Here is how it works, broken down into simple concepts:

1. The Old Way: The "Average" Approach

Previous methods tried to pick a small, "representative" sample of questions to test the AI. They used complex math to group questions together (like clustering) and picked one from each group.

  • The Flaw: It's like picking one "average" dish from every cuisine. If you ask 400 chefs to make a "standard" burger, they might all make something very similar. You learn nothing new about who is actually the best chef because they all agree on the easy stuff.

2. The DISCO Way: The "Disagreement" Approach

DISCO changes the strategy. Instead of looking for "average" questions, it looks for chaos.

  • The Analogy: Imagine you ask 400 chefs to solve a tricky riddle.
    • If 399 chefs say "The answer is A" and 1 chef says "The answer is B," that's not very interesting.
    • But if 200 chefs say "A," 100 say "B," and 100 say "C," that is a goldmine.
  • Why? When smart people (or AI models) disagree, it means the question is hard and reveals their true personality and capability. DISCO specifically hunts for these "disagreement zones." It selects the top 100 questions where the AI models are most confused and argue with each other the most.
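The hunt for "disagreement zones" can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it scores each question by the entropy of the answer distribution across models (zero when every model agrees, maximal when answers split evenly) and keeps the 100 most contested questions. All names and the random predictions are made up for the sketch.

```python
import numpy as np

# Hypothetical setup: preds[m, q] is the answer choice model m gives
# to question q. 400 models, 10,000 questions, 4 answer choices.
rng = np.random.default_rng(0)
preds = rng.integers(0, 4, size=(400, 10_000))

def disagreement(answers):
    """Entropy of the answer distribution across models for one question.

    0.0 when all models agree; highest when answers split evenly.
    """
    _, counts = np.unique(answers, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Score every question, then keep the 100 where models argue the most.
scores = np.array([disagreement(preds[:, q]) for q in preds.shape[1] * [0] and range(preds.shape[1])])
top_100 = np.argsort(scores)[-100:]
```

In the chef analogy: a riddle where 200 say "A", 100 say "B", and 100 say "C" gets a higher entropy score than one where 399 agree, so it is exactly the kind of question this selection keeps.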

3. The "Signature" Fingerprint

Once DISCO picks these 100 "chaotic" questions, it doesn't just look at the final score (e.g., "80% correct"). It looks at the fingerprint of the answers.

  • The Analogy: Imagine you don't just ask, "Did you pass the test?" You ask, "Show me your answer sheet."
  • DISCO takes the entire pattern of answers a model gives on those 100 tricky questions and turns it into a unique "signature" (like a DNA strand).
  • It then uses a simple computer program to look at this signature and say, "Ah, this pattern looks exactly like the models that usually get 95% on the full test. This one is probably a 95% too."
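The "signature matching" step can also be sketched simply. Assume (hypothetically) that each known model's signature is its 0/1 correctness pattern on the 100 selected questions, and that we already know each known model's full-benchmark score. A simple predictor, here a nearest-neighbour lookup for illustration (the paper's actual predictor may differ), borrows the full score of the known model whose signature is closest:

```python
import numpy as np

# Hypothetical data: 399 known models, each with a 100-bit signature
# (1 = answered that selected question correctly) and a known score
# on the full benchmark. All values here are synthetic.
rng = np.random.default_rng(1)
signatures = rng.integers(0, 2, size=(399, 100)).astype(float)
full_scores = signatures.mean(axis=1) * 0.2 + 0.75  # toy ground truth

def predict_score(new_signature):
    """Find the known model with the closest answer pattern
    (Hamming distance) and borrow its full-benchmark score."""
    dists = np.abs(signatures - new_signature).sum(axis=1)
    return full_scores[np.argmin(dists)]

# A new model with the same signature as known model 0 is predicted
# to have model 0's full score.
assert predict_score(signatures[0]) == full_scores[0]
```

This is the "this pattern looks like the models that usually get 95%" intuition: the signature, not the raw pass rate, carries the information.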

4. The Result: A Massive Savings

By using this method, DISCO achieves something magical:

  • The Old Way: You test the AI on all 10,000 questions — say, 10 hours and $1,000 of compute.
  • The DISCO Way: You test it on just 100 questions (the ones where models disagree most) — roughly 6 minutes and $1.
  • The Accuracy: Even though DISCO looks at only 1% of the data, it predicts the full-benchmark score with almost perfect accuracy (99%+ correlation).

Why This Matters

  • Speed: Researchers can now test new AI models hundreds of times a day instead of once a week.
  • Money: It saves millions of dollars in electricity and computing power.
  • Environment: Less computing power means a smaller carbon footprint.
  • Fairness: Smaller labs can now evaluate their models without needing a massive budget, making AI development more inclusive.

In a Nutshell

DISCO realizes that to judge a group of experts, you shouldn't ask them easy questions they all agree on. You should ask them the hard, confusing questions where they argue. By focusing on disagreement rather than agreement, DISCO creates a tiny, super-efficient test that tells you everything you need to know about an AI's true intelligence, saving time, money, and energy.
