How Reliable is Language Model Micro-Benchmarking?
This paper challenges the reliability of language model micro-benchmarks, showing that they often fail to consistently rank models whose performance differs by only a few points. In many cases, a micro-benchmark needs as many as 250 examples before its rankings are no more accurate than those from a randomly sampled subset of the same size. The analysis offers actionable guidance on the trade-off between evaluation efficiency and ranking reliability.
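The baseline the paper compares against can be illustrated with a small simulation (not from the paper; the accuracies, benchmark size, and gap below are hypothetical): given per-example results for two models on a full benchmark, draw random subsets of varying sizes and measure how often the subset ranks the pair the same way the full benchmark does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example correctness (0/1) for two models on a
# 5,000-example benchmark, with a small true gap of ~2 points --
# the regime where rankings become unreliable.
n_examples = 5000
model_a = rng.random(n_examples) < 0.72
model_b = rng.random(n_examples) < 0.70
true_winner = model_a.mean() > model_b.mean()

def ranking_agreement(subset_size, n_trials=2000):
    """Fraction of random subsets whose ranking of the two models
    matches the ranking on the full benchmark."""
    hits = 0
    for _ in range(n_trials):
        idx = rng.choice(n_examples, size=subset_size, replace=False)
        hits += (model_a[idx].mean() > model_b[idx].mean()) == true_winner
    return hits / n_trials

for k in (25, 100, 250, 1000):
    print(f"{k:>5} examples: ranking agreement = {ranking_agreement(k):.2f}")
```

Agreement rises toward 1.0 with subset size; a micro-benchmark selection method is only worthwhile if, at the same budget, it beats this random-sampling curve.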