Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

This paper presents the first multi-dimensional evaluation of 31 LLM safety benchmarks. It finds that benchmark papers do not outperform non-benchmark papers in academic influence, and that neither author prominence nor paper impact correlates with code quality, pointing to a clear need for better repository readiness and ethical standards.

Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang

Published 2026-03-06

Imagine the world of Artificial Intelligence (AI) safety research as a massive, bustling construction site. Every day, new workers (researchers) arrive with blueprints (papers) claiming to have built the safest, most secure tools for the future.

To keep track of who is doing the best work, the site managers use Benchmarks. Think of a benchmark as a standardized "Driver's License Test" for AI. Instead of just saying, "My car is safe," you have to prove it by driving through a specific obstacle course (like avoiding jailbreaks or not hallucinating facts).

This paper, titled "Benchmark of Benchmarks," is like a quality inspector who decided to check the "Driver's License Tests" themselves. They asked: Are these tests actually the gold standard? Are the people who wrote them famous? And most importantly, are the test manuals actually usable?

Here is the breakdown of their findings, using simple analogies:

1. The "Famous Author" Myth 🌟

The Question: Do papers written by famous, highly-cited professors automatically become the most influential benchmarks?
The Finding: Not necessarily.
The Analogy: Imagine a celebrity chef writing a cookbook. You might expect their recipes to be the most popular. But the researchers found that in the AI safety world, a paper written by a "celebrity" isn't necessarily cited more often than a paper written by a regular researcher.

  • The Twist: While famous authors do get more attention for their papers in general, their fame does not translate to better code. Just because the chef is famous doesn't mean their recipe instructions are easy to follow.

2. The "Code Quality" Reality Check 🛠️

The Question: How good is the actual code (the "test kit") that comes with these benchmarks?
The Finding: It's surprisingly messy.
The Analogy: Imagine buying a high-end "DIY Furniture Kit" from a famous brand. You expect the instructions to be clear, the screws to be the right size, and the wood to be pre-drilled.

  • The Reality: The researchers found that only 39% of these "kits" could be assembled without any extra tools or modifications.
  • The "Missing Manual": Only 16% had perfect installation guides. Many were like a box of parts with a note that said, "Good luck, figure it out."
  • The "Ethical Warning": Even worse, only 6% included a warning label. Since these benchmarks often teach people how to "hack" or "break" AI (jailbreaking), the code repositories often lacked safety warnings, like a chemistry set that didn't tell you not to mix certain chemicals.
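To make the three criteria above concrete, here is a minimal sketch of what an automated "repository readiness" check might look like. This is an illustrative assumption, not the paper's actual rubric: the file names (`requirements.txt`, `README`) and keywords (`install`, `ethic`, `disclaimer`) are placeholders chosen for the example.

```python
from pathlib import Path

# Hypothetical sketch: a coarse "repository readiness" check inspired by the
# three criteria above (installability, install guide, ethics warning).
# File names and keywords here are illustrative assumptions, not the
# paper's actual methodology.

def check_repo(repo_dir: str) -> dict:
    repo = Path(repo_dir)
    files = {p.name.lower() for p in repo.iterdir() if p.is_file()}

    # Does the repo declare its dependencies at all?
    has_install_spec = bool(
        files & {"requirements.txt", "setup.py", "pyproject.toml", "environment.yml"}
    )

    # Find a README (any extension) and scan its text.
    readme = next(
        (p for p in repo.iterdir() if p.is_file() and p.name.lower().startswith("readme")),
        None,
    )
    readme_text = readme.read_text(errors="ignore").lower() if readme else ""

    return {
        "install_spec": has_install_spec,            # dependency file present?
        "install_guide": "install" in readme_text,   # README explains setup?
        "ethics_warning": any(                       # any ethics/disclaimer note?
            k in readme_text for k in ("ethic", "disclaimer", "responsible")
        ),
    }
```

A real audit would of course go further (actually running the install, checking pinned versions), but even a shallow check like this makes the paper's point tangible: these properties are cheap to verify and still frequently missing.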

3. The "Popularity vs. Usability" Disconnect 📉

The Question: Does having a great, working code repository make a paper more popular (more citations)?
The Finding: Only at the most basic level. The code has to run, but polish beyond that doesn't seem to matter.
The Analogy: Think of a paper as a YouTube tutorial.

  • If the video works and you can follow along, people will watch it and share it (high citations).
  • However, if the video is slightly blurry or the audio is a bit off (minor code quality issues), people still watch it!
  • The Conclusion: Researchers are "pragmatic." They will cite a paper if the code runs, even if the code is messy, poorly documented, or hasn't been updated in months. They don't seem to care about "clean code" as much as they care about "does it work?"

4. The "Open Source" Paradox 📦

The Finding: Benchmark papers are much better at sharing their code than regular papers.
The Analogy: If regular research papers are like private clubs where you have to ask for the recipe, benchmark papers are like public parks where the blueprints are hanging on a bulletin board.

  • 87% of benchmark papers shared their code, compared to only 44% of non-benchmark papers.
  • But, just because the blueprints are on the wall doesn't mean they are easy to read.

The Big Takeaway 🏁

The authors are essentially saying: "We are building a race car, but the instruction manual is written in a language nobody speaks, and the safety warnings are missing."

  • The Good News: The community is sharing their work openly.
  • The Bad News: The work is often hard to use, hard to install, and lacks ethical guardrails.
  • The Call to Action: The "famous chefs" (top researchers) need to stop just writing the recipes and start writing clear, safe, and easy-to-follow instructions. If they want their benchmarks to be the true standard, they need to make sure anyone can actually use them without needing a PhD in debugging.

In short: The "Driver's License Tests" for AI are becoming popular, but the test centers are often disorganized, the instructions are confusing, and the safety warnings are missing. It's time to clean up the shop.