BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

BeyondBench introduces a contamination-resistant evaluation framework that uses on-the-fly algorithmic problem generation to assess the true reasoning capabilities of 101 language models across 44 tasks, revealing significant performance gaps in complex problem-solving and the critical role of tool usage.

Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Published 2026-03-06

Imagine you are trying to test how good a student is at math. You give them a worksheet with 100 problems. If the student has seen those exact 100 problems before in their textbook, they might just memorize the answers and get 100%. But that doesn't mean they are actually good at math; it just means they have a good memory.

This is exactly what is happening with Artificial Intelligence (AI) today. The "textbooks" (the internet data used to train AI) are so huge that AI models have likely memorized the answers to the standard tests we use to judge them. They aren't "thinking"; they are just "reciting."

BEYONDBENCH is a new way to test AI that fixes this problem. Here is how it works, explained simply:

1. The Infinite Library (No More Memorization)

Imagine a library where the books are written by a magical machine. Every time you ask for a test, the machine generates a brand new book that has never existed before and will never exist again.

  • Old Way: We used static tests (like a fixed list of math problems). It's like giving the same test to every student. If the test is leaked, everyone cheats.
  • BEYONDBENCH Way: It creates problems on the fly, drawing from a space of over $10^{15}$ possible problems (more than a quadrillion). It is statistically impossible for an AI to have memorized the specific problem you just gave it. It forces the AI to actually solve the puzzle, not just recall an answer.
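The "magical machine" idea can be sketched in a few lines of Python. This is not the paper's actual generator, just a minimal illustration of the principle: sample the problem's ingredients at random, and compute the ground-truth answer inside the generator itself, so every test instance is fresh and self-grading.

```python
import random

def generate_problem(seed=None):
    """Sketch of contamination-resistant, on-the-fly generation.

    Five operands, each from 1 to 1,000,000, give 10^30 possible
    problems -- far too many for any specific instance to have
    appeared in training data. The generator also computes the
    exact answer, so no human labeling (or guessing) is needed.
    """
    rng = random.Random(seed)
    numbers = [rng.randint(1, 10**6) for _ in range(5)]
    question = " + ".join(map(str, numbers)) + " = ?"
    answer = sum(numbers)  # ground truth, computed deterministically
    return question, answer

question, answer = generate_problem(seed=42)
```

Passing a seed makes a specific problem reproducible for debugging, while omitting it yields a never-before-seen instance each time.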

2. The Three Levels of Difficulty (The Video Game Analogy)

The paper organizes these problems into three "levels," like a video game:

  • Easy Level (The Tutorial): Basic arithmetic and counting. Can the AI add numbers or find the biggest number in a list?
  • Medium Level (The Boss Fight): Patterns and sequences. Can the AI figure out the next number in a complex pattern (like a Fibonacci sequence) or solve a number theory puzzle?
  • Hard Level (The Final Dungeon): These are "NP-complete" problems. Think of these as the most complex puzzles in the universe, like Sudoku, Tower of Hanoi (moving disks between pegs), or Graph Coloring (painting a map so no touching countries have the same color). These require deep, step-by-step logical planning.
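A useful property of the Hard Level's NP-complete puzzles is that answers are hard to *find* but easy to *check*. The toy function below (my own illustration, not code from the paper) verifies a proposed graph coloring, the "painting a map" task above, in a single pass:

```python
def is_valid_coloring(edges, coloring):
    """Return True if no two adjacent nodes share a color.

    Finding a valid coloring is NP-complete, but verifying one is
    trivial -- which is exactly what makes these puzzles good
    benchmark tasks: the grader can check any proposed answer
    instantly and unambiguously.
    """
    return all(coloring[u] != coloring[v] for u, v in edges)

# Hypothetical 4-node "map": a square with one diagonal.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
good = {0: "red", 1: "blue", 2: "green", 3: "blue"}
bad = {0: "red", 1: "red", 2: "green", 3: "blue"}
print(is_valid_coloring(edges, good))  # True
print(is_valid_coloring(edges, bad))   # False (nodes 0 and 1 clash)
```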

3. The "Truth Machine" (No Guessing)

In many AI tests, the computer has to guess if the AI's answer is right. Sometimes the AI gives a weird answer that might be right, and the computer gets confused.

BEYONDBENCH uses a "Truth Machine" (mathematical solvers). Before the AI even sees the problem, the computer calculates the exact correct answer.

  • If there is only one right answer, the computer checks for that.
  • If there are multiple right answers (like in some puzzles), the computer lists all of them.
  • If the AI gets it right, it gets credit. If it gets it wrong, it gets zero. No guessing, no ambiguity.
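To make the "Truth Machine" concrete, here is a toy sketch (an assumed example task, not the paper's solvers): a solver enumerates every correct answer *before* the model responds, so grading is a simple set-membership check with no ambiguity.

```python
def solve_all(lo, hi):
    """'Truth machine' sketch: enumerate every correct answer ahead
    of time (here, for the toy task 'name any prime between lo and
    hi'). Grading then gives credit only for answers in this set."""
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))
    return {n for n in range(lo, hi + 1) if is_prime(n)}

valid_answers = solve_all(10, 20)   # {11, 13, 17, 19}

def grade(model_answer):
    """Binary, deterministic scoring: right or zero, never 'maybe'."""
    return model_answer in valid_answers
```

Because the valid set is computed exactly, a puzzle with multiple right answers (like this one) is no harder to grade than one with a single answer.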

4. The Results: The "Thinking" Illusion

The researchers tested 101 different AI models, from tiny ones that fit on a laptop to massive frontier models. Here is what they found:

  • The "Thinking" Models are Overthinking: Some new AI models are designed to "think" longer before answering (like taking a deep breath). The paper found that for these complex puzzles, thinking longer often made them worse. They would start solving a puzzle correctly, get confused in the middle, and then try to "fix" their own mistake, which only made it worse. It's like a student who knows the first step of a math problem but then starts second-guessing themselves and writes a completely wrong answer.
  • The "Tool" Users Win: The best performers weren't the ones trying to do everything in their head. They were the ones that knew when to use a calculator or write code.
    • Analogy: Imagine a human trying to solve a Sudoku. If they try to do it entirely in their head, they might fail. But if they are allowed to use a pencil and paper (a tool), they can solve it easily. The best AIs realized, "I can't hold all these numbers in my memory; I need to write them down."
  • The Ceiling: Even the smartest AI models hit a "ceiling" on the hardest puzzles. They could solve about 50-60% of the hardest problems, but they couldn't get to 100%. This suggests that current AI is still missing a fundamental piece of "reasoning" that humans have.

5. Why This Matters

BEYONDBENCH shows that many AI models project an illusion of intelligence. They look smart because they memorized the training data, but when you give them a brand new, never-before-seen puzzle, they struggle.

The paper concludes that to build truly intelligent AI (Artificial General Intelligence), we shouldn't just make the models bigger. Instead, we need to build hybrid systems: AIs that can talk and understand language, but also know when to stop and use a calculator, a code interpreter, or a search engine to get the job done.

In short: BEYONDBENCH is the first test that catches AI models cheating by memorizing answers. It shows us that while AI is getting better at talking, it still struggles to truly think through complex, new problems without help.