Are Large Language Models Truly Smarter Than Humans?

This paper presents a rigorous multi-method audit of six frontier large language models. It finds that their apparent superiority over humans on public benchmarks is significantly inflated by data contamination: memorization rates vary by domain, and performance drops substantially when the models are tested on indirect or paraphrased questions.

Eshwar Reddy M, Sourav Karmakar

Published 2026-03-18

Imagine a group of students taking a very important exam to prove they are geniuses. The test is called the "MMLU," and it covers everything from law and medicine to physics and philosophy.

Recently, AI models (like the super-smart chatbots you might have heard of) have been getting near-perfect scores on this exam. People are saying, "Wow, these AIs are smarter than human experts!"

But this paper asks a very simple, nagging question: "Did these students actually study the material, or did they just memorize the answer key?"

The authors of this paper investigated, in three different ways, whether the AI is truly "smart" or just "cheating" by having seen the test questions before. Here is what they found, explained simply.

The Three Investigations

1. The "Google Search" Test (Experiment 1)

The Idea: If a student memorized the test, the questions must be somewhere on the internet.
The Method: The researchers took 513 questions from the exam and Googled them. They checked if the exact questions and answers were floating around online in textbooks, lecture notes, or blog posts.
The Result: They found that 13.8% of the questions were definitely "contaminated," meaning the AI likely saw them during training. (A sketch of how such a check can be automated follows the list below.)

  • The Analogy: It's like finding out that 1 in 7 questions on a final exam were posted on a public forum where the teacher accidentally left the answer key.
  • The Surprise: The "STEM" subjects (Science, Tech, Engineering, Math) were the most contaminated on average, and in individual subjects like Philosophy, contamination reached 66% of the questions found online. It seems the AI didn't just learn the concepts; it memorized the specific questions.
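To make the "Google Search" test concrete, here is a minimal sketch of an exact-match check. It assumes the Google Programmable Search JSON API as the backend; the paper does not say which search tool the authors actually used, and the credentials and sample questions are placeholders:

```python
import time

import requests

# Placeholder credentials; this sketch assumes the Google Programmable Search
# JSON API, since the paper does not name the search tool that was used.
API_KEY = "YOUR_API_KEY"
SEARCH_ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"

def appears_verbatim_online(question: str) -> bool:
    """Search for the question as an exact quoted phrase and scan the snippets."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": f'"{question}"'},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    # Flag the question as contaminated if any result snippet repeats it verbatim.
    return any(question.lower() in item.get("snippet", "").lower() for item in items)

# Stand-ins for the 513 benchmark questions the authors checked.
questions = [
    "What is the capital of France?",
    "Which planet is known as the Red Planet?",
]

flagged = []
for q in questions:
    if appears_verbatim_online(q):
        flagged.append(q)
    time.sleep(1.0)  # be polite to the search API's rate limits

print(f"Contamination rate: {len(flagged) / len(questions):.1%}")
```

Snippet matching is deliberately strict, and search snippets are truncated, so a real audit would also inspect the hit pages themselves; automated flagging like this gives a lower bound, not a final count.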

2. The "Rewrite" Test (Experiment 2)

The Idea: If a student truly understands a concept, they can answer it even if the teacher rewrites the question. If they just memorized the answer, they will get confused if the words change.
The Method: The researchers took 100 questions and rewrote them, then asked the AI the same questions in the new wording. For example:

  • Original: "What is the capital of France?"
  • Rewritten: "Which city serves as the seat of government for the French Republic?"

The Result: When the words changed, the AI's scores dropped significantly. (A sketch of this comparison follows the list below.)

  • The Analogy: Imagine a student who can recite the poem "The Road Not Taken" perfectly. But if you ask them, "What is the poem about the traveler who took the less traveled path?", they freeze. They memorized the words, not the meaning.
  • The Big Drop: In Law and Ethics, the AI's score dropped by nearly 20%. This suggests that in these critical fields, the AI was mostly recognizing the specific phrasing of the question, not actually understanding the legal or moral reasoning.
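Here is a minimal sketch of that before-and-after comparison. The `ask_model` stub, the choice list, and the single hand-written paraphrase pair are all placeholders; the paper's actual prompting and paraphrase-generation setup is not reproduced here:

```python
def ask_model(question: str, choices: list[str]) -> str:
    """Stand-in for an LLM call; wire this to your model's API of choice."""
    raise NotImplementedError

def accuracy(items: list[dict], question_key: str) -> float:
    """Score the model on one wording ('original' or 'rewritten') of each question."""
    correct = sum(
        ask_model(item[question_key], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

# One illustrative pair; the paper used 100 rewritten questions.
items = [
    {
        "original": "What is the capital of France?",
        "rewritten": "Which city serves as the seat of government for the French Republic?",
        "choices": ["Paris", "Lyon", "Marseille", "Bordeaux"],
        "answer": "Paris",
    },
]

drop = accuracy(items, "original") - accuracy(items, "rewritten")
print(f"Accuracy drop under paraphrase: {drop:.1%}")
```

A model that genuinely understands the material should show a drop near zero; the large, domain-specific drops the authors report (such as the near-20% fall in Law and Ethics) are the signature of memorized phrasing.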

3. The "Fill-in-the-Blank" Test (Experiment 3)

The Idea: If the AI has memorized the test, it should be able to fill in the blanks of a question it has seen before, even if you hide a word.
The Method: They took the questions and covered up a key word or a wrong answer choice. They asked the AI to guess what was hidden.
The Result: The AI was able to guess the hidden parts 72.5% of the time. (A sketch of this probe follows the list below.)

  • The Analogy: Showing a student "The capital of France is [____]" proves little, because any student can say "Paris." The telling case is covering up one of the wrong answer choices: a student who can recite exactly which incorrect option was printed there hasn't reasoned it out; they have seen that specific test sheet before.
  • The Anomaly: One model, DeepSeek-R1, behaved differently. It couldn't reproduce the exact words (it couldn't fill in the blank perfectly), but it could guess the idea of the missing word. It was like a student who forgot the exact vocabulary but remembered the general story. This explained why it did poorly on the "Rewrite" test but still seemed to know the material.
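Here is a minimal sketch of such a fill-in-the-blank probe. The masking heuristic, the `ask_model` stub, and the toy question are assumptions for illustration; the paper's exact masking procedure is not reproduced here:

```python
import random
import re

def mask_one_word(question: str, rng: random.Random) -> tuple[str, str]:
    """Hide one informative word; return the masked text and the hidden word."""
    words = question.split()
    # Prefer longer words so the blank isn't a trivial 'the' or 'of'.
    candidates = [i for i, w in enumerate(words) if len(w) > 4]
    idx = rng.choice(candidates or range(len(words)))
    hidden, words[idx] = words[idx], "[____]"
    return " ".join(words), hidden

def ask_model(prompt: str) -> str:
    """Stand-in for an LLM call that returns the model's guess for the blank."""
    raise NotImplementedError

def normalize(word: str) -> str:
    """Lowercase and strip punctuation so 'Republic?' matches 'republic'."""
    return re.sub(r"\W", "", word).lower()

rng = random.Random(0)
# Stand-in for the real benchmark questions.
questions = ["Which city serves as the seat of government for the French Republic?"]

hits = 0
for q in questions:
    masked, hidden = mask_one_word(q, rng)
    guess = ask_model(f"Fill in the blank with the original word:\n{masked}")
    hits += normalize(guess) == normalize(hidden)  # exact-match recovery only

print(f"Exact recovery rate: {hits / len(questions):.1%}")
```

The exact-match criterion is what makes the DeepSeek-R1 anomaly visible: swapping the final comparison for a semantic-similarity check (embedding cosine similarity, say) would count its "right idea, wrong word" guesses as hits, matching the pattern the authors describe.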

The Big Picture: What Does This Mean?

The paper concludes that AI is not necessarily "smarter" than humans yet; it is just very good at recognizing patterns it has seen before.

  1. The "Exam" is Leaked: The tests we use to measure AI intelligence are like open-book exams where the answers are posted on the internet. The AI has read the internet, so it has seen the test.
  2. Memorization vs. Understanding: High scores on these tests often mean the AI has memorized the "script" of the question, not that it has the deep, flexible reasoning of a human expert.
  3. The Danger: If we trust these AI models for real-world jobs (like being a lawyer or a doctor) based on these scores, we are in trouble.
    • The Metaphor: Imagine hiring a pilot who has scored 100% on a simulator test, but only because they memorized the specific flight path of the test. If the wind changes or a new obstacle appears (a "paraphrased" question), they might crash.

The Takeaway

The authors aren't saying AI is useless. They are saying we need to stop pretending these public tests are a fair measure of "human-level intelligence."

To truly know if AI is smart, we need to give it a test it has never seen before, with questions written in a way it has never encountered. Until we do that, the high scores we see on leaderboards might just be the sound of a very good parrot repeating what it heard, rather than a genius thinking for itself.
