Imagine you've been teaching a robot to do math. For a long time, you've only tested it on things like "If I have 3 apples and buy 2 more, how many do I have?" or tricky riddles from math competitions. The robot has gotten pretty good at those.
But now, you want to know: Can this robot actually handle a real, advanced university math class? Can it solve the kind of problems a graduate student or a researcher would face, like optimizing complex systems or calculating how fluids move in 3D space?
That's exactly what this paper is about. The authors built a new, super-tough test called CompMath-MCQ to see if today's smartest AI robots are ready for the big leagues.
Here is the story of how they did it, explained simply:
1. The Problem: The Robot Hasn't Been Tested on "Real" Hard Stuff
Until now, most math tests for AI were like:
- Elementary School Math: Simple word problems.
- Math Olympiads: Tricky puzzles that require a "flash of genius" rather than steady, step-by-step work.
- Formal Proofs: Very strict, computer-code-like logic.
But these tests miss the middle ground: Graduate-level Computational Math. This is the stuff you learn in a Master's degree: Linear Algebra, Optimization, and using Python to solve real-world scientific problems.
The authors realized that if we want to know if AI is truly "smart" at math, we need to test it on this specific, high-level material.
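To make "computational math" concrete, here is a tiny sketch of the kind of task those courses cover: minimizing a function with gradient descent. The function and numbers are illustrative, not taken from the paper.

```python
# Illustrative only: a minimal gradient-descent example of the sort of
# "computational math" task the benchmark targets (not from the paper).
def grad_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function given its gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges toward x = 3
```

Each update just takes a small step downhill; after enough steps the answer settles at the minimum. That steady, step-by-step style is exactly what distinguishes this material from Olympiad puzzles.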
2. The Solution: A Brand New, "Leak-Proof" Test
The team created 1,500 brand-new multiple-choice questions.
Why multiple-choice?
Usually, when you ask an AI a hard question, it might give a long, rambling answer that is hard to grade. Did it get the right number but explain it wrong? Did it get the right logic but the wrong number?
By using multiple-choice (A, B, or C), the grading becomes as clear as a light switch: On or Off. It's fair, fast, and impossible to argue with.
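That on/off grading can be sketched in a few lines of Python. The field names and data shapes here are assumptions for illustration, not the paper's actual format.

```python
# Minimal sketch of multiple-choice grading: exact letter match, no
# partial credit. Field names are illustrative, not the paper's schema.
def grade(predictions, answer_key):
    """Return accuracy: each question is simply right or wrong."""
    correct = sum(1 for qid, choice in predictions.items()
                  if answer_key.get(qid) == choice)
    return correct / len(answer_key)

answer_key = {"q1": "A", "q2": "C", "q3": "B"}
predictions = {"q1": "A", "q2": "B", "q3": "B"}
print(grade(predictions, answer_key))  # 2 of 3 correct
```

There is nothing to interpret: a letter either matches the key or it doesn't, which is the whole point of the light-switch analogy.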
Why "Leak-Proof"?
This is the most important part. AI models are trained on massive amounts of data from the internet. If you test them on questions that already exist online (like old textbook problems), the AI might have just "memorized" the answer, not actually learned the math.
- The Analogy: Imagine giving a student a test on questions they've already seen on a cheat sheet. They get an A, but they didn't learn anything.
- The Fix: The authors wrote every single one of these 1,500 questions themselves. They are professors who teach these subjects. These questions have never been on the internet before. So, if the AI gets them right, it's because it actually understood the math, not because it memorized the answer key.
3. The Quality Control: The "Double-Check" System
Writing hard math questions is tricky. Sometimes a question is confusing, or the "correct" answer is actually wrong.
To fix this, they used a two-step safety net:
- The Robot Jury: They asked 8 different AI models to answer every question. If every model got a question wrong, or if the models all converged on the same wrong answer, the authors knew something was fishy with the question.
- The Human Experts: The professors then manually reviewed those suspicious questions to fix any errors or clarify the wording.
This ensured the test was fair and the answers were definitely correct.
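A possible version of that flagging step looks like the sketch below. The data shape (one answer letter per model) and the agreement thresholds are assumptions, not details from the paper.

```python
from collections import Counter

# Sketch of the "robot jury" filter: flag a question for human review if
# every model missed it, or if the models converged on one wrong answer.
# The unanimity rule and threshold here are assumptions.
def flag_for_review(model_answers, correct):
    """model_answers: one answer letter per model, e.g. ['A', 'B', ...]."""
    if all(a != correct for a in model_answers):
        return True  # every model got it wrong
    top_answer, top_count = Counter(model_answers).most_common(1)[0]
    # near-unanimous agreement on a wrong answer also looks fishy
    return top_answer != correct and top_count >= len(model_answers) - 1

print(flag_for_review(["B"] * 8, "A"))                 # unanimous wrong: flag
print(flag_for_review(["A", "A", "B", "C"] * 2, "A"))  # majority right: pass
```

Anything flagged goes to the human experts; everything else is presumed clean. A filter like this is cheap to run and concentrates the professors' time on the questions most likely to be broken.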
4. The Results: The Robots Are Good, But Not Perfect
They ran the test on the smartest AI models available (like GPT-5, Claude, and various open-source models). Here is what they found:
- The "Easy" Wins: The AIs were surprisingly good at Probability (calculating odds) and Python programming. It seems the internet has so much data on these topics that the robots have seen them a million times.
- The "Hard" Bottleneck: The AIs struggled the most with Vector Calculus (math involving 3D space, gradients, and complex shapes).
- The Metaphor: Think of it like this: The AI is great at following a recipe (programming) or guessing the weather (probability). But when you ask it to visualize how a river flows around a rock in 3D space and calculate the exact force at every point, it starts to get lost. It makes small sign errors or forgets a step in a long chain of logic.
- The Gap: The most advanced "closed" models (like GPT-5 and Claude) did the best, but even they only got about 80-90% right on the hardest topics. They aren't quite ready to replace a human PhD student yet.
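For a flavor of where the models slip, here is a tiny sketch of the bookkeeping vector calculus demands: deriving a gradient by hand and checking it numerically, where a single sign error breaks everything. The function is made up for illustration and is not a benchmark question.

```python
# Illustrative vector-calculus check (not a benchmark question): verify a
# hand-derived gradient of f(x, y, z) = x**2 * y - z by central differences.
def f(x, y, z):
    return x**2 * y - z

def hand_gradient(x, y, z):
    # d/dx = 2xy, d/dy = x**2, d/dz = -1  (one sign slip here, like the
    # ones the paper describes, and the check below fails)
    return (2 * x * y, x**2, -1.0)

def numeric_gradient(x, y, z, h=1e-5):
    return (
        (f(x + h, y, z) - f(x - h, y, z)) / (2 * h),
        (f(x, y + h, z) - f(x, y - h, z)) / (2 * h),
        (f(x, y, z + h) - f(x, y, z - h)) / (2 * h),
    )

point = (1.5, -2.0, 0.5)
match = all(abs(a - b) < 1e-6
            for a, b in zip(hand_gradient(*point), numeric_gradient(*point)))
print(match)  # True when every partial derivative and sign is right
```

Every partial derivative has to be right, in the right slot, with the right sign; a long chain of such steps is exactly where the models lose the thread.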
The Big Takeaway
This paper is a wake-up call. We are making AI smarter every day, but when it comes to advanced, step-by-step, computational math, the robots still have a lot of growing up to do.
They can solve the riddles, but they are still stumbling over the complex, real-world math that scientists and engineers use every day. The authors released this test to the public so everyone can keep tracking progress and helping these robots learn the hard stuff.