Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

This paper introduces CFE-Bench, a challenging multimodal benchmark built from authentic university exam problems across 20+ STEM domains. It reveals that even frontier models, despite moderate overall accuracy, struggle to maintain correct intermediate states and to reason step-efficiently in multi-step problems.

Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

Published 2026-03-04

🎓 The Big Idea: The "Final Exam" for AI

Imagine you've been teaching a robot to be a genius. It has read every book in the library and can recite facts faster than a human. But does it actually understand how to solve a real, tricky problem from a college physics class?

The authors of this paper say: "Not really."

They created a new test called CFE-BENCH (Classroom Final Exam). Think of this not as a pop quiz, but as the hard, final exam that real university students take. It uses actual homework and exam questions from real professors in fields like Physics, Math, and Engineering.

The goal? To see if AI models can handle the messy, multi-step logic required in real science, rather than just guessing the right answer from a multiple-choice list.


🧪 The Problem: AI is "Good at Guessing, Bad at Grading"

In the past, AI benchmarks were like multiple-choice tests. If an AI got the final answer right, it got a gold star. But the authors realized this was misleading.

The Analogy: Imagine a student taking a math test.

  • The AI's old way: It writes a 10-page essay full of correct-sounding sentences, but halfway through, it makes a tiny math error. It keeps going, and by the end, it accidentally guesses the right number.
  • The Teacher's view: "You got the right number, but your logic was a mess. You failed."

The paper introduces a new way to grade called Variable-Based Verification.

  • Old Way: Compare the whole essay to the teacher's essay. (Too fuzzy, easy to trick).
  • New Way: The teacher says, "I don't care about your essay. Just show me the number you got for Step 3 and the formula you used for Step 7."
  • Result: This catches the AI when it tries to "fake" its way to the answer.
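The idea can be sketched in a few lines. This is a minimal illustration of variable-based grading, not the paper's actual verifier; the variable names, values, and tolerance below are made up for the example.

```python
import math

def verify_variables(model_vars, gold_vars, rel_tol=0.01):
    """Grade a solution by its intermediate variables, not its prose.

    model_vars / gold_vars map a checkpoint name (e.g. "step3_velocity")
    to a numeric value. A step passes only if the model's value matches
    the reference within a relative tolerance.
    """
    results = {}
    for name, gold in gold_vars.items():
        got = model_vars.get(name)
        results[name] = got is not None and math.isclose(got, gold, rel_tol=rel_tol)
    return results

# A correct final answer can no longer hide a wrong middle step:
gold = {"step3_velocity": 4.2, "step7_force": 13.7, "final_answer": 57.5}
model = {"step3_velocity": 4.9, "step7_force": 13.7, "final_answer": 57.5}
print(verify_variables(model, gold))
# step3_velocity fails even though the final answer matches
```

Because each checkpoint is a single number or formula, the check is mechanical and hard to game with plausible-sounding prose.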

📉 The Results: The AI is Stuck in the Middle

When they ran the world's smartest AI models (like Gemini, GPT, and Qwen) through this "Final Exam," the results were surprising:

  1. The Score: Even the best AI only got about 60% right. That's a "C" grade in college.
  2. The Gap: The best open-source models (free to use) scored around 47%.
  3. The Multimodal Trap: When the questions included images (like circuit diagrams or graphs), the scores dropped even lower.

The Analogy: It's like a student who can memorize the dictionary perfectly but freezes when asked to write a story using those words. They know the parts, but they can't put them together.


🔍 The Diagnosis: Why Do They Fail?

The authors didn't just look at the final score; they acted like detectives to see where the AI broke down. They broke the solutions into small "steps" (like a recipe).

Here are the three big discoveries:

1. The "Step-by-Step" Illusion

  • Finding: If you ask the AI to do just one small step (e.g., "Calculate the speed of this block"), it gets it right 90% of the time.
  • Analogy: The AI is like a master chef who can chop an onion perfectly and boil water perfectly. But if you ask it to make a whole 5-course meal, it forgets which pot is on which burner.
  • Conclusion: The problem isn't that the AI doesn't know the facts; it's that it can't keep track of the story as it gets longer.
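A back-of-envelope calculation shows why 90% single-step accuracy is not reassuring. Assuming (as a simplification, not a claim from the paper) that steps fail independently, chain accuracy decays geometrically with length:

```python
# Why high single-step accuracy still fails long problems:
# with independent steps, chain accuracy is per_step ** n_steps.
per_step = 0.90  # roughly the single-step accuracy cited above
for n_steps in (1, 5, 10, 15):
    chain = per_step ** n_steps
    print(f"{n_steps:2d} steps -> {chain:.0%} chance the whole chain is right")
# 10 steps -> ~35%, 15 steps -> ~21%
```

Real errors are not independent (a confused model stays confused), but the compounding effect is the same: getting each ingredient right is not the same as getting the meal right.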

2. The "Middle-Step" Bottleneck

  • Finding: The AI gets lost in the middle of the solution. If you give the AI the answer to the first step, it can finish the rest. But if it has to figure out the middle steps on its own, it starts to drift and make errors.
  • Analogy: Imagine a hiker with a map. They can walk the first mile and the last mile easily. But in the middle of the forest, they lose the trail, wander in circles, and eventually end up in the wrong valley.
  • Key Insight: The AI needs intermediate checkpoints. If a human gives it the answer to the middle step, its performance jumps up almost as much as if the human gave it the whole solution.
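A checkpoint experiment like this is easy to wire up. The harness below is a hypothetical sketch (the function name, hint format, and stub model are all invented for illustration): it reveals the first few gold intermediate results as hints, so you can compare performance with and without a mid-solution checkpoint.

```python
def solve_with_checkpoints(model, problem, gold_steps, give_until=0):
    """Ask `model` (any callable prompt -> answer) to solve `problem`,
    revealing the first `give_until` gold intermediate results as hints.

    Comparing give_until=0 against give_until=len(gold_steps)//2
    measures how much a mid-solution checkpoint helps.
    """
    hints = "\n".join(
        f"Hint (step {i + 1}): {step}"
        for i, step in enumerate(gold_steps[:give_until])
    )
    prompt = f"{problem}\n{hints}" if hints else problem
    return model(prompt)

# Stub model just to show the wiring; a real run would call an LLM.
echo_model = lambda prompt: prompt.count("Hint")  # counts hints it received
steps = ["v = 4.2 m/s", "F = 13.7 N", "W = 57.5 J"]
print(solve_with_checkpoints(echo_model, "Block on an incline...", steps, give_until=2))
# -> 2
```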

3. The "Wordy" Problem

  • Finding: The AI takes way more steps to solve a problem than a human expert does.
  • Analogy: A human expert solves a puzzle in 10 moves. The AI tries to solve it in 15 moves, adding unnecessary fluff and detours. Every extra step is a chance to make a mistake.
  • Conclusion: The AI is inefficient. It's like a driver who takes a scenic route with 20 stops instead of the direct highway, just to "think" about the trip.
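The cost of those detours can be made concrete. Using the 10-vs-15-step numbers from the analogy above and an assumed per-step reliability (0.95 here is hypothetical, not a figure from the paper):

```python
# Step efficiency: every extra step is an extra chance to slip.
p = 0.95                        # assumed per-step reliability (illustrative)
expert_steps, model_steps = 10, 15
print(f"expert ({expert_steps} steps): {p ** expert_steps:.0%} chain success")
print(f"model  ({model_steps} steps): {p ** model_steps:.0%} chain success")
print(f"efficiency ratio: {model_steps / expert_steps:.1f}x the expert's step count")
# expert ~60%, model ~46%: the longer route loses even at the same skill level
```

Even at identical per-step skill, the verbose solver loses simply by taking the long way around.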

🚀 What This Means for the Future

The paper concludes that we can't just make AI "smarter" by feeding it more data. We need to change how it thinks.

  • Stop the Fluff: We need to train AI to be concise and efficient, not just verbose.
  • Check the Middle: We need to build systems that check the AI's work during the process, not just at the end.
  • Real-World Testing: We need to stop using easy, multiple-choice tests and start using real, messy, multi-step problems (like this "Final Exam") to see if AI is truly ready for the real world.

In short: The AI is a brilliant student who knows the textbook but fails the final exam because it loses its place in the middle of the problem. To fix it, we need to teach it how to stay on track, step by step.