Imagine a high-stakes cooking competition where four celebrity chefs (the AI models) are asked to recreate complex, multi-course meals from a recipe book (the AP Physics exams). The judges (physics experts) taste every dish and score it on how well the chef followed the recipe, on flavor, and on presentation.
This paper is the report card from that competition. Here's what happened, explained simply:
The Contestants
The researchers invited four of the smartest "chefs" in the AI world to take the AP Physics 1 and 2 exams. These aren't just multiple-choice quizzes; they are the "Olympics" of high school physics, requiring students to solve math problems, draw graphs, explain why things happen, and interpret diagrams.
The four chefs were:
- ChatGPT 4.1 mini
- Gemini 2.5 Flash
- Claude 4.0 Sonnet
- DeepSeek R1
They were told to act exactly like a high school student taking the test, with no special tricks or "cheat sheets" (like asking the AI to "think step-by-step" in a special way). They just had to do their best.
The Scoreboard: How Did They Do?
The Good News:
All four chefs were surprisingly good! On average, they scored between 82% and 92%. If these were human students, they would all be getting A's. They are excellent at the "mathy" parts of physics—plugging numbers into formulas and solving algebraic equations.
The Bad News (The Plot Twist):
While the average scores were high, performance from exam to exam was a rollercoaster.
- Physics 1 (Mechanics): It was a total toss-up. One year, Chef A was the winner; the next year, Chef C won. There was no clear "best" chef. It depended entirely on what specific question was asked that day.
- Physics 2 (Electricity, Light, Heat): Here, a clear hierarchy emerged. Gemini and DeepSeek were the most consistent high-achievers. Claude and ChatGPT were good but stumbled more often, especially on the harder questions.
The "Kitchen Disasters": Where They Failed
Even though they got high scores, the judges found some very specific, recurring mistakes. Think of these as the chefs' "signature flaws":
The "Blind" Chef (Diagram Errors):
If the recipe included a picture of a ramp or a circuit, the AI often got confused. It might look at a picture of two blocks sliding down a hill and think, "They start at the same height, so they must finish at the same time!" It missed the visual clue that one hill was steeper. It's like a chef who can read a recipe but can't tell the difference between a picture of a tomato and a picture of an apple.
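Here's the physics the models were missing, as a quick worked formula (a minimal sketch assuming the standard frictionless-ramp setup with release from rest; the exam's exact problem may differ). For a ramp of height h and angle θ:

$$
a = g\sin\theta, \qquad L = \frac{h}{\sin\theta}, \qquad t = \sqrt{\frac{2L}{a}} = \frac{1}{\sin\theta}\sqrt{\frac{2h}{g}}
$$

Both blocks share the same height h, but the slide time depends on the angle: the steeper ramp finishes first. Same starting height, different finish times.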
The "Graph-Illiterate" Chef (Chart Errors):
When asked to read a graph (like a line showing how pressure changes), the AI often made up numbers or missed the trend. It's like a chef looking at a temperature gauge and guessing the heat instead of reading the dial.
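For contrast, "reading the dial" is mechanical once you treat the graph as data: interpolate between plotted points instead of inventing a value. Here's a toy sketch in Python; the (volume, pressure) numbers are made up for illustration, not taken from any exam.

```python
import numpy as np

# Hypothetical points read off a pressure-vs-volume graph's gridlines.
volume = np.array([1.0, 2.0, 3.0, 4.0])    # m^3
pressure = np.array([8.0, 4.0, 2.7, 2.0])  # kPa

# Estimate the pressure at V = 2.5 m^3 by interpolating between the
# two nearest plotted points, rather than guessing a plausible number.
print(np.interp(2.5, volume, pressure))  # 3.35 kPa
```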
The "Left-Handed" Chef (Direction Errors):
Physics is all about direction (which way is the force pushing?). The AI often got its left and right mixed up, especially with magnetic fields (the "Right-Hand Rule"). It would calculate the math perfectly but point the arrow in the wrong direction, like a GPS that calculates the distance correctly but tells you to drive into a lake.
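Under the hood, the Right-Hand Rule is just a cross product, F = qv × B, so the direction is arithmetic, not intuition. A toy check in Python (the charge, speed, and field values are invented for illustration):

```python
import numpy as np

# Magnetic force on a moving charge: F = q * (v cross B).
q = 1.6e-19                    # proton charge, in coulombs
v = np.array([1e5, 0.0, 0.0])  # velocity along +x, in m/s
B = np.array([0.0, 0.2, 0.0])  # magnetic field along +y, in teslas

F = q * np.cross(v, B)
print(F)  # [0, 0, 3.2e-15]: the force points along +z
```

Swap the order of v and B and the answer flips sign, which is the flipped-arrow mistake described above.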
The "One-Note" Chef (Circuit Errors):
When looking at a drawing of an electrical circuit, the AI struggled to tell which wires were connected in a line (series) and which were side by side (parallel). It's like trying to figure out a subway map by looking at a tangled ball of yarn.
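Once the diagram has been read correctly, the actual math is two lines (resistor values invented for illustration):

```python
# Series: one path, so resistances add.
# Parallel: side-by-side paths, so reciprocals add.
R1, R2 = 4.0, 12.0              # ohms
R_series = R1 + R2              # 16.0 ohms
R_parallel = 1 / (1/R1 + 1/R2)  # 3.0 ohms
print(R_series, R_parallel)
```

The arithmetic is trivial; the hard part, and the part the AIs kept fumbling, is looking at the tangled drawing and deciding which formula applies.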
The Big Takeaway
What does this mean for us?
- AI is a Great Calculator: If you need to solve a math problem or check a formula, these AIs are fantastic tools. They are like a super-fast calculator that never gets tired.
- AI is a Weak Visualizer: If the problem requires looking at a picture, a graph, or imagining a 3D object in space, the AI is still prone to "hallucinations" (making things up). It sees the words but misses the picture.
- The "Chain Reaction" Problem: If the AI misreads the picture at the very beginning, every single step after that is wrong, even if the math is perfect. It's like building a house on a crooked foundation; the walls might be straight, but the whole house will fall over.
The Verdict for Teachers and Students
Teachers shouldn't just let students use AI to do their homework. Instead, they should use the AI's mistakes as teaching moments.
- "Look, the AI got this right, but it drew the arrow the wrong way. Can you spot why?"
- "The AI calculated the number correctly, but it ignored the diagram. What did the diagram tell us?"
In short: These AI systems are brilliant students who are great at math but terrible at reading maps. They can help you study, but you still need a human (or a very careful eye) to make sure they aren't driving you in the wrong direction.