Imagine a high-stakes cooking competition where four celebrity chefs (the AI models) are asked to recreate complex, multi-course meals from a recipe book (the AP Physics exams). The judges (physics experts) taste every dish and score it on how well the chef followed the recipe, on flavor, and on presentation.
This paper is the report card from that competition. Here's what happened, explained simply:
The Contestants
The researchers invited four of the smartest "chefs" in the AI world to take the AP Physics 1 and 2 exams. These aren't just multiple-choice quizzes; they are the "Olympics" of high school physics, requiring students to solve math problems, draw graphs, explain why things happen, and interpret diagrams.
The four chefs were:
- ChatGPT 4.1 mini
- Gemini 2.5 Flash
- Claude 4.0 Sonnet
- DeepSeek R1
They were told to act exactly like a high school student taking the test, with no special tricks or "cheat sheets" (like asking the AI to "think step-by-step" in a special way). They just had to do their best.
The Scoreboard: How Did They Do?
The Good News:
All four chefs were surprisingly good! On average, they scored between 82% and 92%. If these were human students, they would all be getting A's. They are excellent at the "mathy" parts of physics—plugging numbers into formulas and solving algebraic equations.
The Bad News (The Plot Twist):
While the average scores were high, performance from exam to exam was a rollercoaster.
- Physics 1 (Mechanics): It was a total toss-up. One year, Chef A was the winner; the next year, Chef C won. There was no clear "best" chef. It depended entirely on what specific question was asked that day.
- Physics 2 (Electricity, Light, Heat): Here, a clear hierarchy emerged. Gemini and DeepSeek were the most consistent high-achievers. Claude and ChatGPT were good but stumbled more often, especially on the harder questions.
The "Kitchen Disasters": Where They Failed
Even though they got high scores, the judges found some very specific, recurring mistakes. Think of these as the chefs' "signature flaws":
The "Blind" Chef (Diagram Errors):
If the recipe included a picture of a ramp or a circuit, the AI often got confused. It might look at a picture of two blocks sliding down a hill and think, "They start at the same height, so they must finish at the same time!" It missed the visual clue that one hill was steeper. It's like a chef who can read a recipe but can't tell the difference between a picture of a tomato and a picture of an apple.
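Here's the physics the models were missing, as a quick worked formula (a minimal sketch assuming the standard frictionless-ramp setup with release from rest; the exam's exact problem may differ). For a ramp of height h and angle θ:

$$
a = g\sin\theta, \qquad L = \frac{h}{\sin\theta}, \qquad t = \sqrt{\frac{2L}{a}} = \frac{1}{\sin\theta}\sqrt{\frac{2h}{g}}
$$

Both blocks share the same height h, but the slide time depends on the angle: the steeper ramp finishes first. Same starting height, different finish times.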
The "Graph-Illiterate" Chef (Chart Errors):
When asked to read a graph (like a line showing how pressure changes), the AI often made up numbers or missed the trend. It's like a chef looking at a temperature gauge and guessing the heat instead of reading the dial.
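For contrast, "reading the dial" is mechanical once you treat the graph as data: interpolate between plotted points instead of inventing a value. Here's a toy sketch in Python; the (volume, pressure) numbers are made up for illustration, not taken from any exam.

```python
import numpy as np

# Hypothetical points read off a pressure-vs-volume graph's gridlines.
volume = np.array([1.0, 2.0, 3.0, 4.0])    # m^3
pressure = np.array([8.0, 4.0, 2.7, 2.0])  # kPa

# Estimate the pressure at V = 2.5 m^3 by interpolating between the
# two nearest plotted points, rather than guessing a plausible number.
print(np.interp(2.5, volume, pressure))  # 3.35 kPa
```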
The "Left-Handed" Chef (Direction Errors):
Physics is all about direction (which way is the force pushing?). The AI often got its left and right mixed up, especially with magnetic fields (the "Right-Hand Rule"). It would calculate the math perfectly but point the arrow in the wrong direction, like a GPS that calculates the distance correctly but tells you to drive into a lake.
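Under the hood, the Right-Hand Rule is just a cross product, F = qv × B, so the direction is arithmetic, not intuition. A toy check in Python (the charge, speed, and field values are invented for illustration):

```python
import numpy as np

# Magnetic force on a moving charge: F = q * (v cross B).
q = 1.6e-19                    # proton charge, in coulombs
v = np.array([1e5, 0.0, 0.0])  # velocity along +x, in m/s
B = np.array([0.0, 0.2, 0.0])  # magnetic field along +y, in teslas

F = q * np.cross(v, B)
print(F)  # [0, 0, 3.2e-15]: the force points along +z
```

Swap the order of v and B and the answer flips sign, which is the flipped-arrow mistake described above.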
The "One-Note" Chef (Circuit Errors):
When looking at a drawing of an electrical circuit, the AI struggled to tell which wires were connected in a line (series) and which were side by side (parallel). It's like trying to figure out a subway map by looking at a tangled ball of yarn.
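Once the diagram has been read correctly, the actual math is two lines (resistor values invented for illustration):

```python
# Series: one path, so resistances add.
# Parallel: side-by-side paths, so reciprocals add.
R1, R2 = 4.0, 12.0              # ohms
R_series = R1 + R2              # 16.0 ohms
R_parallel = 1 / (1/R1 + 1/R2)  # 3.0 ohms
print(R_series, R_parallel)
```

The arithmetic is trivial; the hard part, and the part the AIs kept fumbling, is looking at the tangled drawing and deciding which formula applies.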
The Big Takeaway
What does this mean for us?
- AI is a Great Calculator: If you need to solve a math problem or check a formula, these AIs are fantastic tools. They are like a super-fast calculator that never gets tired.
- AI is a Weak Visualizer: If the problem requires looking at a picture, a graph, or imagining a 3D object in space, the AI is still prone to "hallucinations" (making things up). It sees the words but misses the picture.
- The "Chain Reaction" Problem: If the AI misreads the picture at the very beginning, every single step after that is wrong, even if the math is perfect. It's like building a house on a crooked foundation; the walls might be straight, but the whole house will fall over.
The Verdict for Teachers and Students
Teachers shouldn't just let students use AI to do their homework. Instead, they should use the AI's mistakes as teaching moments.
- "Look, the AI got this right, but it drew the arrow the wrong way. Can you spot why?"
- "The AI calculated the number correctly, but it ignored the diagram. What did the diagram tell us?"
In short: These AI systems are brilliant students who are great at math but terrible at reading maps. They can help you study, but you still need a human (or a very careful eye) to make sure they aren't driving you in the wrong direction.