This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you are a teacher grading 57 physics lab reports. Each report is a messy mix of handwritten notes, typed text, complex math equations, and hand-drawn graphs. It's a lot of work! Now, imagine you have a super-smart, tireless robot assistant (ChatGPT) that can read all these reports in seconds and give you a grade and feedback.
That's exactly what this paper investigates. The researchers asked: "Can this AI robot be a fair and accurate grader for physics lab reports, or is it just a fancy guesser?"
Here is the breakdown of their findings, using some everyday analogies.
1. The Setup: The Robot vs. The Human
The researchers took 57 real student reports from a university in Uruguay. They fed them into the AI (specifically a version called GPT-5.4) using a strict "grading rubric" (a checklist of rules, like a recipe for grading). They also had human teachers grade the same reports. Then, they compared the two.
The Result: The robot's grades and the human teachers' grades agreed only weakly.
- The Score Gap: On average, the human teachers gave an 8.6, while the robot gave a 7.9.
- The Ranking: If you lined up the reports from best to worst, the robot's order was only weakly related to the teachers' order. It was like two people trying to sort a deck of cards; they ended up with very different stacks.
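That "weakly related" ordering is exactly what a rank correlation (such as Spearman's rho) measures: +1 means both graders sort the reports identically, 0 means no relationship. As a sketch of how such a comparison could be computed, here is a minimal self-contained version with invented scores (these numbers are illustrative only, not the paper's data):

```python
def rank(xs):
    """Assign ranks (1 = lowest), averaging the ranks of tied values."""
    sorted_xs = sorted(xs)
    return [sorted_xs.index(x) + sorted_xs.count(x) / 2 + 0.5 for x in xs]

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical grades for six reports -- NOT the paper's data.
human = [9.0, 8.5, 8.0, 7.5, 7.0, 9.5]
ai    = [7.5, 9.0, 6.5, 8.5, 7.0, 8.0]

gap = sum(human) / len(human) - sum(ai) / len(ai)
rho = spearman(human, ai)
print(f"mean gap: {gap:.2f}, Spearman rho: {rho:.2f}")
# → mean gap: 0.50, Spearman rho: 0.26  (a weak positive rank agreement)
```

A rho near 0.26, as in this toy example, would mean the AI's best-to-worst ordering only loosely tracks the teachers', even if the average scores look close.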
2. Where the Robot Shined: The "Formatting Inspector"
The robot was surprisingly good at checking the structure of the report.
- Analogy: Think of the robot as a strict librarian.
- What it did well: It could easily tell if a report had an "Objectives" section, a "Theory" section, and a "Conclusion." It checked if the student followed the rules of the game (like using the right headings).
- The Verdict: If the report looked neat and followed the checklist, the robot said, "Good job, you followed the rules!"
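This kind of structural check is, in spirit, mechanical: scan the text for the headings the rubric requires. A minimal sketch of the idea (the section names and sample report are invented for illustration, not taken from the paper's actual rubric):

```python
import re

# Hypothetical rubric: required top-level headings.
REQUIRED_SECTIONS = ["Objectives", "Theory", "Procedure", "Results", "Conclusion"]

def check_structure(report_text: str) -> dict:
    """Return {section: True/False} by looking for each heading at a line start."""
    found = {}
    for section in REQUIRED_SECTIONS:
        pattern = rf"^\s*{re.escape(section)}\b"
        found[section] = bool(
            re.search(pattern, report_text, flags=re.MULTILINE | re.IGNORECASE)
        )
    return found

report = """Objectives
Measure g with a pendulum.

Theory
For small angles, T = 2*pi*sqrt(L/g).

Conclusion
Our estimate of g was 9.7 m/s^2."""

print(check_structure(report))
# → {'Objectives': True, 'Theory': True, 'Procedure': False,
#    'Results': False, 'Conclusion': True}
```

The point of the sketch is that presence-of-sections grading needs no physics understanding at all, which is why an AI can look competent here while still failing on the math and graphs.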
3. Where the Robot Stumbled: The "Math & Graphs Blind Spot"
This is where things got messy. Physics isn't just about writing words; it's about numbers, graphs, and equations.
- Analogy: Imagine the robot is trying to read a comic book, but the pages are being scanned by a broken photocopier that smears the ink and drops the pictures.
- The Problem: The reports contained graphs, tables, and math formulas. When the AI tried to read the PDF, it often couldn't "see" the graphs or read the math correctly.
- The "Blind" Mistake: Sometimes the robot would say, "I can't see the graph," and give a low score. Other times, it would confidently guess what the graph said, get it wrong, and give a high score anyway.
- The "Hallucination": In some cases, the robot made up reasons for a grade. It would say, "The student did a great job with the uncertainty analysis," even though the robot couldn't actually read the math to verify it. It was like a student guessing the answer on a test because they forgot their glasses.
4. The Two Types of "Robot Errors"
The researchers found two main ways the robot messed up:
- The "I Can't See It" Error (Explicit): The robot admitted, "Hey, this graph is blurry, I can't read it." This is honest, but it means the robot can't grade that part.
- The "I Think I Know" Error (Inferred): This is the dangerous one. The robot looked at a smudged equation, guessed what it meant, and confidently graded it. It was like a detective solving a crime based on a blurry photo and being 100% sure they caught the right person, even though they might be wrong.
5. The "Chat" Experiment
The researchers tried talking to the robot one-on-one (conversational mode) instead of just dumping all the reports on it at once.
- Analogy: Instead of handing the robot a stack of 57 papers and saying "Grade these," they sat down with the robot and said, "Hey, look at this specific graph in this report. What do you see?"
- The Result: When the robot could focus on one thing at a time and ask clarifying questions, it got much better at understanding the math and graphs. It was like taking off the blindfold.
The Big Takeaway
Can AI replace the physics teacher?
No. Not yet.
Can AI help the physics teacher?
Yes, but with supervision.
Think of the AI as a junior teaching assistant.
- What it's good at: It can quickly check if the report has all the right sections, if the writing is clear, and if the student followed the basic rules. It can save the teacher time on the "boring" stuff.
- What it's bad at: It cannot reliably judge the deep physics reasoning, the math calculations, or the interpretation of complex graphs. It needs a human to double-check its work, especially the tricky parts.
The Final Lesson:
If you use AI to grade physics labs, you must keep a human in the loop. The AI is a powerful tool for organization and feedback on structure, but when it comes to the "soul" of physics (the math and the data), the human teacher is still the only one who can truly see what's happening.