Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
This paper introduces a five-level, expert-curated framework for evaluating large language models on quantum field theory and string theory. It shows that while the models excel at explicit derivations, they systematically fail when tasks require reconstructing tacit reasoning or maintaining global conceptual consistency, which reveals the limitations of current evaluation paradigms for highly abstract theoretical physics.