Imagine the International Physics Olympiad (IPhO) as the "Olympics of the Mind" for high school students. It's the hardest physics test in the world, where the smartest teenagers from over 100 countries compete to solve incredibly complex puzzles involving gravity, light, and energy. Gold medals are awarded to the top one-twelfth (about 8.3%) of contestants — but even among gold medalists, a perfect score is exceedingly rare. In fact, zero human contestants achieved a perfect score at IPhO 2025.
This paper is a report card from a computer program (an AI agent) that took this test in 2025. The result? It got a perfect score, 100%, every single time.
Here is the story of how the researchers pulled it off, explained simply:
1. The Star Player: "Gemini 3.1 Pro"
Think of the AI model used here as a super-intelligent student who has read almost every book in the library. The researchers used a very new version of Google's AI, called Gemini 3.1 Pro Preview.
- The Catch: This AI was released after the test happened. This means the AI might have seen the questions during its training — like a student whose teacher might have accidentally given them the answer key beforehand. The authors acknowledge this is a possibility (called "data contamination"), but note it's not certain — the model's knowledge cutoff actually predates the exam. Either way, they argue that solving these problems perfectly is still a huge deal, especially given the leap from previous models.
2. The Strategy: "The Debate Club"
The researchers didn't just ask the AI to "solve this." They built a special Agent (a smart assistant) that uses a trick called Parallel Thinking.
Imagine you are trying to solve a tricky math problem. Instead of just writing down one answer, you ask four different friends to solve it on their own.
- Round 1: The AI generates four different solutions for the same problem.
- Round 2: It acts like a referee. It looks at the four answers, finds the mistakes in the wrong ones, and combines the best parts of the correct ones into a single, perfect solution.
It's like a team of detectives comparing notes to make sure no one missed a clue. This parallel thinking and synthesis process is what allowed the AI to fix its own mistakes and reach a perfect score. (Note: the author remarks that a consensus approach may also be effective with other base models, but it was not the approach used in this agent.)
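The two-round loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual pipeline: the stand-in "referee" here just takes a majority vote over final answers, whereas the real agent critiques each candidate solution and writes a single merged one. All function and variable names are hypothetical.

```python
def parallel_think(problem, solvers):
    # Round 1: each solver attempts the problem independently.
    candidates = [solve(problem) for solve in solvers]
    # Round 2: a "referee" pass over the candidates. As a minimal
    # stand-in, we take the most common final answer (majority vote);
    # the paper's agent instead critiques each attempt and synthesizes
    # a merged solution, which is stronger than plain voting.
    return max(set(candidates), key=candidates.count)

# Toy solvers: three arrive at the right value, one makes a sign error.
solvers = [lambda p: 9.8, lambda p: 9.8, lambda p: -9.8, lambda p: 9.8]
print(parallel_think("What is g in m/s^2?", solvers))  # prints 9.8
```

The key design idea is that the four attempts are independent, so an error made in one attempt is unlikely to be repeated in all the others.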
3. The Tool: "The Digital Ruler"
Physics problems often come with diagrams (pictures of planets, circuits, or springs). A human can eyeball a figure and estimate the numbers, but an AI often struggles to read precise values off an image.
- The Problem: If the AI guesses the length of a line in a drawing, it might be slightly off, which ruins the whole calculation.
- The Fix: The researchers gave the AI a Python code tool. Instead of just "looking" at the picture, the AI wrote a tiny computer program to measure the pixels on the screen with mathematical precision. It's like giving the AI a digital ruler instead of asking it to guess with its eyes.
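The core of such a "digital ruler" is simple: measure a distance in pixels, then convert it to physical units using a reference feature of known size (such as an axis segment). The sketch below illustrates the idea; the coordinates and scale are made-up numbers, not taken from the exam figures.

```python
import math

def pixels_to_length(p1, p2, scale_px, scale_value):
    """Convert a pixel-space distance into physical units, given a
    reference feature that is scale_px pixels long and represents
    scale_value physical units (e.g. an axis tick spacing)."""
    d_px = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return d_px * scale_value / scale_px

# Hypothetical example: a 100 px axis spacing represents 2.0 cm, and
# the feature to measure spans (30, 40) -> (330, 440) in pixel space.
length = pixels_to_length((30, 40), (330, 440), scale_px=100, scale_value=2.0)
print(length)  # 500 px * 2.0 cm / 100 px = 10.0 cm
```

Because the pixel coordinates are exact, the only remaining uncertainty is in locating the endpoints, which is far smaller than eyeballing the whole length.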
4. The "Clean-Up Crew" (Fixing the Test)
One of the most interesting parts of this paper isn't just the AI winning; it's that the AI helped the researchers find mistakes in the test itself.
Before running the AI, the researchers used it to check the official exam questions. They found three errors:
- A Graph Error: One graph showed a curve that decayed faster than was physically possible. The marking scheme was adjusted so that full points were awarded if the calculated mass had a relative error of 25% or less compared to the specific data point the agent used.
- A Frequency Shift Error: A diagram showed a star moving away (redshift), but the frequency shift graph contradicted this by showing the opposite direction. The axis label was corrected.
- A Math Error: The official answer key had a calculation mistake.
The AI spotted these errors, and the researchers fixed the "official" test before grading the AI. This shows that the AI is becoming smart enough to act as a physics professor, not just a student.
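The adjusted grading rule from the graph-error fix boils down to a one-line relative-error check. A minimal sketch, with illustrative numbers rather than values from the actual exam:

```python
def within_tolerance(calculated, reference, tol=0.25):
    """Award full marks if the relative error is at most tol (here 25%)."""
    return abs(calculated - reference) / abs(reference) <= tol

# Illustrative masses only: a value within 25% of the reference passes.
print(within_tolerance(1.1e30, 1.0e30))  # True  (10% off)
print(within_tolerance(1.4e30, 1.0e30))  # False (40% off)
```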
5. The Verdict
The AI took the test five times and got 30/30 every time.
- Why is this important? It shows that AI is getting incredibly good at deep reasoning, not just memorizing facts.
- The Caveat: Because the AI was released after the test, there is a possibility it saw the questions in its training data — though this is not certain. But even with that doubt, the fact that it solved these problems perfectly is a massive leap forward. Notably, the previous best score of 87.7% was also achieved using Gemini (via Gemini 3 Deep Think), meaning both systems share the same potential level of data contamination — yet this agent's perfect score represents a significant jump in capability.
In a nutshell: The researchers built a smart AI team that debates its own answers, uses digital tools to measure pictures perfectly, and even found errors in the test questions. The result? A robot that aced the world's hardest high school physics exam.