Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Imagine you are trying to solve an incredibly difficult puzzle, like the ones found in the world's hardest math competitions (the International Mathematical Olympiad, or IMO). For years, computers have been terrible at one specific type of puzzle: geometry.

Why? Because geometry isn't just about crunching numbers. It's about creativity. To solve a hard geometry problem, you often have to draw an invisible line, add a hidden point, or create a new shape that isn't in the original picture. This is called an "auxiliary construction."

Old computer programs were like rigid robots: they could follow rules perfectly, but they couldn't "guess" what to draw next. They needed millions of examples to learn, and even then, they often got stuck.

Enter InternGeometry, a new AI agent that acts more like a brilliant human student than a robot. Here is how it works, explained through simple analogies:

1. The "Thinker" vs. The "Calculator"

Most old AI systems were like calculators: they tried to brute-force every possible answer until they got lucky.

InternGeometry is like a detective.

The Detective's Process: Instead of just guessing, the detective (the AI) looks at the crime scene (the geometry problem) and says, "Hmm, if I draw a line here, maybe I can prove this angle is 90 degrees."
The Lab: It then goes to a "lab" (a symbolic engine) to test that idea.
The Feedback: If the lab says, "No, that line doesn't work," the detective doesn't give up. They say, "Okay, that failed. Let me try a different angle."
The Memory: The detective keeps a notebook (Dynamic Memory) so they don't forget which ideas failed and which ones showed promise, even after 200 tries.

2. The "Video Game Level" Training (CBRL)

How do you train a detective to solve the hardest cases? You don't start them on a murder mystery if they can't even solve a missing sock.

The researchers used a method called Complexity-Boosting Reinforcement Learning (CBRL). Think of this as a video game with a smart difficulty slider:

Level 1: The AI is given easy puzzles. It solves them and gets a "thumbs up."
Level 2: Because it's getting good, the game automatically makes the next puzzle slightly harder.
The Sweet Spot: The system constantly adjusts the difficulty to be "just right"—not so easy that it's boring, and not so hard that the AI gives up.
The Result: By the time the AI reaches the "Boss Level" (IMO problems), it has been trained on a perfect curriculum that slowly built its skills, rather than throwing it into the deep end immediately.

3. The "Magic Trick" of Efficiency

Here is the most surprising part:

The Old Way (AlphaGeometry 2): To learn, this system needed 300 million examples. It was like trying to learn to cook by reading every recipe book in the world.
The New Way (InternGeometry): This system learned with only 13,000 examples. That is 0.004% of the data the old system used.

It's the difference between a student who memorizes a dictionary to learn a language versus a student who actually talks to people, makes mistakes, learns from them, and improves rapidly.

4. The Result: Beating the Gold Medalists

When they tested InternGeometry on the last 25 years of the world's hardest geometry problems:

It solved 44 out of 50 problems.
The average human Gold Medalist (the top 0.1% of math students) solves about 40.9 of those 50 problems.
InternGeometry beat the average Gold Medalist.

Even cooler? In some cases, the AI came up with a solution that differs from the known human ones. It found a different way to draw the extra lines and points than the standard human proof uses.

Summary

InternGeometry is a new AI that doesn't just "calculate" geometry; it reasons like a human.

It thinks out loud (proposing ideas).
It tests them (using a math engine).
It remembers its mistakes (dynamic memory).
It learns by playing a game that gets harder as it gets smarter (Complexity-Boosting RL).

It proves that with the right training method, AI doesn't need massive amounts of data to become a genius; it just needs the right way to practice.

1. Problem Statement

While Large Language Model (LLM) agents have shown strong performance in general mathematical reasoning and programming, solving International Mathematical Olympiad (IMO) level geometry problems remains a significant challenge.

The Bottleneck: Geometry proofs often depend on auxiliary constructions (e.g., adding specific points, lines, or circles that are not explicitly given), and only "weak heuristics" exist for choosing them. Finding a useful construction is a creative step that usually takes many trials, and current LLMs struggle to produce one without massive search or expert models.
Current State-of-the-Art (SOTA): Systems like AlphaGeometry 2 and SeedGeometry achieve medalist-level performance but rely heavily on:
- Massive scale data synthesis (hundreds of millions of examples).
- Expert models trained specifically for geometry.
- Extensive search trees (beam search) during inference.
The Goal: Can an LLM agent achieve expert-level geometry performance with high data efficiency, better generalization, and creative reasoning without relying on massive expert-model search?

2. Methodology: InternGeometry

The authors propose InternGeometry, an LLM agent built on the InternThinker-32B backbone, designed to solve geometry problems through long-horizon interaction with a symbolic engine.

A. Core Architecture: InternGeometry-DDAR

The agent interacts with a custom symbolic engine, InternGeometry-DDAR (Deductive Database Arithmetic Reasoning), which is an enhanced version of the open-source Newclid system.

Capabilities: It supports complex geometric structures, global point optimization (adjusting points to satisfy multiple constraints simultaneously), and handling "double points" (distinctly named points with identical coordinates).
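As an illustration of what "double points" means, the following minimal sketch finds distinctly named points whose coordinates coincide within a tolerance. The helper name `find_double_points`, the tolerance value, and the dictionary layout are assumptions made for this example; it is not the InternGeometry-DDAR implementation.

```python
import math
from itertools import combinations

def find_double_points(points, tol=1e-9):
    """Return pairs of distinct names whose coordinates coincide within `tol`.

    `points` maps a point name to an (x, y) tuple (hypothetical layout).
    """
    pairs = []
    for (name_a, xy_a), (name_b, xy_b) in combinations(sorted(points.items()), 2):
        if math.dist(xy_a, xy_b) <= tol:
            pairs.append((name_a, name_b))
    return pairs

# Toy configuration: M and M2 are named differently but sit at the same place.
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "M": (0.5, 0.0), "M2": (0.5, 0.0)}
double_points = find_double_points(points)
```

An engine that recognizes such pairs can treat both names as the same object instead of failing on a degenerate configuration.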
Interaction Loop:
1. Think: The LLM performs natural language reasoning to plan a strategy.
2. Act: The LLM outputs a structured action in a Domain Specific Language (DSL), such as add (construct an auxiliary point) or propose (state a sub-goal/proposition).
3. Feedback: The symbolic engine executes the action and returns a result (success/failure, new proven facts).
4. Reflect: The agent analyzes the feedback to guide the next step.
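The Think, Act, Feedback, Reflect loop above can be sketched roughly as follows. Everything here is a stand-in: `ToyEngine` replaces the symbolic engine, `scripted_policy` replaces the LLM, and the DSL strings are invented for illustration. The sketch shows only the control flow, under those assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str     # "add" (auxiliary construction) or "propose" (sub-goal)
    payload: str  # DSL text; the syntax used here is made up

@dataclass
class Feedback:
    ok: bool
    new_facts: tuple = ()

class ToyEngine:
    """Stand-in engine: proves the goal only after one specific point is added."""
    def __init__(self, needed_point, goal):
        self.needed_point, self.goal = needed_point, goal
        self.constructed = set()

    def execute(self, action):
        if action.kind == "add":
            self.constructed.add(action.payload)
            return Feedback(ok=True)
        if action.kind == "propose":
            proved = (action.payload == self.goal
                      and self.needed_point in self.constructed)
            return Feedback(ok=proved, new_facts=(action.payload,) if proved else ())
        return Feedback(ok=False)

def scripted_policy(history):
    """Stand-in for the LLM: picks (thought, action) based on history length."""
    script = [
        ("Try the goal directly.", Action("propose", "perp AB CD")),
        ("That failed; add the midpoint of BC.", Action("add", "M = midpoint B C")),
        ("Retry the goal with M available.", Action("propose", "perp AB CD")),
    ]
    return script[min(len(history), len(script) - 1)]

def run_episode(policy, engine, goal, max_steps=200):
    history = []
    for step in range(max_steps):
        thought, action = policy(history)             # Think + Act
        feedback = engine.execute(action)             # Feedback
        history.append((thought, action, feedback))   # Reflect: next call sees it
        if action.kind == "propose" and action.payload == goal and feedback.ok:
            return True, step + 1
    return False, max_steps

engine = ToyEngine(needed_point="M = midpoint B C", goal="perp AB CD")
solved, steps = run_episode(scripted_policy, engine, goal="perp AB CD")
```

In this toy run the first proposal fails, the auxiliary point is added, and the second proposal succeeds on the third step.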

B. Dynamic Memory Mechanism

To handle the "weak heuristic" problem, the agent must explore many possibilities over a long horizon (more than 200 interaction steps per problem).

Challenge: Standard context windows cannot hold hundreds of turns of detailed history without losing focus or exceeding limits.
Solution: A Dynamic Memory Manager ($W$) compresses the interaction history. It retains:
- Core action outputs (what was built).
- Key environment feedback (what was proven or failed).
It discards verbose reasoning details while preserving the "state" of the proof.
Rejection Sampling: To prevent "action collapse" (repeating the same failed patterns), the agent uses a rule-based check to reject outputs that repeat previous actions or lack valid progress.
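A minimal sketch of the two mechanisms in this subsection, under stated assumptions: each turn is a dict with `thought`, `action`, and `feedback` keys, the `keep_last` parameter (retaining full reasoning only for the most recent turns) is a choice made for this example, and the rejection rule covers only empty or repeated actions. The paper's actual memory manager and rules may differ.

```python
def compress_history(turns, keep_last=2):
    """Keep action and feedback for every turn; keep the verbose reasoning
    only for the most recent `keep_last` turns (assumed policy)."""
    cutoff = max(len(turns) - keep_last, 0)
    compressed = []
    for i, turn in enumerate(turns):
        entry = {"action": turn["action"], "feedback": turn["feedback"]}
        if i >= cutoff:
            entry["thought"] = turn["thought"]
        compressed.append(entry)
    return compressed

def should_reject(candidate_action, turns):
    """Rule-based filter: reject an empty action or an exact repeat of an
    earlier action (whitespace-insensitive)."""
    normalized = " ".join(candidate_action.split())
    if not normalized:
        return True
    seen = {" ".join(t["action"].split()) for t in turns}
    return normalized in seen

turns = [
    {"thought": "long reasoning 1", "action": "add M = midpoint B C", "feedback": "ok"},
    {"thought": "long reasoning 2", "action": "propose perp AB CD", "feedback": "failed"},
    {"thought": "long reasoning 3", "action": "add N = foot A B C", "feedback": "ok"},
]
memory = compress_history(turns, keep_last=1)
```

Here `memory` still records all three actions and their feedback, but only the last turn carries its reasoning text, which keeps the prompt short over hundreds of steps.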

C. Complexity-Boosting Reinforcement Learning (CBRL)

To train the agent efficiently without massive datasets, the authors introduce CBRL, a curriculum learning framework.

Cold Start: The model is first fine-tuned on 7K formalized geometry problems (SFT).
Iterative Curriculum:
1. Data Synthesis: A pipeline generates synthetic geometry problems with a specific complexity level ( $\kappa$ ), defined by the number of proof steps required by the DDAR engine.
2. RL Training: The agent is trained on problems at the current complexity level using GRPO (Group Relative Policy Optimization).
3. Complexity Adjustment: The system calculates the average absolute advantage of the agent's performance.
  - If the agent solves too many problems (reward > 0.5), the complexity $\kappa$ is increased.
  - If the agent fails too often (reward < 0.5), $\kappa$ is decreased.
Goal: Keep the agent in a "moderate difficulty" zone where learning signals are strongest, gradually scaling up to expert-level problems.
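The link between the 0.5 threshold and the advantage signal can be made concrete. With binary rewards and the group-normalized advantage commonly used in GRPO, $A_i = (r_i - \bar{r}) / \sigma$, a group with success rate $p$ has mean absolute advantage $2\sqrt{p(1-p)}$, which peaks at $p = 0.5$ and vanishes when every rollout succeeds or every rollout fails. The sketch below assumes that normalization, a unit step for $\kappa$, and a floor of 1; none of these details are confirmed by the text above.

```python
import math

def mean_abs_advantage(rewards):
    """Mean |A_i| with group-normalized advantages A_i = (r_i - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        return 0.0  # all rollouts agree, so there is no learning signal
    return sum(abs(r - mean) / std for r in rewards) / n

def adjust_complexity(kappa, rewards, step=1, kappa_min=1):
    """Raise kappa when the average reward exceeds 0.5, lower it when below.
    `step` and `kappa_min` are assumptions for this sketch."""
    avg = sum(rewards) / len(rewards)
    if avg > 0.5:
        return kappa + step
    if avg < 0.5:
        return max(kappa_min, kappa - step)
    return kappa

balanced = mean_abs_advantage([1, 1, 1, 1, 0, 0, 0, 0])  # p = 0.5
too_hard = mean_abs_advantage([1, 0, 0, 0, 0, 0, 0, 0])  # p = 0.125
too_easy = mean_abs_advantage([1] * 8)                   # p = 1.0
```

The balanced group gives the largest signal, which is why nudging $\kappa$ toward a success rate of about one half keeps training informative.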

3. Key Contributions

First Medalist-Level LLM Agent for Geometry: InternGeometry is the first LLM-based agent to reach gold-medalist performance on IMO geometry problems without relying on the massive search trees of expert models.
Extreme Data Efficiency: The model achieves its performance using only 13,000 training examples. This is 0.004% of the data used by AlphaGeometry 2 (300M examples) and 0.006% of SeedGeometry.
Long-Horizon Reasoning: Demonstrates that allowing agents to interact with tools for >200 steps (with memory compression) is critical for overcoming weak heuristics in geometry.
Creativity in Auxiliary Constructions: The agent can discover novel auxiliary constructions that differ from human solutions (e.g., using isogonal conjugates in quadrilaterals in a way not seen in standard human proofs).
CBRL Framework: A novel reinforcement learning strategy that automatically adjusts task difficulty to maximize learning efficiency. The ablations indicate that this form of curriculum learning is important for scaling RL in complex domains.
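The data-efficiency percentage quoted under "Extreme Data Efficiency" can be checked with simple arithmetic. The snippet uses only the two figures stated there (13,000 and 300M); the variable names are illustrative.

```python
intern_examples = 13_000
alphageometry2_examples = 300_000_000

# Share of AlphaGeometry 2's training data, expressed as a percentage.
ratio_percent = 100 * intern_examples / alphageometry2_examples
print(f"{ratio_percent:.4f}%")  # rounds to the 0.004% quoted above
```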

4. Experimental Results

Performance on IMO 50 (2000–2024)

InternGeometry: Solved 44 out of 50 problems.
Comparison:
- Surpassed the average IMO Gold Medalist, who solves about 40.9 of the 50 problems.
- Outperformed AlphaGeometry 2 (42/50) and SeedGeometry (43/50).
- Solved IMO 2025 geometry problems (not included in the training set).
Efficiency: Achieved this with a test-time budget of Pass@256 (256 sampling attempts), whereas AlphaGeometry 2 uses complex ensembles of search trees with much higher computational overhead.
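Under a Pass@k budget, a problem is usually counted as solved if at least one of the k sampled attempts produces a proof that the verifier accepts. A minimal sketch of that scoring rule follows; the function names and the toy problem IDs and outcomes are invented for illustration and are not results from the paper.

```python
def solved_at_k(attempt_results, k):
    """True if any of the first k attempts was verified as a correct proof."""
    return any(attempt_results[:k])

def benchmark_score(results_by_problem, k):
    """Number of problems with at least one verified attempt within budget k."""
    return sum(solved_at_k(results, k) for results in results_by_problem.values())

# Toy data: per-problem lists of verified (True) or failed (False) attempts.
toy_results = {
    "problem_a": [False, False, True, False],
    "problem_b": [False, False, False, False],
    "problem_c": [True, False, False, False],
}
score_at_1 = benchmark_score(toy_results, k=1)
score_at_4 = benchmark_score(toy_results, k=4)
```

A larger budget can only keep or raise the score, which is why the sampling budget should be stated alongside any solved count.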

Ablation Studies

Long-Horizon Interaction: Removing the ability to perform long chains of reasoning (limiting steps) drastically reduced performance (e.g., dropping from 44/50 to 23/50 when removing "Slow Thinking" and "Context Compression").
CBRL Effectiveness:
- Training only on easy data: 29/50 (poor generalization).
- Training only on hard data: 24/50 (convergence failure).
- CBRL (Dynamic Difficulty): 44/50. This confirms that gradually increasing complexity is vital for learning.

5. Significance and Conclusion

This work represents a paradigm shift in automated theorem proving for geometry:

From Search to Reasoning: It moves away from the "brute force search" paradigm of expert models (AlphaGeometry) toward LLM-driven reasoning where the model learns to explore and reflect.
Scalability: It shows that LLM agents can master highly complex, creative tasks (like IMO geometry) with orders of magnitude less data than earlier systems used, provided the training involves dynamic curriculum learning and long-horizon tool interaction.
Generalization: The ability to solve problems with novel constructions suggests that LLM agents can develop genuine geometric intuition rather than just pattern matching on known theorems.

In summary, InternGeometry demonstrates that with the right combination of dynamic memory, symbolic tool integration, and complexity-boosted reinforcement learning, LLMs can achieve and even exceed human expert performance in the most challenging mathematical domains.