MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Imagine you are trying to teach a brilliant student (a Large Language Model, or LLM) how to become a grandmaster at math.

Currently, the student is good at solving textbook problems, but they hit a wall when faced with truly difficult, Olympiad-level challenges. Why? Because the "textbooks" we have are full of easy and medium problems. There just aren't enough super-hard practice questions available to train them.

Existing methods try to fix this by taking an easy problem and "remixing" it—changing the numbers or the wording. But this is like taking a simple recipe for toast, adding a little extra butter, and calling it a gourmet meal. It's not actually new, and the student eventually memorizes the pattern rather than learning to think deeply.

Enter MathSmith. Think of MathSmith not as a remix artist, but as a Master Blacksmith.

The Blacksmith's Forge: How MathSmith Works

Instead of recycling old problems, MathSmith builds new, incredibly tough problems from scratch using a three-step process:

1. Gathering Raw Materials (The Concept Mine)

Most methods start with a finished problem. MathSmith starts with raw materials. It digs into a massive encyclopedia of advanced math (PlanetMath) and pulls out pure, abstract concepts like "Hermitian inner products" or "Lattices with operators."

Analogy: Imagine a chef who doesn't buy pre-made lasagna. Instead, they go to a farm, pick fresh, rare vegetables, and grind their own spices. MathSmith draws these "concept nuggets" at random rather than starting from existing problems, which makes it far less likely to reproduce a problem the student has already seen (reducing "cheating," or data contamination).

2. The Blueprint (The 9 Difficulty Strategies)

To turn these raw concepts into a "hard" problem, MathSmith uses a special blueprint with 9 rules for difficulty. These are like the blacksmith's tools to make the metal tougher.

The Tools include:
- Multi-step Reasoning: The problem can't be solved in one jump; it needs a long chain of logic.
- Cross-topic Integration: It forces the student to mix algebra with geometry, or number theory with calculus.
- Hidden Traps: It includes "distractors" (red herrings) to trick the student.
- Extreme Conditions: It pushes the math to its absolute limits.

The Process: The AI acts like a master architect. From a small random draw of concepts, it picks two or three that fit together and combines them using these rules to build a brand new, complex structure.

3. The Quality Control (Reinforcement Learning)

This is where the learning happens. Once MathSmith builds a problem, it doesn't simply assume the problem is good. It puts the problem through a stress test.

The "Thinking" Test: It asks a super-smart AI teacher to solve the problem.
The "Length" Metric: The researchers noticed something interesting: Harder problems make the AI think longer. If the teacher AI writes a very long, detailed chain of thought to solve it, that's a sign the problem is truly difficult.
The Reward: MathSmith gets a "gold star" (reward) if the problem is:
1. Well-formed: It comes with a complete design rationale and a clearly stated problem.
2. Complex: It forces the teacher to write a long, deep solution.
3. Consistent: When the teacher solves it several times, most attempts land on the same answer (no ambiguity).

If the problem is too easy, the AI earns little reward and tries again. If it's a masterpiece, it gets rewarded. Over time, the AI learns to forge harder, more interesting problems.

The "Weakness-Focused" Repair Shop

One of the coolest features is the Weakness-Focused Pipeline.

Analogy: Imagine a coach watching a soccer player miss every penalty kick. Instead of making them run laps, the coach creates specific drills just for penalty kicks.
MathSmith does this for math. If a student model keeps failing at a specific concept (like "GCD conditions"), MathSmith generates a batch of new problems specifically targeting that weakness to help the student improve exactly where they are struggling.

The Results: Why It Matters

The researchers tested this on benchmarks drawn from some of the hardest math competitions (like AIME and OlympiadBench).

The Outcome: Models trained on MathSmith's synthetic problems solved these hard challenges significantly more often than models trained on data from earlier synthesis methods.
The Takeaway: By forcing the AI to generate its own "Olympiad-level" training data, we are unlocking a new level of reasoning. It's not just about memorizing more facts; it's about learning how to think through complex, multi-layered puzzles.

In short: MathSmith is an AI that acts as a tireless, creative math teacher. It doesn't just give you more homework; it invents new kinds of homework that are perfectly designed to stretch your brain, ensuring you become a true problem-solving master.

Here is a detailed technical summary of the paper "MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy."

1. Problem Statement

Large Language Models (LLMs) have made significant strides in mathematical reasoning, but their progress is currently bottlenecked by the scarcity of high-quality, high-difficulty training data. Existing data synthesis methods largely rely on transforming human-written problems (e.g., rewriting, back-translation, or augmentation). These approaches suffer from:

Limited Diversity: They are constrained by the distribution and structure of existing human-authored problems.
Lack of Autonomy: They cannot generate problems from scratch, limiting the exploration of novel reasoning paths.
Data Contamination: Relying on existing datasets risks overlap with evaluation benchmarks, so the model may memorize patterns rather than learn to reason.
Difficulty Control: Current methods often lack precise mechanisms to control or escalate problem difficulty beyond simple prompt-level labels.

The authors argue that to advance AI reasoning, models must be capable of autonomously generating intellectually challenging problems that force deep, long-chain reasoning.

2. Methodology: The MathSmith Framework

MathSmith is a novel framework designed to synthesize challenging mathematical problems from scratch. It operates through a three-stage pipeline, moving from concept collection to reinforcement learning optimization.

A. Concept and Explanation Collection (Data Independence)

To ensure data independence and avoid contamination, MathSmith does not use existing math problems as seeds. Instead:

Source: It scrapes PlanetMath, a repository of advanced mathematical concepts.
Process: It extracts approximately 11,000 "Concept + Explanation" pairs.
Mechanism: GPT-4o is used to summarize core concepts from raw text, creating a clean dataset of abstract mathematical building blocks.
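
As a rough illustration of this stage's output, the sketch below models a "Concept + Explanation" pair and a uniform random draw from the collection. The names `ConceptEntry`, `POOL`, and `sample_concepts` are hypothetical, the six entries are toy stand-ins for the roughly 11,000 real pairs, and uniform sampling without replacement is an assumption rather than a confirmed detail of the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptEntry:
    """One 'Concept + Explanation' pair (field names are illustrative)."""
    concept: str
    explanation: str


# Toy stand-in for the real collection of roughly 11,000 pairs.
POOL = [
    ConceptEntry("Hermitian inner product",
                 "An inner product on a complex vector space that is conjugate-symmetric."),
    ConceptEntry("Greatest common divisor",
                 "The largest positive integer that divides each of two given integers."),
    ConceptEntry("Cyclic group", "A group generated by a single element."),
    ConceptEntry("Eigenvalue",
                 "A scalar lambda with A v = lambda v for some nonzero vector v."),
    ConceptEntry("Cauchy sequence",
                 "A sequence whose terms become arbitrarily close to each other."),
    ConceptEntry("Convex function",
                 "A function whose graph lies on or below each of its chords."),
]


def sample_concepts(pool, k, seed=None):
    """Draw k distinct entries uniformly at random (assumed sampling scheme)."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


if __name__ == "__main__":
    for entry in sample_concepts(POOL, k=5, seed=0):
        print(f"{entry.concept}: {entry.explanation}")
```

Because no existing problem enters this step, any problem built from the draw can later be traced back to the concepts it came from.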

B. Supervised Fine-Tuning (SFT) Stage

Base Model: Qwen3-8B.
Cold Start: For each sample, 5 concepts are drawn at random and GPT-4o is prompted to construct a problem from them, yielding ~8,000 initial training samples.
Structure: Each sample includes a Rationale (5-step reasoning process: Analyze, Select, Explain, Incorporate Difficulty, Formulate) and the Problem.
Difficulty Strategies: To ensure complexity, the model is instructed to incorporate at least two of nine predefined difficulty strategies:
1. Multi-step Reasoning
2. Cross-topic Integration
3. Implicit or Reverse Logic
4. Distractor Construction
5. Abstract Modeling
6. Multiple Solution Paths
7. Advanced Manipulation
8. Extreme Conditions
9. Non-standard Representation
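
To make the cold-start setup concrete, here is a minimal sketch of how such a generation prompt might be assembled. The strategy and step names come from the summary above, but the function name `build_cold_start_prompt` and the prompt wording are hypothetical; the paper's actual prompt is not reproduced here.

```python
# The nine difficulty strategies and five rationale steps listed above.
DIFFICULTY_STRATEGIES = (
    "Multi-step Reasoning",
    "Cross-topic Integration",
    "Implicit or Reverse Logic",
    "Distractor Construction",
    "Abstract Modeling",
    "Multiple Solution Paths",
    "Advanced Manipulation",
    "Extreme Conditions",
    "Non-standard Representation",
)
RATIONALE_STEPS = ("Analyze", "Select", "Explain", "Incorporate Difficulty", "Formulate")


def build_cold_start_prompt(concepts, min_strategies=2):
    """Assemble an illustrative prompt from (name, explanation) tuples.

    The wording is an assumption made for this sketch, not the paper's prompt.
    """
    concept_block = "\n".join(f"- {name}: {expl}" for name, expl in concepts)
    strategy_block = "\n".join(
        f"{i}. {s}" for i, s in enumerate(DIFFICULTY_STRATEGIES, start=1)
    )
    step_block = " -> ".join(RATIONALE_STEPS)
    return (
        "You are given the following mathematical concepts:\n"
        f"{concept_block}\n\n"
        f"Write a rationale that follows these steps in order: {step_block}.\n"
        f"Incorporate at least {min_strategies} of these difficulty strategies:\n"
        f"{strategy_block}\n\n"
        "Then state a single challenging problem."
    )


if __name__ == "__main__":
    demo = [("Greatest common divisor",
             "The largest positive integer that divides each of two given integers.")]
    print(build_cold_start_prompt(demo))
```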

C. Reinforcement Learning (RL) Stage

The SFT model is further optimized using Group Relative Policy Optimization (GRPO) to refine problem quality. The reward function is a composite of three components:

Structural Reward ($r_{\text{structure}}$): Ensures the output contains valid "rationale" and "problem" segments and adheres to the 5-step rationale format.
Reasoning Complexity Reward ($r_{\text{complexity}}$):
- Heuristic: Uses the token length of the reasoning trace generated by a powerful teacher model (Qwen3-30B-A3B) as a proxy for problem difficulty.
- Logic: Longer chain-of-thought (CoT) traces are taken as a sign of deeper, more structured reasoning. The reward is proportional to the normalized token length of the solution.
Answer Consistency Reward ($r_{\text{consistency}}$):
- Samples $K$ solutions from the teacher model.
- Assigns a reward of 1 if a majority answer exists (indicating the problem is well-posed and unambiguous), and 0 otherwise.
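
The three components can be sketched as small functions. This is a minimal illustration under several assumptions: the `<rationale>`/`<problem>` tags, the 32,768-token normalization cap, and reading "majority" as strictly more than half of the $K$ samples are all choices made for this example, not details confirmed by the paper.

```python
import re
from collections import Counter

RATIONALE_STEPS = ("Analyze", "Select", "Explain", "Incorporate Difficulty", "Formulate")


def structure_reward(output):
    """1.0 if both segments exist and the five steps appear in order, else 0.0.

    The <rationale>/<problem> tag format is hypothetical.
    """
    rationale = re.search(r"<rationale>(.*?)</rationale>", output, re.DOTALL)
    problem = re.search(r"<problem>(.*?)</problem>", output, re.DOTALL)
    if rationale is None or problem is None or not problem.group(1).strip():
        return 0.0
    text, pos = rationale.group(1), 0
    for step in RATIONALE_STEPS:
        idx = text.find(step, pos)
        if idx < 0:
            return 0.0
        pos = idx + len(step)
    return 1.0


def complexity_reward(trace_token_count, max_tokens=32768):
    """Teacher trace length, clipped to [0, max_tokens] and scaled to [0, 1]."""
    return min(max(trace_token_count, 0), max_tokens) / max_tokens


def consistency_reward(answers):
    """1.0 if one answer is given by more than half of the K samples, else 0.0."""
    if not answers:
        return 0.0
    _, count = Counter(a.strip() for a in answers).most_common(1)[0]
    return 1.0 if count > len(answers) / 2 else 0.0
```

In practice the answers would need normalizing (for example, equivalent fractions or units) before counting; plain string comparison is used here only to keep the sketch short.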

Variants:

MathSmith-HC: Optimized with both Complexity and Consistency rewards.
MathSmith-Hard: Optimized with only the Complexity reward (prioritizing difficulty over strict consistency).
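
The sketch below shows one way the two variants could combine their reward terms and how GRPO turns a group of rewards into advantages. The group-relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO formulation; the combination rule in `total_reward`, which gates an unweighted sum by the structural reward, is purely an assumption for illustration, since this summary does not state how the terms are weighted.

```python
from statistics import mean, pstdev


def total_reward(r_structure, r_complexity, r_consistency, variant="HC"):
    """Combine reward terms for a variant (gating and equal weights are assumed)."""
    if variant == "HC":
        quality = r_complexity + r_consistency  # complexity + consistency
    elif variant == "Hard":
        quality = r_complexity  # complexity only
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    # A malformed output (r_structure == 0) earns nothing in this sketch.
    return r_structure * quality


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four problems sampled for the same concept draw (made-up component values).
    group = [
        total_reward(1.0, 0.8, 1.0),
        total_reward(1.0, 0.3, 1.0),
        total_reward(1.0, 0.9, 0.0),
        total_reward(0.0, 0.9, 1.0),
    ]
    print(group_relative_advantages(group))
```

Because advantages are relative to the group, a problem is reinforced only when it scores better than its siblings generated from the same input, which removes the need for a separate value model.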

D. Weakness-Focused Improvement Pipeline

A unique feature of MathSmith is its ability to target specific model weaknesses. Since every generated problem is traceable to its source concepts, the framework can:

Identify concepts where a target model performs poorly.
Generate variant problems specifically conditioned on those weak concepts.
Fine-tune the model on these targeted variants to improve performance on specific underperforming areas.
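
The first of these steps can be sketched as a per-concept accuracy tally. The function name `weak_concepts`, the 0.5 accuracy threshold, and the `(source_concepts, is_correct)` record format are hypothetical; the paper's actual selection criterion may differ.

```python
from collections import defaultdict


def weak_concepts(results, threshold=0.5, min_count=1):
    """Return concepts whose accuracy falls below the threshold, weakest first.

    results: iterable of (source_concepts, is_correct) pairs, one per evaluated
    problem, where source_concepts lists the concepts the problem was built from.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for concepts, is_correct in results:
        for concept in concepts:
            totals[concept] += 1
            correct[concept] += int(is_correct)
    accuracy = {
        c: correct[c] / totals[c] for c in totals if totals[c] >= min_count
    }
    weak = [c for c, acc in accuracy.items() if acc < threshold]
    return sorted(weak, key=lambda c: accuracy[c])


if __name__ == "__main__":
    evaluation = [
        (["GCD conditions", "Cyclic group"], False),
        (["GCD conditions"], False),
        (["Cyclic group"], True),
        (["Eigenvalue"], True),
    ]
    print(weak_concepts(evaluation))  # concepts to condition new variants on
```

The returned concepts would then be fed back to the generator as conditioning input for the variant problems.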

3. Key Contributions

Autonomous Problem Synthesis: A framework that constructs problems from scratch using randomly sampled concept-explanation pairs, eliminating reliance on human-written templates and minimizing data contamination.
Multi-Objective RL Optimization: Introduction of a reward system that jointly optimizes structural validity, reasoning depth (via CoT length), and answer consistency.
Difficulty Strategies: Definition and integration of nine specific strategies to systematically increase problem complexity.
Weakness-Focused Generation: A mechanism to generate targeted variants to address specific conceptual gaps in model performance.
State-of-the-Art Performance: Demonstrated superior generalization on hard benchmarks compared to existing synthesis methods.

4. Experimental Results

The authors evaluated MathSmith on five benchmarks: GSM8K and MATH-500 (Easy/Medium), plus AIME2024, AIME2025, and OlympiadBench (Hard).

Performance on Hard Benchmarks:
- MathSmith consistently outperformed baselines (MetaMath, NuminaMath, OpenMathInstruct, PromptCoT).
- On Hard benchmarks, MathSmith-HC achieved relative improvements of 9.8% to 18.1% over the best baselines.
- Under Long-CoT settings, the performance gap widened significantly, suggesting MathSmith problems better elicit complex reasoning.
Scalability:
- Data Scaling: Performance on the Olympiad benchmark improved consistently as training data increased from 50K to 200K, outperforming baselines with wider margins at larger scales.
- Model Scaling: Larger models (e.g., Qwen3-32B) benefited more from MathSmith data than smaller models, indicating the data's high value for deep reasoning capabilities.
Reasoning Depth: Problems synthesized by MathSmith variants elicited significantly longer reasoning traces (up to ~29k tokens) compared to other datasets, validating the link between the reward design and problem complexity.
Weakness Improvement: Targeted generation on weak concepts led to consistent accuracy gains (e.g., +10% on practice sets) and improved generalization to other benchmarks.

5. Significance and Conclusion

MathSmith marks a shift in how mathematical training data is synthesized. By moving from rewriting human-written problems to concept-based construction guided by reinforcement learning, it addresses the scarcity of high-quality, high-difficulty data.

Implication for AI: It validates the hypothesis that synthetic data, when generated with rigorous difficulty controls and structural constraints, can effectively push LLMs toward Olympiad-level reasoning.
Future Direction: The work suggests that "The Bitter Lesson" (scaling computation over handcrafted rules) applies to data generation as well. Future work aims to refine difficulty estimation and expand domain coverage, potentially enabling fully synthetic curricula for advanced reasoning tasks.

The code and data are open-sourced, facilitating further research into scalable synthetic data generation for reasoning tasks.