Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Imagine you are trying to teach a brilliant but inexperienced student (an AI) how to solve complex physics problems. The problem is, you don't have enough good practice tests. The ones you find online are either too easy, full of mistakes, or just "guess the answer" multiple-choice questions that don't actually teach the student how to think.

This paper introduces a new tool called the Infinite Problem Generator (IPG). Think of it as a super-intelligent, robotic physics tutor that can create an endless supply of custom-made, perfectly accurate practice problems.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Hallucination" Trap

Usually, when we ask an AI to make up a math or physics problem, it acts like a creative writer who doesn't know the rules. It might write a story about a car crashing into a wall, but the numbers it invents don't actually add up. It's like a chef who makes a beautiful-looking cake that tastes like soap. The AI "hallucinates" (makes things up) because it's just guessing the next word in a sentence, not actually doing the math.

2. The Solution: "Formulas as Code"

The authors realized that to fix this, they needed to stop treating physics equations like text and start treating them like computer code.

The Old Way: The AI writes, "Force equals mass times acceleration." It's just words.
The New Way (IPG): The AI treats "Force equals mass times acceleration" as a function in a computer program. It's a tool that must work.

Think of it like a Lego set. Instead of just drawing a picture of a castle, the IPG forces the AI to actually build the castle with real bricks. If the bricks don't fit, the castle falls, and the AI knows immediately, "Oops, that didn't work. Let me try again."
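
To make the contrast concrete, here is a minimal sketch of one formula written as code. The function name `newton_second_law` and its input check are illustrative assumptions, not taken from the paper. The point is only that a formula written this way either returns a number or fails loudly.

```python
def newton_second_law(mass_kg: float, acceleration_m_s2: float) -> float:
    """Return the force in newtons from F = m * a."""
    if mass_kg <= 0:
        # A non-physical input is rejected instead of silently producing a number.
        raise ValueError("mass must be positive")
    return mass_kg * acceleration_m_s2


force = newton_second_law(2.0, 3.0)
print(force)  # 6.0
```

A sentence of text cannot "fail" in this way, which is why the brick-by-brick approach gives the system immediate feedback.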

3. How the Robot Tutor Works (The Workflow)

The IPG uses a three-step assembly line to create these problems:

Step 1: The Architect (Analysis)
The system starts with a few high-quality, expert-written problems (like a seed). It analyzes them to understand the "blueprint": What laws of physics are used? What are the rules? (e.g., "Mass must be positive," "Time cannot be negative").

Step 2: The Storyteller (Generation)
Now, the AI gets creative. It takes the same physics blueprint but changes the story.
- Original: A block sliding down a ramp.
- New Version: A skateboarder going down a hill, or a roller coaster loop.
The story changes, but the math underneath stays exactly the same. The AI is told to pick 3 to 5 specific "tools" (formulas) to solve the problem.

Step 3: The Inspector (Verification)
This is the magic part. Before the problem is ever shown to a student, the system runs the solution as a computer program.
- It tries to solve the problem using Python code.
- If the code crashes, or gives a result like "negative time" or "infinity," the system throws the problem in the trash.
- It only keeps the problem if the code runs perfectly and gives a real, sensible answer.

4. The "Complexity Blueprint" (The Secret Sauce)

The researchers discovered something fascinating. They found a direct link between how many formulas a problem needs and how long its solution code is, which means code length tracks how hard the problem is.

Simple Problem: Short code (like a + b = c).
Hard Problem: Long code (like a recipe with 10 steps).

This is like a fitness tracker for math. Instead of a human teacher guessing if a problem is "hard," the system can just count the lines of code. If the code is long, the problem is hard. This allows them to build a "curriculum" that starts easy and gets harder automatically, without needing a human to grade every single one.

5. The Result: A Massive Library of Perfect Problems

Using this method, they took 165 expert problems and turned them into 1,335 new, verified problems.

No Guessing: Every problem's answer is checked by actually running its solution code before the problem is accepted.
No Cheating: The problems require real reasoning, not just pattern matching.
Diverse: They cover everything from simple motion to complex spinning objects (Rigid Body Dynamics).

Why This Matters

Imagine you are training for a marathon.

Old Method: You run on a treadmill that sometimes stops, sometimes speeds up randomly, and the coach just yells "Go faster!" without checking your form.
IPG Method: You run on a track where every step is measured, the coach checks your form with a laser, and the route gets slightly harder every day based on your exact performance.

This paper gives us a way to generate infinite, high-quality training data for AI, ensuring that when the AI learns physics, it's learning the truth, not just making things up. It turns the AI from a "creative writer" into a "reliable engineer."

1. Problem Statement

The adaptation of Large Language Models (LLMs) to high-reasoning domains like physics is severely constrained by a scarcity of verifiable, high-quality training data.

Limitations of Current Methods: Standard text augmentation often introduces "hallucinations" (mathematically invalid statements), while static benchmarks (e.g., JEEBench, UGPhysics) are designed for testing rather than fine-tuning. They lack the dense, step-by-step reasoning traces required to train robust reasoners.
The Gap: Existing synthetic data generation (SDG) methods struggle to maintain logical coherence in physics because they rely on unstructured text or surface-level pattern matching, failing to enforce the rigorous multi-step deduction and implicit constraints inherent in physics problems.

2. Methodology: The Infinite Problem Generator (IPG)

The authors propose IPG, an agentic framework that synthesizes physics problems using a "Formula-as-Code" paradigm. Instead of treating equations as text tokens, IPG treats them as executable Python functions.

The workflow follows a Generate-then-Verify paradigm with three distinct phases:

Phase I: Problem Analysis & Context Expansion

Seed Input: The system starts with expert-written "Seed Tuples" (Question + Solution) from standard textbooks (e.g., Concepts of Physics by H.C. Verma).
Principle Extraction: The agent identifies core physical principles and maps them to real-world scenarios (e.g., mapping "angular acceleration" to "tire rotation" or "fishing reels").
Executable Axioms: Physics formulas are pre-compiled into a structured Python library. The agent queries a "Chapter Dictionary" to build an Available Formula Library relevant to the seed.
Constraint Extraction: A Variable Dictionary is constructed, defining valid physical ranges (e.g., mass $>0$, friction coefficient $\in [0,1]$) to prevent physically impossible instantiations.
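
The Phase I data structures can be sketched as follows. This is a minimal illustration under stated assumptions: the formula functions, the layout of `CHAPTER_DICTIONARY` and `VARIABLE_DICTIONARY`, and the helper names `available_formulas` and `is_valid` are all hypothetical, since the summary does not specify the actual library's API.

```python
import math
from typing import Callable, Dict, List, Tuple


# Hypothetical executable formulas; names and signatures are assumptions.
def final_velocity(u: float, a: float, t: float) -> float:
    """v = u + a * t"""
    return u + a * t


def kinetic_energy(m: float, v: float) -> float:
    """KE = 0.5 * m * v^2"""
    return 0.5 * m * v ** 2


def friction_force(mu: float, normal: float) -> float:
    """f = mu * N"""
    return mu * normal


# Hypothetical "Chapter Dictionary": chapter name -> formulas in that chapter.
CHAPTER_DICTIONARY: Dict[str, List[Callable[..., float]]] = {
    "Kinematics": [final_velocity],
    "Work and Energy": [kinetic_energy],
    "Friction": [friction_force],
}

# Hypothetical "Variable Dictionary": name -> (low, high, low_inclusive).
VARIABLE_DICTIONARY: Dict[str, Tuple[float, float, bool]] = {
    "m": (0.0, math.inf, False),  # mass > 0
    "mu": (0.0, 1.0, True),       # friction coefficient in [0, 1]
}


def available_formulas(chapters: List[str]) -> Dict[str, Callable[..., float]]:
    """Collect the formulas of the requested chapters into one library."""
    library: Dict[str, Callable[..., float]] = {}
    for chapter in chapters:
        for formula in CHAPTER_DICTIONARY[chapter]:
            library[formula.__name__] = formula
    return library


def is_valid(name: str, value: float) -> bool:
    """Check a candidate value against its allowed physical range."""
    low, high, low_inclusive = VARIABLE_DICTIONARY[name]
    above_low = value >= low if low_inclusive else value > low
    return above_low and value <= high
```

With this layout, `available_formulas(["Kinematics", "Friction"])` would give the agent two callable formulas, and `is_valid("mu", 1.2)` would reject an impossible friction coefficient before a problem is ever instantiated.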

Phase II: Constrained Problem Generation

Narrative Variation: The agent cycles through different real-world scenarios while keeping the underlying physics invariant.
Formula Selection: The agent is explicitly instructed to select a specific number of formulas (e.g., 3–5) from the executable library to solve the problem.
Uniqueness Check: Each problem is assigned a Problem Signature (a hash of the formula set and the target variable). Collisions trigger regeneration to ensure diversity.
Difficulty Control: Complexity is controlled by limiting the size of the active formula subset.
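
The uniqueness check above can be sketched as below. The summary only states that the signature is a hash of the formula set and the target variable. The choice of SHA-256, the canonical string layout, and the names `problem_signature` and `is_new_problem` are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Set


def problem_signature(formula_names: Iterable[str], target_variable: str) -> str:
    """Hash the (unordered) formula set together with the target variable."""
    # Sorting makes the signature independent of the order formulas are listed in.
    canonical = "|".join(sorted(set(formula_names))) + "->" + target_variable
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_new_problem(formula_names: Iterable[str], target_variable: str,
                   seen: Set[str]) -> bool:
    """Return False on a signature collision so the caller can regenerate."""
    signature = problem_signature(formula_names, target_variable)
    if signature in seen:
        return False
    seen.add(signature)
    return True


seen_signatures: Set[str] = set()
print(is_new_problem(["final_velocity", "kinetic_energy"], "KE", seen_signatures))  # True
print(is_new_problem(["kinetic_energy", "final_velocity"], "KE", seen_signatures))  # False
```

Because the narrative text is not part of the signature, two problems that differ only in their story would collide, which is what pushes the generator toward structurally different problems.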

Phase III: Code-Based Verification

Executable Solutions: For every generated problem, the agent must produce a Python script that solves the problem using only the selected formulas from the library.
Verification Criteria: The code is executed in a sandboxed environment. A problem is accepted only if:
1. Syntactic Validity: The code runs without errors.
2. Numerical Solvability: The output is finite (no NaN or $\infty$).
3. Physical Sanity: Results satisfy basic constraints (e.g., time $>0$, mass $>0$).
Iterative Correction: Failed generations are re-prompted with structured error traces to correct specific logical or coding errors.
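
The three acceptance criteria can be sketched as a single check. This is an illustration under assumptions, not the paper's implementation: the convention that a script stores its outputs in a dict named `result`, the function name `verify_solution`, and the default list of positive-only variables are all hypothetical. Note also that `exec()` is not a sandbox. It stands in here for whatever isolated environment a real pipeline would use.

```python
import math
from typing import Dict, Tuple


def verify_solution(script: str,
                    positive_vars: Tuple[str, ...] = ("time", "mass")) -> Tuple[bool, str]:
    """Apply the three acceptance checks to a generated solution script.

    Assumed convention: the script stores its outputs in a dict named `result`.
    WARNING: exec() offers no isolation; use a real sandbox for untrusted code.
    """
    namespace: Dict[str, object] = {}

    # 1. Syntactic validity: the code must compile and run without errors.
    try:
        exec(compile(script, "<generated>", "exec"), namespace)
    except Exception as exc:
        return False, f"execution error: {type(exc).__name__}: {exc}"

    result = namespace.get("result")
    if not isinstance(result, dict) or not result:
        return False, "no `result` dict produced"

    for name, value in result.items():
        # 2. Numerical solvability: every output must be a finite number.
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            return False, f"non-finite value for {name}"
        # 3. Physical sanity: selected quantities must be strictly positive.
        if name in positive_vars and value <= 0:
            return False, f"{name} must be positive, got {value}"

    return True, "accepted"


good = "distance = 20.0\nspeed = 4.0\nresult = {'time': distance / speed}"
bad = "result = {'time': -2.0}"
print(verify_solution(good))  # (True, 'accepted')
print(verify_solution(bad))   # (False, 'time must be positive, got -2.0')
```

The returned message plays the role of the structured error trace: on rejection it could be fed back into the next prompt so that the agent corrects the specific failure rather than regenerating blindly.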

3. Key Contributions

Agentic Verification Framework (IPG): A novel pipeline that couples narrative variation with code-execution verification, significantly mitigating mathematical hallucinations in synthetic data.
ClassicalMechanicsV1 Dataset: A release of 1,335 high-fidelity classical mechanics problems (expanded from 165 expert seeds) with executable solution paths and verified numerical correctness.
The "Complexity Blueprint": The finding of a strong linear correlation ($R^2 \approx 0.95$) between the number of integrated physics formulas and the length of the verification code. This establishes code complexity as a precise, automatically computable measure of problem difficulty, enabling controllable curriculum generation without human annotation.
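
For readers who want to see how such a correlation is measured, the sketch below fits a least-squares line and computes $R^2$ using only the standard library. The function name `linear_fit_r2` is an assumption, and the sample numbers are made-up toy values chosen only to exercise the code. They are not the paper's data.

```python
from typing import Sequence, Tuple


def linear_fit_r2(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y values must not all be identical")
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy values (NOT from the paper): formulas per problem vs. lines of solution code.
formula_counts = [1, 2, 3, 4, 5]
code_lines = [8, 13, 19, 24, 31]
slope, intercept, r2 = linear_fit_r2(formula_counts, code_lines)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

An $R^2$ close to 1 means that the formula count explains almost all of the variation in code length, which is what lets one quantity stand in for the other.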

4. Results and Analysis

Dataset Statistics: The corpus spans 102 unique physical formulas with an average complexity of 3.05 formulas per problem. The distribution is Gaussian-like, centered on intermediate-depth reasoning (3 formulas), with a "complexity tail" of 260 problems requiring 4–6 formulas.
Domain Mixing: The agent successfully breaks chapter boundaries. For example, "Rigid Body Dynamics" problems utilized 53 unique formulas, drawing heavily from Kinematics and Energy chapters, rather than relying solely on the native chapter library.
Verification Success: The execution-based verification achieved a 99.85% success rate. Only 2 problems in the final set were numerically unstable.
Failure Mode Analysis:
- Low Complexity (0–1 formulas): Failures are mostly trivial, definition-level problems rather than reasoning errors.
- Medium Complexity (2–3 formulas): High validity (>99%); primary issues were "unused distractor variables."
- High Complexity (4+ formulas): A "Fragility Shift" occurs where errors shift to Signature Mismatches (the agent derives intermediate values correctly but fails to chain them to the final target variable), highlighting current LLM limitations in long-horizon variable tracking.
Downstream Evaluation: When tested on a Qwen3-14B model, the dataset proved to be a rigorous benchmark. The model scored lower on IPG-generated problems (34.96%) than on JEEBench (47.97%), suggesting that the generated problems capture and stress-test high-tier reasoning complexity without the "gameability" of multiple-choice formats.
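
Two of the failure modes named above, unused distractor variables and a solution that never reaches the target variable, can be approximated with a static check on the solution script. The sketch below is one possible reading, not the paper's detector: the function name `diagnose` and the rule "the target must be assigned somewhere in the script" are assumptions.

```python
import ast
from typing import List, Set


def diagnose(script: str, given: Set[str], target: str) -> List[str]:
    """Flag givens that are never read and a target that is never assigned."""
    tree = ast.parse(script)
    names = [node for node in ast.walk(tree) if isinstance(node, ast.Name)]
    loaded = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    stored = {n.id for n in names if isinstance(n.ctx, ast.Store)}

    issues: List[str] = []
    unused = sorted(given - loaded)
    if unused:
        # Givens that no expression ever reads behave like distractor variables.
        issues.append("unused givens: " + ", ".join(unused))
    if target not in stored:
        # Intermediate values may be right, but the chain never reaches the target.
        issues.append(f"target '{target}' never assigned")
    return issues


script = """
m = 2.0
g = 9.8
h = 5.0
angle = 30.0
v = (2 * g * h) ** 0.5
"""
print(diagnose(script, {"m", "g", "h", "angle"}, "ke"))
```

In this example the script computes the speed `v` but stops before the kinetic energy `ke`, so both kinds of issue are reported.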

5. Significance and Future Work

Bridging the Training-Testing Gap: IPG provides the first large-scale, "training-ready" corpus for physics that includes executable reasoning traces, addressing the lack of data for Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
Controllable Curriculum: The "Complexity Blueprint" allows researchers to generate datasets with specific difficulty levels programmatically, eliminating the need for expensive human labeling.
Future Directions:
- Expanded Domains: Extending the framework to Electromagnetism and Optics.
- Multimodal Integration: Generating visual diagrams (SVG/TikZ) alongside text to support geometry-intensive reasoning.
- Adaptive Assessment: Using the complexity signal to dynamically assemble problem sets for adaptive learning systems.
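
The controllable-curriculum idea can be sketched as a simple grouping by formula count. The `Problem` record and the function `build_curriculum` are hypothetical names, and using the formula count directly as the difficulty level is an assumption that follows from the complexity signal described above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Problem:
    problem_id: str
    formula_count: int  # number of formulas used by the verified solution


def build_curriculum(problems: List[Problem]) -> Dict[int, List[str]]:
    """Group problem IDs by formula count, ordered from easiest to hardest."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for problem in problems:
        buckets[problem.formula_count].append(problem.problem_id)
    # Dicts preserve insertion order, so inserting sorted keys yields an ordered curriculum.
    return {level: buckets[level] for level in sorted(buckets)}


pool = [Problem("p1", 4), Problem("p2", 2), Problem("p3", 3), Problem("p4", 2)]
print(build_curriculum(pool))  # {2: ['p2', 'p4'], 3: ['p3'], 4: ['p1']}
```

An adaptive system could then draw the next problem from a higher or lower level depending on the learner's recent results.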

In summary, this work demonstrates that by shifting from probabilistic text generation to agentic, code-executed verification, it is possible to scale the production of high-quality, logically rigorous physics reasoning data, substantially easing the data scarcity bottleneck for training advanced reasoning models.