A method for the automated generation of proof exercises with comparable levels of proving complexity

Imagine you are a math teacher. Your job isn't just to stand in front of a class and lecture; it's also to create hundreds of practice problems for your students. You need to make sure that if you give a problem to a beginner, it's not too hard, and if you give one to an advanced student, it's not too easy.

The problem is that making these problems by hand is exhausting. It takes hours to design a proof question that is "just right."

This paper introduces a robot assistant that can automatically generate these proof problems. But here's the catch: most robots are bad at judging difficulty. They might give a student a problem that looks simple but is actually a nightmare to solve, or vice versa.

This paper presents a new method to teach the robot how to judge difficulty accurately, specifically for mathematical proofs (like those in Set Theory or Number Theory).

Here is how it works, broken down with some creative analogies:

1. The Problem: The "Look-Alike" Trap

Imagine you have two puzzles.

Puzzle A looks like a simple 3-piece jigsaw.
Puzzle B looks like a simple 3-piece jigsaw.

To the naked eye, they look identical. But if you try to solve them, Puzzle A fits together in 5 seconds, while Puzzle B requires you to force the pieces, break a few, and take 20 minutes.

Current computer systems often judge difficulty by how the puzzle looks (how many pieces, how complex the picture is). This paper argues that's wrong. We need to judge difficulty by how the puzzle is solved.

2. The Solution: The "Blueprint" Approach

The authors created a system that doesn't just look at the question; it builds a blueprint of the solution before it even decides if the question is good.

They use a special kind of logic tool called "Theory-Specific Tableaux."

The Analogy: Think of a proof as a tree growing in a garden.
The Rules: Instead of letting the tree grow wild, they give it a strict set of gardening rules (called definitional axioms). These rules say, "You can only grow a branch if you use these specific tools."
The Result: Because the rules are so strict, every proof becomes a clean, structured tree whose steps contain no logical symbols (no connectives such as "and," "or," and "not," and no quantifiers such as "for all"). It's like translating a messy, handwritten recipe into a standardized, step-by-step cooking instruction card.

3. Measuring Difficulty: Counting the Steps

How does the robot know if two problems are equally hard?
It looks at the structure of the solution tree.

The Analogy: Imagine two hikers trying to reach the top of a mountain.
- Hiker A takes a path with 10 small steps.
- Hiker B takes a path with 10 small steps.
- Even if the mountains look different, the effort is the same because the number and type of steps are identical.

The robot calculates the "size" of the solution tree (how many branches, how many steps). If two different math problems require solution trees that are structurally identical (the same shape, even though the symbols written on the branches differ), the robot declares them to have the same difficulty level.

4. The Magic Trick: The "Cut"

To make this work, the robot uses a technique called a "Cut."

The Analogy: Imagine you are solving a mystery. You have a clue that says "The butler did it." You also have a clue that says "The butler didn't do it."
In normal logic, you might get stuck arguing back and forth.
The "Cut" method is like a referee stepping in and saying, "Okay, let's assume the butler did it. Does that lead to a contradiction? Yes? Okay, assume he didn't. Does that lead to a contradiction? Yes? Great, we found our answer."

This method allows the robot to cut through the noise and see the core structure of the proof, ignoring the fluff. This ensures that the difficulty measurement is based on the logic, not the wording.

5. Generating New Problems

Once the robot has a "Gold Standard" problem (one whose smallest solution tree is known), it feeds that problem into a substitution machine.

The Analogy: Imagine you have a perfect cake recipe. You know exactly how hard it is to bake (mixing 3 bowls, baking for 20 mins).
The robot takes that recipe and swaps the ingredients: "Okay, instead of flour, let's use almond meal. Instead of eggs, let's use applesauce."
It checks: "Does this new recipe still require mixing 3 bowls and baking for 20 minutes?"
If yes, it's a new problem with the exact same difficulty as the original.

The robot does this with math symbols. It swaps "Union" ( $\cup$ ) for "Intersection" ( $\cap$ ) or "Difference" ( $\setminus$ ), but it checks to make sure the structure of the solution remains the same.

Why Does This Matter?

For Teachers: You can generate many practice problems from a single starting problem. If you need 50 problems that are "Medium Difficulty," the robot can produce them automatically, each one requiring a solution with the same structure as the original.
For Students: No more unfair surprises. Everyone gets a problem that matches their current skill level.
For Personalized Learning: Imagine a video game that adjusts the level of the boss fight based on how well you are playing. This system could do that for math homework, giving harder proofs only when a student is ready for them.

In a Nutshell

This paper teaches a computer to stop judging math problems by their cover. Instead, it teaches the computer to solve the problem first, measure the effort required to solve it, and then generate new problems that require the exact same amount of mental effort to solve. It's like a master chef who can create a thousand new recipes that all take exactly the same amount of time and skill to cook.

Here is a detailed technical summary of the paper "A method for the automated generation of proof exercises with comparable levels of proving complexity" by Mendes, Marcos, and Terrematte.

1. Problem Statement

The paper addresses a critical gap in Automatic Question Generation (AQG) for formal disciplines, specifically Discrete Mathematics and Logic. While AQG tools exist, they lack mechanisms for fine-grained control over the difficulty of generated exercises.

Current Limitations: Existing methods often rely on:
- Syntactic metrics: Counting atoms, connectives, or formula depth. These fail because two formulas with identical syntax can require vastly different proof strategies and effort.
- Machine Learning (QDET): These models predict difficulty based on human labeling, which is inconsistent, lacks explainability, and does not guarantee pedagogical alignment.
The Core Challenge: How to automatically generate proof exercises that are guaranteed to have comparable levels of proving complexity (i.e., similar cognitive effort to solve) without relying on subjective human judgment.

2. Methodology

The authors propose a formal, logic-based approach that shifts the focus from the syntax of the problem statement to the structure of the solution (the proof). The methodology consists of three main pillars:

A. Theory-Specific Proofs (Cut-Based Tableaux)

Instead of using standard Smullyan-style tableaux, the method employs Theory-Specific Proofs based on the KE methodology (cut-based tableaux).
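
For orientation, the single branching rule of the general KE system is the principle of bivalence (PB), which plays the role of the cut. The display below is a sketch in signed notation, where $A$ stands for an arbitrary formula. How the paper restricts the choice of $A$ in the theory-specific setting is not covered in this summary, so the display should be read as the generic KE rule only.

```latex
\[
\frac{}{\;{+}A \quad \big| \quad {-}A\;}\ (\mathrm{PB})
\]
```

All other KE rules are linear: they extend the current branch without splitting it, which is why the extracted theory-specific rules can be described as linear expansion rules.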

Logical Symbol-Free: The proofs are constructed using theory-specific signed formulas (atomic formulas with signs $+$ or $-$ ) that contain no logical connectives ( $\neg, \land, \lor, \to$ ) or quantifiers in the proof steps.
Rule Extraction: Rules for these proofs are not hand-coded but mechanically extracted from definitional axioms of a specific mathematical theory (e.g., Set Theory).
- Process: Definitional axioms $\to$ Prenex Normal Form (PNF) $\to$ Skolemization $\to$ Conjunctive Normal Form (CNF) $\to$ Rule Implicational Normal Form (RINF).
- Result: A set of linear expansion rules where premises and conclusions are purely atomic (theory-specific).
Analytic Restrictions: To ensure termination and control complexity, rules must satisfy three restrictions:
1. Variables in the conclusion must exist in the premises.
2. The predicate symbol in the conclusion must be "smaller" (according to a well-founded relation) than in at least one premise.
3. If predicates are identical, the syntactic depth of the conclusion must be smaller than that of the premises.
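
The three restrictions can be read as a mechanical check on a candidate rule. The sketch below is a minimal illustration in Python, not the authors' implementation. The `Signed` class, the term encoding (variables as strings, applications as tuples), the `rank` table standing in for the well-founded relation, and the reading of restrictions 2 and 3 as alternatives (with depth compared against the deepest premise that has the same predicate) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signed:
    """A signed atomic formula, e.g. + (x in A cap B). Hypothetical encoding."""
    sign: str    # "+" or "-"
    pred: str    # predicate symbol, e.g. "in" or "subseteq"
    args: tuple  # terms: a variable is a str, an application is (symbol, arg, ...)


def variables(term):
    """Collect the variables occurring in a term."""
    if isinstance(term, str):
        return {term}
    return set().union(*(variables(a) for a in term[1:]))


def depth(term):
    """Nesting depth of a term; a bare variable has depth 0."""
    if isinstance(term, str):
        return 0
    return 1 + max((depth(a) for a in term[1:]), default=0)


def formula_vars(f):
    return set().union(*(variables(a) for a in f.args))


def formula_depth(f):
    return max((depth(a) for a in f.args), default=0)


def is_analytic(premises, conclusion, rank):
    """Check the three restrictions under the assumptions stated above."""
    # Restriction 1: every variable of the conclusion occurs in some premise.
    premise_vars = set().union(*(formula_vars(p) for p in premises))
    if not formula_vars(conclusion) <= premise_vars:
        return False
    # Restriction 2: the conclusion's predicate is smaller than some premise's.
    if any(rank[conclusion.pred] < rank[p.pred] for p in premises):
        return True
    # Restriction 3: same predicate, so the conclusion must be shallower.
    same = [p for p in premises if p.pred == conclusion.pred]
    return bool(same) and formula_depth(conclusion) < max(formula_depth(p) for p in same)


rank = {"in": 0, "subseteq": 1}  # assumed ordering: "in" is below "subseteq"
cap_elim = ([Signed("+", "in", ("x", ("cap", "A", "B")))], Signed("+", "in", ("x", "A")))
cap_intro = ([Signed("+", "in", ("x", "A")), Signed("+", "in", ("x", "B"))],
             Signed("+", "in", ("x", ("cap", "A", "B"))))
print(is_analytic(*cap_elim, rank), is_analytic(*cap_intro, rank))
```

Under these assumptions, an elimination-style rule for intersection passes the check, while the corresponding introduction-style rule fails because its conclusion is deeper than its premises.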

B. Defining Proving Complexity

The paper defines "complexity" not by the length of the text, but by the deductive size of the minimal proof.

Justification Trees: A proof is represented as a tree where nodes are justified by rule applications.
Deductive Isomorphism: Two proofs are considered to have comparable complexity if their justification trees are isomorphic (same structure) and the formulas within them are syntactically isomorphic (same abstract structure and variable sets).
Minimal Proofs: The complexity of an exercise is determined by its minimal ER-proof (the proof with the fewest nodes among all valid proofs).
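
To make the structural part of this definition concrete, the sketch below compares justification trees by shape and picks the smallest ones. It is an illustration under simplifying assumptions, not the paper's algorithm: a tree is encoded as a tuple of its child subtrees (a leaf is the empty tuple), children are treated as unordered, and the second condition of deductive isomorphism (syntactic isomorphism of the formulas at each node) is not checked. The function names are hypothetical.

```python
def shape(tree):
    """Canonical form of an unlabeled tree: children sorted recursively."""
    return tuple(sorted(shape(child) for child in tree))


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree)


def same_shape(a, b):
    """True if the two trees are isomorphic as unordered trees."""
    return shape(a) == shape(b)


def minimal(proofs):
    """Keep only the proofs with the fewest nodes."""
    smallest = min(size(p) for p in proofs)
    return [p for p in proofs if size(p) == smallest]


# Two trees that differ only in the order of their children.
t1 = (((), ()), ())
t2 = ((), ((), ()))
# A tree with the same node count as t1 but a different shape.
t3 = (((),), ((),))

print(same_shape(t1, t2), same_shape(t1, t3), size(t1) == size(t3))
```

The third tree shows why node count alone is weaker than isomorphism: `t1` and `t3` have the same size yet different shapes, so they would not count as comparable under the definition above.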

C. The Generation Procedure

The automated generation process takes an input exercise (defined as a set of signed formulas) and performs two steps:

Search for Minimal Proofs: The system constructs tableaux incrementally ( $T_1, T_2, \dots$ ) until it finds all minimal proofs for the input set. This establishes the "complexity baseline."
Search for Proof-Isomorphic Sets: The system generates new exercises by replacing symbols in the input exercise with deductive matching symbols.
- Deductive Matching: A symbol $s_2$ is a match for $s_1$ if they appear in structurally identical positions within the premises of isomorphic rules.
- Constraint: The system only generates candidates that, when proven, result in a minimal proof deductively isomorphic to the original minimal proof.
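
The second step can be pictured as enumerating symbol substitutions and keeping only those that pass the isomorphism constraint. The sketch below is a simplified illustration in Python, not the GePECC implementation. It works on a single term rather than a full set of signed formulas; the `matches` table of deductive matching symbols is supplied by hand; and the constraint is passed in as a callback (`keeps_minimal_proof_shape`), because checking it for real requires a prover. All of these names are hypothetical.

```python
from itertools import product


def operators(term):
    """List the operator symbols of a term in pre-order."""
    if isinstance(term, str):  # a variable
        return []
    ops = [term[0]]
    for arg in term[1:]:
        ops.extend(operators(arg))
    return ops


def rebuild(term, new_ops):
    """Rebuild a term, taking replacement operators from an iterator in pre-order."""
    if isinstance(term, str):
        return term
    head = next(new_ops)
    return (head, *(rebuild(arg, new_ops) for arg in term[1:]))


def candidates(term, matches):
    """Yield every variant obtained by swapping operators for matching symbols."""
    choices = [matches.get(op, [op]) for op in operators(term)]
    for combo in product(*choices):
        variant = rebuild(term, iter(combo))
        if variant != term:  # skip the input exercise itself
            yield variant


def generate(term, matches, keeps_minimal_proof_shape):
    """Keep only the candidates accepted by the (externally supplied) constraint."""
    return [c for c in candidates(term, matches) if keeps_minimal_proof_shape(c)]


# y cap (w cup z), with a hand-written, assumed table of matching symbols.
term = ("cap", "y", ("cup", "w", "z"))
matches = {"cap": ["cap", "setminus"], "cup": ["cup", "triangle"]}
for variant in candidates(term, matches):
    print(variant)
```

Restricting each position to its matching symbols is what keeps the candidate set small: without the `matches` table, every operator of the theory would have to be tried at every position before the proof check runs.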

3. Key Contributions

Formal Definition of Proving Complexity: The paper introduces a rigorous, non-heuristic definition of difficulty based on the isomorphism of minimal proof structures rather than surface-level syntax.
Theory-Specific Tableaux: It adapts the KE cut-based methodology to create a deductive system free of logical symbols, allowing for the extraction of rules directly from mathematical definitions (e.g., Set Theory axioms).
Mechanizable Rule Extraction: A novel procedure is presented to convert definitional axioms into a set of linear, analytic, theory-specific rules suitable for automated proof search.
Proof-Isomorphic Generation Algorithm: A computational procedure that guarantees generated exercises have the same "effort" to solve as the input by enforcing structural isomorphism of their minimal proofs.
Explainability: Unlike ML models, the difficulty of a generated exercise is explicitly explained by its minimal proof structure.

4. Results and Evaluation

Case Study: The method was implemented and tested using a fragment of Set Theory (involving operations like $\cup, \cap, \setminus, \times, \triangle, \subseteq, \in$ ).
Prototype: A working prototype is available (GitHub: joaomendesln/GePECC).
Demonstration: The authors demonstrated that exercises like "Prove $x \in y \cap (w \cup z) \implies x \in (y \cap w) \cup z$" and "Prove $x \in y \setminus (w \triangle z) \implies x \in (y \setminus w) \cup z$" are generated as having comparable complexity.
- Both exercises require minimal proofs with identical tree structures (deductive isomorphism), despite using different set operators.
Efficiency: The use of "deductive matching symbols" significantly reduced the search space for candidate exercises (e.g., reducing test cases by ~40x in the example provided).

5. Significance and Future Work

Pedagogical Impact: This approach enables the creation of personalized adaptive learning systems where tutors can generate many variations of a problem whose difficulty, as measured by minimal proof structure, stays consistent. This addresses the "difficulty control" bottleneck in AQG.
Theoretical Advancement: It bridges the gap between informal mathematical reasoning (which often hides logical steps) and formal proof systems by using theory-specific rules that mimic the "element argument" and other standard proof strategies.
Limitations & Future Directions:
- Scope: Currently limited to exercises describable in Signed Theory-Specific Normal Form (STSNF). Exercises requiring complex propositional structures not reducible to this form are out of scope.
- Axiom Restrictions: Some definitional axioms cannot yield theory-specific rules due to analytic restrictions; future work aims to relax these.
- Pedagogical Validation: The authors plan to conduct empirical studies to verify if students perceive exercises with isomorphic proofs as having equal difficulty, particularly regarding the cognitive load of multi-premise rules vs. single-premise rules.

In summary, this paper provides a robust, logic-driven framework for automating the generation of mathematically rigorous proof exercises, ensuring that difficulty is controlled by the intrinsic structural complexity of the solution rather than superficial syntactic features.