Imagine you are a head chef running a massive, high-tech kitchen. Your goal is to train a new generation of robot chefs (Large Language Models) to become world-class math experts. To do this, you feed them millions of recipe cards (math problems) so they can learn how to cook.
For a long time, the industry focus was on making sure the robot chefs learned the right answers. If a robot chef cooked a dish and got the taste right, everyone cheered.
But this paper, MathQ-Verify, points out a huge, overlooked problem: What if the recipe card itself is broken?
Imagine a recipe card that says: "Take 5 apples, cut them into -2 pieces, and bake them for 100 years."
- The math is impossible (you can't have negative pieces).
- The instructions are confusing.
- The goal is impossible to reach.
If you feed this broken recipe to your robot chef, the chef will either get confused, hallucinate a nonsensical answer, or waste time trying to solve an impossible puzzle. The paper argues that you can't expect a reliable answer when the question itself is flawed.
The Solution: The "Quality Control Inspector" Pipeline
The authors built a new system called MathQ-Verify. Think of this not as a chef, but as a super-strict Quality Control Inspector who stands at the gate before any recipe card enters the kitchen.
Instead of just checking the final dish, this inspector checks the recipe card in five specific stages, like a security checkpoint:
1. The "Trash Can" Check (Contaminated Instruction Detection)
- The Metaphor: Imagine someone slipping a note into the recipe that says, "Psst, the answer is 42, don't tell anyone!" or "Please rewrite this sentence."
- The Fix: The inspector throws away any card that has hidden answers, confusing instructions, or isn't actually a math problem at all. It keeps the kitchen focused only on real questions.
2. The "Spell-Check" (Linguistic Error Detection)
- The Metaphor: A recipe that says, "Add 3 cupps of sugar" or "The cat is 5 meters tall" (which makes no sense in context).
- The Fix: The inspector scans for typos, bad grammar, and formatting errors. If the text is messy, the robot chef might get confused. This step cleans up the language so the meaning is clear.
3. The "Physics Check" (Atomic Condition Error Detection)
- The Metaphor: A recipe that says, "Start with a square that has an area of -325 square meters."
- The Fix: The inspector checks the basic facts. In the real world, you can't have negative area. If a single fact in the recipe violates the laws of math (like a square circle or negative time), the card is rejected immediately.
4. The "Logic Puzzle" Check (Cross-Condition Conflict Detection)
- The Metaphor: A recipe that says, "Use 5 cups of flour" in step one, but then says, "You only have 2 cups of flour available" in step two.
- The Fix: Even if every single sentence makes sense on its own, they might fight each other. The inspector looks at the whole picture to ensure all the clues fit together logically without contradicting each other.
5. The "Missing Piece" Check (Condition Completeness Validation)
- The Metaphor: A recipe that says, "Mix the ingredients and bake," but forgets to tell you what the ingredients are or how long to bake.
- The Fix: The inspector asks, "Is there enough information to actually solve this?" If the question is missing a crucial number or rule, it's an "under-specified" problem and gets tossed out.
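The five checks above can be sketched as a sequential gate: a question must pass every stage, and it is rejected at the first failure. This is a minimal illustration, not the paper's implementation; the check functions here are hypothetical rule-based stand-ins for what are, in MathQ-Verify, LLM-based judgments.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    stage: str = ""    # which check produced this verdict
    reason: str = ""

# Toy stand-ins for the paper's LLM-based inspectors. Each takes the raw
# question text and returns True if it passes that stage. A real system
# would prompt a language model at each step instead.
def no_contamination(q: str) -> bool:
    # Stage 1: reject questions that leak an answer or contain meta-instructions.
    return "the answer is" not in q.lower()

def clean_language(q: str) -> bool:
    # Stage 2: placeholder for typo/grammar screening (flag doubled spaces).
    return "  " not in q

def atomic_conditions_valid(q: str) -> bool:
    # Stage 3: placeholder for impossible facts, e.g. a negative area.
    return "area of -" not in q.lower()

def conditions_consistent(q: str) -> bool:
    # Stage 4: cross-condition conflicts need real reasoning; pass-through here.
    return True

def conditions_complete(q: str) -> bool:
    # Stage 5: completeness also needs real reasoning; pass-through here.
    return True

PIPELINE = [
    ("contamination", no_contamination),
    ("language", clean_language),
    ("atomic_conditions", atomic_conditions_valid),
    ("cross_condition", conditions_consistent),
    ("completeness", conditions_complete),
]

def verify(question: str) -> Verdict:
    """Run the five checks in order; reject at the first failing stage."""
    for stage, check in PIPELINE:
        if not check(question):
            return Verdict(False, stage, f"failed {stage} check")
    return Verdict(True, "all", "passed every stage")
```

For example, `verify("A square has an area of -325 square meters. What is its side length?")` is rejected at the `atomic_conditions` stage, while an ordinary solvable word problem passes all five gates. The early-exit design mirrors the checkpoint metaphor: a card thrown out at stage 1 never reaches stage 2.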
The "Council of Judges" (Multi-Model Voting)
The paper also introduces a clever trick to make the inspector even smarter. Instead of relying on just one inspector (one AI model), they use a panel of 3 to 5 different inspectors.
- The Metaphor: Imagine a jury. If one juror says "Guilty" but the other four say "Not Guilty," you go with the majority.
- The Result: By having multiple AI models vote on whether a question is good or bad, the system becomes incredibly accurate. It reduces the chance of a "bad inspector" letting a broken recipe through. The paper shows this method can achieve 90% precision, meaning almost every question that passes is actually a valid, solvable math problem.
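The jury idea reduces to a simple majority vote over several judge models. Here is a minimal sketch; the judge functions are hypothetical stand-ins (in the paper, each judge would be a different LLM rendering a valid/invalid verdict on the question).

```python
from typing import Callable, List

def majority_vote(question: str, judges: List[Callable[[str], bool]]) -> bool:
    """Accept a question only if a strict majority of judges deem it valid."""
    votes = [judge(question) for judge in judges]
    return sum(votes) > len(votes) / 2
```

With five judges, a question survives only if at least three vote "valid", so one unreliable inspector can no longer let a broken recipe through on its own.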
Why This Matters
The authors created a new dataset called ValiMath (a library of 2,000+ math problems, some good and some deliberately broken) to test their system.
The Big Takeaway:
Before we can teach AI to be a genius at math, we have to stop feeding it garbage questions. MathQ-Verify is the filter that ensures the AI is learning from high-quality, logical, and solvable problems. It's the difference between training a chef with broken recipes and training them with perfect ones.
By cleaning up the "data diet" of these AI models, we help them become more reliable, less confused, and much better at solving real-world problems.