Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm

This paper systematically reviews recent advances in multimodal mathematical reasoning through a unified Perception-Alignment-Reasoning paradigm, organizing existing approaches around four fundamental questions (information extraction, representation, reasoning, and evaluation) and outlining open research challenges.

Tianyu Yang, Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao, Minhao Cheng, Arman Cohan, Xiangliang Zhang

Published 2026-03-10

Imagine you are trying to solve a complex math problem, but instead of just reading numbers on a page, you are looking at a messy construction site. You have a blueprint (the text), a pile of bricks and beams (the image), and a list of instructions (the question).

For a long time, AI models were like students who could only read the instructions but were terrible at looking at the construction site. They would guess the answer based on the words alone, often ignoring the fact that a beam was actually broken or a measurement was wrong.

This paper introduces a new way to teach AI how to solve these "multimodal" math problems (problems that mix pictures and words). The authors call their new system the PAR Framework, which stands for Perception, Alignment, and Reasoning.

Think of solving a math problem with a picture as a three-step relay race:

1. Perception: The "Eagle Eye"

The Problem: Current AI models often have "bad eyesight." They might look at a graph and think a line is going up when it's actually going down, or they might miss a tiny "x" in a diagram.
The Solution: This step is about training the AI to be a super-observant detective. It needs to look at the image and say, "Okay, I see a red circle here, a blue line there, and the numbers on the side go from 0 to 10."

  • Analogy: Imagine you are a chef. Before you can cook, you must look at your ingredients. If you think you have salt but it's actually sugar, your dish will fail. "Perception" is the chef carefully checking every single ingredient to make sure they are real and correctly identified.

2. Alignment: The "Translator"

The Problem: Even if the AI sees the picture correctly, it doesn't know how to talk to its "math brain." The picture says "a long line," but the math brain needs to hear "length = 5." If they don't speak the same language, the AI gets confused.
The Solution: This step is about translation. The AI takes what it saw (the visual facts) and turns them into a language the math engine understands (like code, formulas, or logical steps).

  • Analogy: Think of this like a diplomat at a peace treaty. One side speaks "Image," and the other speaks "Math." The diplomat (Alignment) has to take the Image side's request ("The bridge is too short") and translate it into the Math side's language ("We need to add 3 meters to the span"). If the translation is wrong, the treaty fails.

3. Reasoning: The "Engineer"

The Problem: Once the AI has the facts and the translation, it has to actually solve the problem. Old models often just guessed the answer or took a shortcut.
The Solution: This step forces the AI to build the solution step-by-step, checking its work as it goes. It's not just about getting the right number; it's about proving how it got there.

  • Analogy: Imagine an architect building a skyscraper. They don't just wave a magic wand and hope the building stands. They lay one brick, check if it's level, lay the next, and check again. "Reasoning" is the architect double-checking every brick to ensure the tower doesn't fall over.
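The three-step relay above can be sketched as a minimal pipeline. Everything here is illustrative: the function names, the fact format, and the toy rectangle problem are my assumptions, not the paper's actual implementation.

```python
# An illustrative sketch of a Perception -> Alignment -> Reasoning pipeline.
# All names and data formats are hypothetical; the PAR paradigm describes
# these stages conceptually, not this exact code.

def perceive(image_description: str) -> dict:
    """Perception: extract structured visual facts (stubbed with a toy result)."""
    # Pretend we "saw" a rectangle with labeled sides in the image.
    return {"shape": "rectangle", "width": 4, "height": 3}

def align(facts: dict) -> dict:
    """Alignment: translate visual facts into symbols the math engine understands."""
    return {"w": facts["width"], "h": facts["height"], "goal": "area"}

def reason(symbols: dict) -> tuple[int, list[str]]:
    """Reasoning: solve step by step, recording the work as it goes."""
    steps = [
        f"Identified a rectangle with w={symbols['w']}, h={symbols['h']}",
        "Area of a rectangle is w * h",
        f"Area = {symbols['w']} * {symbols['h']} = {symbols['w'] * symbols['h']}",
    ]
    return symbols["w"] * symbols["h"], steps

facts = perceive("a rectangle, 4 wide and 3 tall")
answer, work = reason(align(facts))
print(answer)  # 12
```

If Perception misreads the width, every later stage inherits the error, which is exactly why the paper treats these as distinct stages worth studying (and grading) separately.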

The New Rulebook: APE

The authors also realized that we were grading these AI students unfairly. We used to only check if their final answer was right (like a teacher just looking at the "A" or "F" on a test). But if the student got the right answer by guessing, they didn't really learn.

The paper proposes a new grading system called APE:

  • Answer: Did you get the right number? (The traditional test).
  • Process: Did you show your work? Did you make sense in the middle steps? (Did you use the right bricks?).
  • Executable: Can we run your solution like a computer program to prove it works? (Can we actually build the tower?).
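As a rough sketch, the three APE checks could look like the toy grader below. The scoring scheme and the executable check (running the model's solution as Python) are assumptions for illustration; the paper defines APE as evaluation criteria, not as this code.

```python
# A toy illustration of APE-style grading: check the Answer, the Process
# (intermediate steps), and whether the solution is Executable. This scoring
# scheme is hypothetical, not the paper's exact protocol.

def grade(solution_code: str, shown_steps: list[str], expected: float) -> dict:
    report = {"answer": False, "process": False, "executable": False}

    # Executable: can we actually run the model's solution?
    scope: dict = {}
    try:
        exec(solution_code, scope)
        report["executable"] = True
    except Exception:
        return report  # nothing to grade if it doesn't run

    # Answer: does the computed result match the expected value?
    report["answer"] = scope.get("result") == expected

    # Process: did the solution show non-empty intermediate work?
    report["process"] = bool(shown_steps) and all(s.strip() for s in shown_steps)

    return report

report = grade(
    solution_code="w, h = 4, 3\nresult = w * h",
    shown_steps=["w = 4, h = 3", "area = w * h = 12"],
    expected=12,
)
print(report)  # {'answer': True, 'process': True, 'executable': True}
```

A solution that merely guesses "12" with no steps would pass the Answer check but fail Process; one whose code crashes fails Executable, no matter how the number was reached.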

Why Does This Matter?

Right now, AI is getting really good at math, but it's still "hallucinating" (making things up) when pictures are involved. This paper provides a roadmap to fix that.

By breaking the problem down into Seeing (Perception), Translating (Alignment), and Building (Reasoning), and by grading the AI on its process (the intermediate steps) and its proof (whether the solution actually runs), we can build AI that doesn't just guess the answer, but actually understands the world around it.

In short: This paper is the instruction manual for teaching AI to stop guessing and start truly "seeing" and "thinking" about math problems that involve pictures.