CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

CogFlow is a novel three-stage framework that enhances visual mathematical problem solving by introducing a knowledge internalization stage and specialized reward mechanisms to ensure extracted visual cues are faithfully integrated into reasoning, supported by a new high-quality dataset called MathCog.

Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan

Published 2026-02-25

Imagine you are trying to solve a tricky geometry puzzle from a picture. You look at the diagram, you think about the math rules, and you write down the answer.

For a long time, Artificial Intelligence (AI) models have been terrible at this. They are like brilliant mathematicians who are blindfolded. They can recite all the math formulas perfectly, but when they try to look at the picture, they often "hallucinate." They might see a line where there isn't one, or they might forget that a specific angle is 90 degrees. They get the math right, but the facts are wrong.

The CogFlow paper introduces a new way to teach AI to solve these visual math problems. Instead of just "looking and guessing," it teaches the AI to follow a three-stage, human-like process: See, Digest, and Solve.

Here is how it works, using some simple analogies:

1. The Problem: The "Reasoning Drift"

Imagine a detective trying to solve a crime.

  • Old AI: The detective looks at the crime scene photo, but their eyes are blurry. They think they see a red car, so they start building a whole theory about a red car. Even if the photo actually shows a blue truck, the detective keeps talking about the red car because they are confident in their wrong guess. This is called Reasoning Drift. The AI drifts away from the actual picture and starts making up facts.

2. The Solution: The Three-Stage "CogFlow" Process

The authors realized that to fix this, the AI needs to slow down and process the information in three distinct stages, just like a human does.

Stage 1: Perception (The "High-Res Scanner")

  • The Analogy: Instead of just glancing at the picture, the AI acts like a forensic scanner. It doesn't just say "I see a circle." It measures the circle's exact center coordinates and radius. It doesn't just say "I see a line." It calculates the exact start and end points.
  • The Innovation: They created a special "reward system" (Synergistic Visual Rewards) that acts like a strict teacher. If the AI guesses the circle's size wrong, the teacher gives it a "bad grade" immediately. This forces the AI to be incredibly precise about what is actually in the picture before it tries to solve anything.
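The paper's exact reward formula isn't reproduced here, but the idea of the "strict teacher" can be sketched as a score that decays the further a predicted shape drifts from the ground truth. Everything below (the function name, the exponential decay, the `tol` scale) is an illustrative assumption, not the authors' implementation:

```python
import math

def circle_reward(pred, truth, tol=5.0):
    """Score a predicted circle (cx, cy, r) against the ground truth.

    Returns close to 1.0 for a near-exact match and decays toward 0
    as the center or radius drifts beyond roughly `tol` pixels.
    The exponential decay is an illustrative choice, not the paper's formula.
    """
    center_err = math.dist(pred[:2], truth[:2])  # how far off the center is
    radius_err = abs(pred[2] - truth[2])         # how far off the radius is
    return math.exp(-(center_err + radius_err) / tol)

# A close guess earns a high reward; a sloppy one is penalized immediately.
good = circle_reward((100.0, 100.0, 50.0), (101.0, 99.0, 50.5))
bad = circle_reward((100.0, 100.0, 80.0), (101.0, 99.0, 50.5))
```

Because the gradient of "how wrong is the guess" is continuous rather than pass/fail, the model gets useful feedback even from near misses.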

Stage 2: Internalization (The "Digestive System")

  • The Analogy: This is the most important new part. Imagine you eat a meal (the raw data from the scanner). Before you can write a report about the meal, you have to digest it. You can't just vomit the food back up and say "I ate a burger." You have to turn that food into energy and nutrients.
  • The Innovation: In this stage, the AI takes those raw numbers (coordinates, lines) and turns them into structured knowledge. It says, "Okay, I have these coordinates. Because of these coordinates, I know for a fact that 'Line A is the diameter of Circle B'."
  • Why it matters: This prevents the AI from forgetting the facts later. It "locks in" the visual evidence so it can't drift away from it. They call this the Knowledge Internalization Reward. It's like a checkpoint where the AI must prove it understands the picture before it's allowed to do the math.
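The "digestion" step — turning raw coordinates into a fact like "Line A is the diameter of Circle B" — can be sketched as a geometric check plus a structured fact. The function names, fact string, and tolerance below are hypothetical stand-ins for whatever schema the paper actually uses:

```python
import math

def is_diameter(p1, p2, center, radius, tol=1e-6):
    """True iff segment p1-p2 is a diameter of the circle: both
    endpoints lie on the circle, and the segment's midpoint is the center."""
    on_circle = all(abs(math.dist(p, center) - radius) < tol for p in (p1, p2))
    midpoint = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
    return on_circle and math.dist(midpoint, center) < tol

def internalize(segment, circle):
    """Turn raw scanner output (coordinates) into a structured fact the
    reasoner can cite later. Returns None if no fact can be derived."""
    (p1, p2), (center, radius) = segment, circle
    if is_diameter(p1, p2, center, radius):
        return "segment is a diameter"  # the locked-in visual fact
    return None
```

Once a fact like this is emitted and checked, the later math steps can reference the fact itself instead of re-guessing from the pixels — which is exactly what prevents the drift.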

Stage 3: Reasoning (The "Math Solver")

  • The Analogy: Now that the AI has "digested" the picture and has a clear, structured list of facts, it can finally do the math.
  • The Innovation: They added a Visual Gate. Imagine a security guard at the door of the math room. If the AI tries to enter with a bad description of the picture (e.g., "The circle is huge" when it's actually small), the guard stops it and says, "Go back and re-scan the picture!" The AI has to try again until it gets the "digestion" right. This ensures the final math solution is built on solid ground.
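The security-guard analogy maps naturally onto a threshold check with a retry loop. The real Visual Gate presumably operates on the model's outputs during training or generation; this is only a sketch of the control flow the analogy describes, with an assumed threshold of 0.8 and stand-in `perceive`/`reason` callables:

```python
def visual_gate(perception_score, threshold=0.8):
    """Let reasoning proceed only if the picture description scores
    above the threshold. The 0.8 value is an illustrative assumption."""
    return perception_score >= threshold

def solve(problem, perceive, reason, max_retries=3):
    """Perceive -> gate -> reason. If the description of the picture is
    too poor, the guard sends the model back to re-scan instead of
    letting it do math on top of a bad perception."""
    for _ in range(max_retries):
        facts, score = perceive(problem)
        if visual_gate(score):
            return reason(facts)
    raise RuntimeError("perception never passed the visual gate")
```

The key design point the analogy captures: reasoning is conditioned on perception passing a quality bar, so a confident-but-wrong description can never silently become the foundation of the final answer.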

3. The New "Textbook" (MathCog Dataset)

To teach the AI this new way of thinking, the researchers didn't just give it old math problems. They created a new textbook called MathCog.

  • The Analogy: Old textbooks just had the question and the answer. This new textbook has three columns:
    1. What I saw: (The raw measurements).
    2. What I understood: (The internalized facts).
    3. The Solution: (The math).
  • By training on this, the AI learns that it must connect the "What I saw" directly to the "Solution."
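A single "three-column" training record might look roughly like the sketch below. The field names and schema are illustrative assumptions; the released MathCog format may differ:

```python
# One hypothetical training record with the three aligned fields:
# raw perception, internalized facts, and the final solution.
example = {
    "question": "AB passes through the center O of a circle, with A, B, "
                "and C on the circle. Find angle ACB.",
    "perception": {  # column 1: "What I saw" (raw measurements)
        "circle": {"center": [50, 50], "radius": 50},
        "segment_AB": {"start": [0, 50], "end": [100, 50]},
    },
    "internalized_facts": [  # column 2: "What I understood"
        "AB passes through center O, so AB is a diameter",
        "angle ACB is inscribed in the semicircle over AB",
    ],
    "solution": (  # column 3: "The Solution" (the math)
        "An angle inscribed in a semicircle is a right angle "
        "(Thales' theorem), so angle ACB = 90 degrees."
    ),
}
```

Training on records shaped like this supervises all three stages at once, so the model learns that the solution must be traceable back through the facts to the measurements.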

The Result

When they tested this new system:

  • It stopped making up facts about the pictures.
  • It solved complex geometry problems much better than previous AI models, even beating some of the giant, expensive "black box" models from big tech companies.
  • It became more reliable, like a student who double-checks their notes before taking a test.

In summary: CogFlow teaches AI to stop guessing and start verifying. It forces the AI to measure the picture, digest the facts, and only then do the math, ensuring the answer is actually grounded in reality.
