CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

CogFlow is a novel three-stage framework that enhances visual mathematical problem solving by introducing a knowledge internalization stage and specialized reward mechanisms to ensure extracted visual cues are faithfully integrated into reasoning, supported by a new high-quality dataset called MathCog.

Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan

Published 2026-02-25

Imagine you are trying to solve a tricky geometry puzzle from a picture. You look at the diagram, you think about the math rules, and you write down the answer.

For a long time, Artificial Intelligence (AI) models have been terrible at this. They are like brilliant mathematicians who are blindfolded. They can recite all the math formulas perfectly, but when they try to look at the picture, they often "hallucinate." They might see a line where there isn't one, or they might forget that a specific angle is 90 degrees. They get the math right, but the facts are wrong.

The CogFlow paper introduces a new way to teach AI to solve these visual math problems. Instead of just "looking and guessing," it teaches the AI to follow a three-stage, human-like process: See, Digest, and Solve.

Here is how it works, using some simple analogies:

1. The Problem: The "Reasoning Drift"

Imagine a detective trying to solve a crime.

  • Old AI: The detective looks at the crime scene photo, but their eyes are blurry. They think they see a red car, so they start building a whole theory about a red car. Even if the photo actually shows a blue truck, the detective keeps talking about the red car because they are confident in their wrong guess. This is called Reasoning Drift. The AI drifts away from the actual picture and starts making up facts.

2. The Solution: The Three-Stage "CogFlow" Process

The authors realized that to fix this, the AI needs to slow down and process the information in three distinct stages, just like a human does.

Stage 1: Perception (The "High-Res Scanner")

  • The Analogy: Instead of just glancing at the picture, the AI acts like a forensic scanner. It doesn't just say "I see a circle." It measures the circle's exact center coordinates and radius. It doesn't just say "I see a line." It calculates the exact start and end points.
  • The Innovation: They created a special "reward system" (Synergistic Visual Rewards) that acts like a strict teacher. If the AI guesses the circle's size wrong, the teacher gives it a "bad grade" immediately. This forces the AI to be incredibly precise about what is actually in the picture before it tries to solve anything.
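The paper's exact reward formula isn't reproduced here, but the idea of the "strict teacher" can be sketched as a score that decays the further a predicted shape drifts from the ground truth. Everything below (the function name, the exponential decay, the `tol` scale) is an illustrative assumption, not the authors' implementation:

```python
import math

def circle_reward(pred, truth, tol=5.0):
    """Score a predicted circle (cx, cy, r) against the ground truth.

    Returns close to 1.0 for a near-exact match and decays toward 0
    as the center or radius drifts beyond roughly `tol` pixels.
    The exponential decay is an illustrative choice, not the paper's formula.
    """
    center_err = math.dist(pred[:2], truth[:2])  # how far off the center is
    radius_err = abs(pred[2] - truth[2])         # how far off the radius is
    return math.exp(-(center_err + radius_err) / tol)

# A close guess earns a high reward; a sloppy one is penalized immediately.
good = circle_reward((100.0, 100.0, 50.0), (101.0, 99.0, 50.5))
bad = circle_reward((100.0, 100.0, 80.0), (101.0, 99.0, 50.5))
```

Because the gradient of "how wrong is the guess" is continuous rather than pass/fail, the model gets useful feedback even from near misses.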

Stage 2: Internalization (The "Digestive System")

  • The Analogy: This is the most important new part. Imagine you eat a meal (the raw data from the scanner). Before you can write a report about the meal, you have to digest it. You can't just vomit the food back up and say "I ate a burger." You have to turn that food into energy and nutrients.
  • The Innovation: In this stage, the AI takes those raw numbers (coordinates, lines) and turns them into structured knowledge. It says, "Okay, I have these coordinates. Because of these coordinates, I know for a fact that 'Line A is the diameter of Circle B'."
  • Why it matters: This prevents the AI from forgetting the facts later. It "locks in" the visual evidence so it can't drift away from it. They call this the Knowledge Internalization Reward. It's like a checkpoint where the AI must prove it understands the picture before it's allowed to do the math.
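The "digestion" step — turning raw coordinates into a fact like "Line A is the diameter of Circle B" — can be sketched as a geometric check plus a structured fact. The function names, fact string, and tolerance below are hypothetical stand-ins for whatever schema the paper actually uses:

```python
import math

def is_diameter(p1, p2, center, radius, tol=1e-6):
    """True iff segment p1-p2 is a diameter of the circle: both
    endpoints lie on the circle, and the segment's midpoint is the center."""
    on_circle = all(abs(math.dist(p, center) - radius) < tol for p in (p1, p2))
    midpoint = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
    return on_circle and math.dist(midpoint, center) < tol

def internalize(segment, circle):
    """Turn raw scanner output (coordinates) into a structured fact the
    reasoner can cite later. Returns None if no fact can be derived."""
    (p1, p2), (center, radius) = segment, circle
    if is_diameter(p1, p2, center, radius):
        return "segment is a diameter"  # the locked-in visual fact
    return None
```

Once a fact like this is emitted and checked, the later math steps can reference the fact itself instead of re-guessing from the pixels — which is exactly what prevents the drift.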

Stage 3: Reasoning (The "Math Solver")

  • The Analogy: Now that the AI has "digested" the picture and has a clear, structured list of facts, it can finally do the math.
  • The Innovation: They added a Visual Gate. Imagine a security guard at the door of the math room. If the AI tries to enter with a bad description of the picture (e.g., "The circle is huge" when it's actually small), the guard stops it and says, "Go back and re-scan the picture!" The AI has to try again until it gets the "digestion" right. This ensures the final math solution is built on solid ground.
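The security-guard analogy maps naturally onto a threshold check with a retry loop. The real Visual Gate presumably operates on the model's outputs during training or generation; this is only a sketch of the control flow the analogy describes, with an assumed threshold of 0.8 and stand-in `perceive`/`reason` callables:

```python
def visual_gate(perception_score, threshold=0.8):
    """Let reasoning proceed only if the picture description scores
    above the threshold. The 0.8 value is an illustrative assumption."""
    return perception_score >= threshold

def solve(problem, perceive, reason, max_retries=3):
    """Perceive -> gate -> reason. If the description of the picture is
    too poor, the guard sends the model back to re-scan instead of
    letting it do math on top of a bad perception."""
    for _ in range(max_retries):
        facts, score = perceive(problem)
        if visual_gate(score):
            return reason(facts)
    raise RuntimeError("perception never passed the visual gate")
```

The key design point the analogy captures: reasoning is conditioned on perception passing a quality bar, so a confident-but-wrong description can never silently become the foundation of the final answer.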

3. The New "Textbook" (MathCog Dataset)

To teach the AI this new way of thinking, the researchers didn't just give it old math problems. They created a new textbook called MathCog.

  • The Analogy: Old textbooks just had the question and the answer. This new textbook has three columns:
    1. What I saw: (The raw measurements).
    2. What I understood: (The internalized facts).
    3. The Solution: (The math).
  • By training on this, the AI learns that it must connect the "What I saw" directly to the "Solution."
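A single "three-column" training record might look roughly like the sketch below. The field names and schema are illustrative assumptions; the released MathCog format may differ:

```python
# One hypothetical training record with the three aligned fields:
# raw perception, internalized facts, and the final solution.
example = {
    "question": "AB passes through the center O of a circle, with A, B, "
                "and C on the circle. Find angle ACB.",
    "perception": {  # column 1: "What I saw" (raw measurements)
        "circle": {"center": [50, 50], "radius": 50},
        "segment_AB": {"start": [0, 50], "end": [100, 50]},
    },
    "internalized_facts": [  # column 2: "What I understood"
        "AB passes through center O, so AB is a diameter",
        "angle ACB is inscribed in the semicircle over AB",
    ],
    "solution": (  # column 3: "The Solution" (the math)
        "An angle inscribed in a semicircle is a right angle "
        "(Thales' theorem), so angle ACB = 90 degrees."
    ),
}
```

Training on records shaped like this supervises all three stages at once, so the model learns that the solution must be traceable back through the facts to the measurements.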

The Result

When they tested this new system:

  • It stopped making up facts about the pictures.
  • It solved complex geometry problems much better than previous AI models, even beating some of the giant, expensive "black box" models from big tech companies.
  • It became more reliable, like a student who double-checks their notes before taking a test.

In summary: CogFlow teaches AI to stop guessing and start verifying. It forces the AI to measure the picture, digest the facts, and only then do the math, ensuring the answer is actually grounded in reality.
