Composition-Grounded Data Synthesis for Visual Reasoning

This paper introduces COGS, a data-efficient framework that synthesizes large-scale reasoning datasets by decomposing seed questions into primitive factors and recomposing them with new images, thereby significantly enhancing the visual reasoning capabilities of multi-modal large language models in annotation-scarce domains like charts and webpages.

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

Published 2026-03-05

Here is an explanation of the paper "Composition-Grounded Data Synthesis for Visual Reasoning" (COGS) using simple language and creative analogies.

The Big Problem: The "Smart but Stuck" Robot

Imagine you have a very smart robot (an AI) that can see pictures and read text. It's great at simple things like "What color is this car?" or "Read this sign."

However, when you ask it a tricky question like, "If the blue bar in this chart grows by 10%, and the red bar shrinks by 5%, which one is bigger now?", the robot often gets confused. It tries to guess the answer directly without doing the math step-by-step.

The problem is that we don't have enough "practice tests" for these tricky questions. Creating thousands of human-written questions with step-by-step answers for charts and webpages is expensive and slow. It's like trying to prepare a student for a math exam when you only have five practice problems to work with.

The Solution: COGS (The "Lego" Teacher)

The authors created a new method called COGS. Think of COGS as a master teacher who doesn't just give the student more practice tests; instead, it teaches the student how to build their own practice tests.

Here is how COGS works, broken down into three simple steps:

1. The "Deconstruction" (Taking the Lego apart)

Imagine you have a complex Lego castle (a hard question).

  • Old Way: You just show the student the whole castle and say, "Build this."
  • COGS Way: The teacher takes the castle apart. They separate it into its basic bricks: a "window brick," a "door brick," and a "roof brick."
  • In the Paper: The AI takes a hard question (e.g., "Calculate the difference in growth between two countries") and breaks it down into tiny, simple steps called Factors.
    • Factor 1: Read the number for Country A.
    • Factor 2: Read the number for Country B.
    • Factor 3: Subtract B from A.

2. The "Reconstruction" (Building new castles)

Now, the teacher has a box full of these basic "bricks" (Factors). They grab a new picture (a new chart or webpage) that the AI has never seen before.

  • They pick up the "Read Number" brick and stick it onto the new picture.
  • They pick up the "Subtract" brick and stick it on top of that.
  • Result: They instantly create a brand new, unique question based on the new picture, using the same logic as the old one.

The Analogy: It's like having a recipe for a chocolate cake. Instead of baking the same cake 1,000 times, you take the recipe (the factors: flour, sugar, eggs) and apply it to a new set of ingredients (a different chart) to bake a "Strawberry Cake" or a "Blueberry Cake." You can make infinite variations without needing a new recipe for every single cake.
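The "recipe" idea can be sketched as code too. Again, this is a hedged illustration under the assumption that a factor chain can be applied to any chart with matching labels; the function names are ours, not the paper's.

```python
# Illustrative sketch of "recomposition": reuse a seed question's
# factor chain (read, read, combine) on a chart the model has never
# seen, minting a brand-new question-answer pair automatically.

def recompose(combine, new_chart, labels):
    """Apply a seed question's factor chain to a fresh chart."""
    values = [new_chart[label] for label in labels]          # "read" factors
    question = (f"What is the difference between "
                f"{labels[0]} and {labels[1]}?")
    answer = combine(*values)                                # "combine" factor
    return question, answer

# Same logic, new "image":
new_chart = {"Product X": 40, "Product Y": 15}
q, a = recompose(lambda x, y: x - y, new_chart,
                 ["Product X", "Product Y"])
print(q, a)  # a fresh practice problem, no human writing needed
```

One seed question plus many new images yields many new training examples, which is where the data efficiency comes from.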

3. The "Coach" (Rewarding the steps)

This is the secret sauce. When the AI practices these new questions, the teacher doesn't just check if the final answer is right or wrong.

  • Old Way: "You got the answer wrong. Try again." (The AI doesn't know where it messed up).
  • COGS Way: The teacher checks every single step. "You read the numbers correctly! Good job. But you subtracted them wrong. Let's fix that step."
  • In the Paper: They use a special reward system that gives points for getting the intermediate steps right, not just the final answer. This forces the AI to learn how to think logically, step-by-step.
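A step-level reward can be sketched as follows. This is a minimal illustration of the idea, with an assumed 50/50 split between step credit and final-answer credit; the paper's exact reward function may differ.

```python
# Hedged sketch of a factor-level reward: score each intermediate
# step, not just the final answer, so the model gets partial credit
# and can see where it went wrong. Weights here are assumptions.

def factor_reward(predicted_steps, gold_steps, step_weight=0.5):
    """Partial credit for correct intermediate steps,
    plus a bonus for the final answer."""
    n = len(gold_steps)
    correct = sum(p == g for p, g in zip(predicted_steps, gold_steps))
    step_score = step_weight * correct / n
    final_score = (1 - step_weight) * (predicted_steps[-1] == gold_steps[-1])
    return step_score + final_score

# Model read both numbers right but subtracted wrong:
gold = [120, 95, 25]
pred = [120, 95, 35]
print(factor_reward(pred, gold))  # partial credit instead of a flat zero
```

With an answer-only reward, this attempt would score 0 and teach the model nothing; with the step-level reward it earns credit for the two correct reads, pinpointing the subtraction as the step to fix.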

Why This Matters

The paper tested this on Charts (graphs) and Webpages (screenshots of websites).

  • The Result: The AI became much better at solving hard problems it had never seen before.
  • The Magic: It didn't just memorize the specific charts it practiced on. Because it learned the "building blocks" (the factors), it could apply those skills to completely new types of charts and websites. It's like learning the rules of chess; once you know the rules, you can play against any opponent, not just the ones you practiced with.

Summary in a Nutshell

COGS is a smart way to teach AI how to reason. Instead of feeding it millions of pre-written questions, it:

  1. Breaks hard questions into tiny, simple steps.
  2. Mixes and matches those steps with new images to create endless new practice problems.
  3. Rewards the AI for getting the steps right, ensuring it learns how to think, not just what to guess.

It turns a small handful of examples into a massive, diverse gym for the AI to train its brain.