Composition-Grounded Data Synthesis for Visual Reasoning

This paper introduces COGS, a data-efficient framework that synthesizes large-scale reasoning datasets by decomposing seed questions into primitive factors and recomposing them with new images, thereby significantly enhancing the visual reasoning capabilities of multi-modal large language models in annotation-scarce domains like charts and webpages.

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

Published 2026-03-05

Here is an explanation of the paper "Composition-Grounded Data Synthesis for Visual Reasoning" (COGS) using simple language and creative analogies.

The Big Problem: The "Smart but Stuck" Robot

Imagine you have a very smart robot (an AI) that can see pictures and read text. It's great at simple things like "What color is this car?" or "Read this sign."

However, when you ask it a tricky question like, "If the blue bar in this chart grows by 10%, and the red bar shrinks by 5%, which one is bigger now?", the robot often gets confused. It tries to guess the answer directly without doing the math step-by-step.

The problem is that we don't have enough "practice tests" for these tricky questions. Creating thousands of human-written questions with step-by-step answers for charts and webpages is expensive and slow. It's like trying to prepare a student for a math exam when you only have five practice problems to work with.

The Solution: COGS (The "Lego" Teacher)

The authors created a new method called COGS. Think of COGS as a master teacher who doesn't just give the student more practice tests; instead, it teaches the student how to build their own practice tests.

Here is how COGS works, broken down into three simple steps:

1. The "Deconstruction" (Taking the Lego apart)

Imagine you have a complex Lego castle (a hard question).

  • Old Way: You just show the student the whole castle and say, "Build this."
  • COGS Way: The teacher takes the castle apart. They separate it into its basic bricks: a "window brick," a "door brick," and a "roof brick."
  • In the Paper: The AI takes a hard question (e.g., "Calculate the difference in growth between two countries") and breaks it down into tiny, simple steps called Factors.
    • Factor 1: Read the number for Country A.
    • Factor 2: Read the number for Country B.
    • Factor 3: Subtract B from A.

2. The "Reconstruction" (Building new castles)

Now, the teacher has a box full of these basic "bricks" (Factors). They grab a new picture (a new chart or webpage) that the AI has never seen before.

  • They pick up the "Read Number" brick and stick it onto the new picture.
  • They pick up the "Subtract" brick and stick it on top of that.
  • Result: They instantly create a brand new, unique question based on the new picture, using the same logic as the old one.

The Analogy: It's like having a recipe for a chocolate cake. Instead of baking the same cake 1,000 times, you take the recipe (the factors: flour, sugar, eggs) and apply it to a new set of ingredients (a different chart) to bake a "Strawberry Cake" or a "Blueberry Cake." You can make infinite variations without needing a new recipe for every single cake.
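The "recipe" idea can be sketched as code too. Again, this is a hedged illustration under the assumption that a factor chain can be applied to any chart with matching labels; the function names are ours, not the paper's.

```python
# Illustrative sketch of "recomposition": reuse a seed question's
# factor chain (read, read, combine) on a chart the model has never
# seen, minting a brand-new question-answer pair automatically.

def recompose(combine, new_chart, labels):
    """Apply a seed question's factor chain to a fresh chart."""
    values = [new_chart[label] for label in labels]          # "read" factors
    question = (f"What is the difference between "
                f"{labels[0]} and {labels[1]}?")
    answer = combine(*values)                                # "combine" factor
    return question, answer

# Same logic, new "image":
new_chart = {"Product X": 40, "Product Y": 15}
q, a = recompose(lambda x, y: x - y, new_chart,
                 ["Product X", "Product Y"])
print(q, a)  # a fresh practice problem, no human writing needed
```

One seed question plus many new images yields many new training examples, which is where the data efficiency comes from.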

3. The "Coach" (Rewarding the steps)

This is the secret sauce. When the AI practices these new questions, the teacher doesn't just check if the final answer is right or wrong.

  • Old Way: "You got the answer wrong. Try again." (The AI doesn't know where it messed up).
  • COGS Way: The teacher checks every single step. "You read the numbers correctly! Good job. But you subtracted them wrong. Let's fix that step."
  • In the Paper: They use a special reward system that gives points for getting the intermediate steps right, not just the final answer. This forces the AI to learn how to think logically, step-by-step.
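A step-level reward can be sketched as follows. This is a minimal illustration of the idea, with an assumed 50/50 split between step credit and final-answer credit; the paper's exact reward function may differ.

```python
# Hedged sketch of a factor-level reward: score each intermediate
# step, not just the final answer, so the model gets partial credit
# and can see where it went wrong. Weights here are assumptions.

def factor_reward(predicted_steps, gold_steps, step_weight=0.5):
    """Partial credit for correct intermediate steps,
    plus a bonus for the final answer."""
    n = len(gold_steps)
    correct = sum(p == g for p, g in zip(predicted_steps, gold_steps))
    step_score = step_weight * correct / n
    final_score = (1 - step_weight) * (predicted_steps[-1] == gold_steps[-1])
    return step_score + final_score

# Model read both numbers right but subtracted wrong:
gold = [120, 95, 25]
pred = [120, 95, 35]
print(factor_reward(pred, gold))  # partial credit instead of a flat zero
```

With an answer-only reward, this attempt would score 0 and teach the model nothing; with the step-level reward it earns credit for the two correct reads, pinpointing the subtraction as the step to fix.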

Why This Matters

The paper tested this on Charts (graphs) and Webpages (screenshots of websites).

  • The Result: The AI became much better at solving hard problems it had never seen before.
  • The Magic: It didn't just memorize the specific charts it practiced on. Because it learned the "building blocks" (the factors), it could apply those skills to completely new types of charts and websites. It's like learning the rules of chess; once you know the rules, you can play against any opponent, not just the ones you practiced with.

Summary in a Nutshell

COGS is a smart way to teach AI how to reason. Instead of feeding it millions of pre-written questions, it:

  1. Breaks hard questions into tiny, simple steps.
  2. Mixes and matches those steps with new images to create endless new practice problems.
  3. Rewards the AI for getting the steps right, ensuring it learns how to think, not just what to guess.

It turns a small handful of examples into a massive, diverse gym for the AI to train its brain.