Imagine you are teaching a robot how to clean your room. You show it how to pick up a sock and put it in the hamper. You show it how to pick up a book and put it on the shelf.
If the robot is truly smart, it should be able to figure out how to pick up a sock and then put it on the shelf, or pick up a book and then put it in the hamper, even if you never specifically showed it those exact combinations. This ability to mix and match learned skills to solve new problems is called compositional generalization. It's a superpower of human intelligence, but for Artificial Intelligence (AI), it's like trying to teach a dog to do calculus.
This paper introduces a new tool called COGITAO (a mouthful, so let's just call it the "Lego Lab") designed to test exactly how good AI is at this kind of mixing-and-matching.
The Problem: AI is a "Pattern Matcher," Not a "Thinker"
Current AI models (like the ones powering chatbots or image generators) are incredibly good at memorizing patterns. If you show them a million pictures of a cat, they can spot a cat. But if you ask them to do something slightly new—like "rotate the cat and then make it blue"—they often get confused. They tend to just guess based on what they've seen before, rather than actually understanding the rules of how to combine actions.
The researchers wanted a way to test this without the messiness of the real world (like bad lighting or messy rooms). They needed a clean, controlled environment.
The Solution: The "Lego Lab" (COGITAO)
Think of COGITAO as a giant, infinite digital sandbox made of grids (like graph paper).
- The Objects: Instead of real cats or cars, the AI sees simple shapes (squares, circles, weird blobs) made of colored pixels.
- The Actions: The researchers created a "toolbox" of 28 simple moves. You can rotate a shape, move it up, flip it, change its color, or cut a piece off.
- The Game: The AI is given an "Input Grid" (a picture of shapes) and a "Rule" (a list of moves, like "Rotate 90 degrees, then move up"). It has to draw the "Output Grid" (what the picture looks like after the moves).
The magic of COGITAO is that it can generate millions of unique puzzles. It can make the rules easy (just move one shape) or incredibly hard (rotate three shapes, flip two, and change the colors of all of them in a specific order).
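The grid-plus-moves setup can be sketched in a few lines of Python. Everything here is illustrative (the function names, the "0 means empty" convention, the 4x4 board), not the benchmark's actual code, but it captures the idea: a rule is just an ordered list of moves applied to an input grid.

```python
import numpy as np

def rotate_90(grid):
    # Rotate the whole grid 90 degrees counter-clockwise.
    return np.rot90(grid)

def move_up(grid):
    # Shift everything up one row, padding the bottom with 0 (empty).
    return np.vstack([grid[1:], np.zeros((1, grid.shape[1]), dtype=grid.dtype)])

def apply_rule(grid, rule):
    # A "rule" is an ordered list of moves, applied left to right.
    for move in rule:
        grid = move(grid)
    return grid

# Input grid: a single 2-pixel "shape" of color 3 on a 4x4 board.
inp = np.zeros((4, 4), dtype=int)
inp[2:4, 1] = 3

# The AI's job: given `inp` and the rule "rotate 90, then move up",
# produce this output grid.
out = apply_rule(inp, [rotate_90, move_up])
```

Generating a new puzzle is then just a matter of sampling a fresh input grid and a fresh list of moves, which is how the benchmark can produce millions of unique tasks.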
The Experiment: Testing the AI's Brain
The researchers took several state-of-the-art AI models (some designed for vision, some for language, some designed to "think" step-by-step) and put them through the COGITAO Lab.
They ran two main types of tests:
The "Mix-and-Match" Test (Compositional Generalization):
- Training: The AI learns to "Rotate" and "Move Up" separately. It also learns to do "Rotate then Move Up."
- The Test: The AI is asked to do "Move Up then Rotate."
- The Result: Even though the AI knew both moves perfectly, it failed miserably when asked to swap the order. It was like a chef who knows how to chop onions and how to fry eggs but, when asked to fry an egg and then chop an onion, ends up frying the onion and chopping the egg.
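Why does the order matter so much? Because these moves don't commute: applying the same two moves in a different order genuinely produces a different output grid, so "Move Up then Rotate" is a new task, not a rephrasing of an old one. A tiny sketch (illustrative coordinate math for a single pixel, not the benchmark's code) makes this concrete:

```python
N = 4  # grid size

def rotate_90(p):
    # Rotate a pixel at (row, col) 90 degrees counter-clockwise on an N x N grid.
    r, c = p
    return (N - 1 - c, r)

def move_up(p):
    # Move a pixel up by one row.
    r, c = p
    return (r - 1, c)

start = (3, 0)  # a pixel in the bottom-left corner

a = move_up(rotate_90(start))  # "Rotate, then Move Up"
b = rotate_90(move_up(start))  # "Move Up, then Rotate"
```

The two results land on different cells, so a model that merely memorized "Rotate then Move Up" has no correct answer to copy when the order is flipped.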
The "New Environment" Test (Systematic Generalization):
- Training: The AI learns to move shapes on a small 10x10 grid with 2 shapes.
- The Test: The AI is asked to move shapes on a huge 20x20 grid with 10 shapes.
- The Result: The AI got confused. It couldn't scale its logic up. It was like teaching someone to drive in a parking lot, then expecting them to drive a Formula 1 car on a highway immediately.
The Big Surprise: "Stubbornness"
The paper found something fascinating about how the AI failed. They called it "Stubbornness."
When the AI faced a new, tricky puzzle, it didn't try to figure out the new rules. Instead, it just ignored the new instructions and did what it was trained to do most often.
- Example: If the AI was trained mostly on "Move Right," and you asked it to "Move Left," it would often just "Move Right" anyway. It was too lazy to learn the new rule and just defaulted to its old habit.
Why Does This Matter?
You might think, "So what? It's just a grid game."
But this is a huge deal for the future of AI.
- Real-World Robots: If a robot can't learn to "open the fridge" and "get the milk" separately and then combine them to "get the milk from the fridge," it will never be useful in a real house.
- True Intelligence: Humans can learn a few basic concepts and combine them in infinite ways. Current AI is stuck in a loop of memorization. COGITAO proves that simply making AI bigger or training it on more data isn't the answer. We need to build AI that actually understands how to combine ideas, not just how to copy them.
The Takeaway
COGITAO is like a stress test for the human-like reasoning of AI. It shows us that while our current AI is a brilliant memorizer, it is still a terrible "combinator." It can't easily mix and match its skills to solve new problems.
The paper concludes that until we can build AI that passes the COGITAO test, we are still far from creating machines that truly think like humans. We are building smart parrots that can repeat what they've heard, but we haven't yet built the thinkers who can write their own songs.