Imagine you are trying to teach a robot to solve a puzzle game called ARC (Abstraction and Reasoning Corpus). In this game, you show the robot a few pictures of a grid with colored squares, along with the "answer" grid, and ask it to figure out the rule to solve a brand new, unseen puzzle.
The paper you shared describes a new way to teach this robot, which the authors call Compositional Neuro-Symbolic Reasoning.
Here is the simple explanation, using a few analogies to make it click.
The Problem: Two Flawed Approaches
Before this new method, there were two main ways to try to solve these puzzles, and both had big holes in them:
The "Giant Brain" Approach (Pure Neural/LLMs):
Imagine a student who has read every book in the library but has never actually practiced math. They are great at guessing patterns based on what they have seen before. If you show them a puzzle, they might guess the answer by saying, "I've seen this color before, so I'll guess blue."
- The Flaw: They are good at guessing but bad at logic. If the puzzle requires a specific, step-by-step rule (like "move the red square two steps right, then turn it blue"), they often get confused or make up rules that don't actually work. They rely on "vibes" rather than strict logic.
The "Strict Accountant" Approach (Pure Symbolic):
Imagine a robot that is incredibly logical but has no eyes. It knows the rules of math perfectly but can't tell a red square from a blue circle. It tries to solve the puzzle by writing down every single possible rule in the universe and checking them one by one.
- The Flaw: There are too many rules! It takes forever (or never finishes) because it is checking every possibility, and it gets stuck if it can't "see" the objects clearly in the first place.
The Solution: The "Architect and the Foreman" Team
The authors propose a Neuro-Symbolic system. Think of this as a construction site with two distinct roles working together:
1. The Foreman (The "Neural" Part)
- Role: This is the "eyes" of the system. It looks at the messy grid of colored squares and says, "Okay, I see a red square here, a blue line there, and a hole in the middle."
- What it does: It breaks the messy picture down into clean, named objects (like "Red Square," "Blue Line"). It doesn't try to solve the puzzle yet; it just organizes the scene so the next person can understand it.
- Analogy: It's like a translator who turns a messy scribble into a clear, typed sentence.
2. The Architect (The "Symbolic" Part)
- Role: This is the "brain" with a strict rulebook. It doesn't look at pixels; it looks at the objects the Foreman found.
- The Rulebook (DSL): The Architect has a small, fixed list of 22 "atomic moves" it is allowed to make. Think of these like LEGO bricks. You can only build with these specific bricks:
- Move Brick A to the right.
- Fill a hole with color B.
- Connect two bricks with a bridge.
- Rotate the whole structure.
- What it does: Instead of guessing, the Architect looks at the examples and asks, "Which combination of these 22 LEGO moves turns the 'Input' into the 'Output'?" It tests these combinations strictly.
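The Architect's search can be sketched in a few lines. The three primitives below are hypothetical stand-ins, not the paper's actual 22-operation DSL; the point is the enumeration of short compositions of a small, fixed set of moves:

```python
import numpy as np
from itertools import product

# Illustrative "atomic moves" -- NOT the paper's real 22-op DSL,
# just three hypothetical primitives to show the search.
def shift_right(grid):
    return np.roll(grid, 1, axis=1)

def recolor(grid, old=1, new=2):
    out = grid.copy()
    out[out == old] = new
    return out

def rotate(grid):
    return np.rot90(grid, -1)

PRIMITIVES = [shift_right, recolor, rotate]

def find_programs(inp, out, max_depth=2):
    """Enumerate all compositions of primitives (up to max_depth)
    that turn the input grid into the output grid."""
    solutions = []
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            grid = inp
            for op in combo:
                grid = op(grid)
            if np.array_equal(grid, out):
                solutions.append([op.__name__ for op in combo])
    return solutions

inp = np.array([[1, 0], [0, 0]])
out = np.array([[0, 2], [0, 0]])  # the colored cell moved right and recolored
print(find_programs(inp, out))
```

Notice that even this toy search finds several distinct programs that explain a single example (shift-then-recolor and recolor-then-shift both work), which is exactly why checking candidates against the other examples matters.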
The Secret Sauce: The "Group Consensus" Filter
Here is where the magic happens. The system doesn't just pick one guess.
- The Proposal: The Foreman and Architect work together to generate a list of possible rules that explain the first example.
- The Consistency Check: The system then takes those rules and tests them on the other examples.
- Analogy: Imagine a detective with three competing theories about a mystery. Each theory explains the first clue, but the detective checks every theory against every witness statement. Any theory that contradicts even one witness gets crossed off the list.
- The Winner: The system keeps only the rules that work for ALL the examples perfectly. If a rule works for Example 1 but fails Example 2, it is thrown out. This ensures the rule is truly general and not just a lucky guess.
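The consensus filter itself fits in a few lines. The candidate rules below are hypothetical lambdas standing in for real DSL programs; only the filtering logic is the point:

```python
# A minimal sketch of the "group consensus" filter: a candidate rule
# survives only if it reproduces the output on EVERY training example.
train_pairs = [
    ([1, 2, 3], [2, 3, 4]),
    ([5, 5], [6, 6]),
]

# Hypothetical candidate rules (stand-ins for DSL programs).
candidates = {
    "add_one": lambda xs: [x + 1 for x in xs],
    "double": lambda xs: [x * 2 for x in xs],            # fails both pairs
    "succ_of_first": lambda xs: [xs[0] + 1] * len(xs),   # fits pair 2 only
}

def consensus(candidates, pairs):
    """Keep only the rules that are correct on all examples."""
    return [name for name, rule in candidates.items()
            if all(rule(inp) == out for inp, out in pairs)]

print(consensus(candidates, train_pairs))  # -> ['add_one']
```

A rule that happens to fit one example but fails another (like `succ_of_first` above) is exactly the "lucky guess" the filter is designed to throw out.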
Why is this better?
- No Brute Force: The Architect doesn't check every rule in the universe; it only checks short combinations of the 22 specific "LEGO moves." This keeps the search fast.
- No Hallucinations: Because the Foreman organizes the objects first, the Architect isn't confused by messy pixels.
- Strict Logic: By forcing the rule to work on every example, the system avoids the "lucky guess" problem that plagues AI.
The Results
When they tested this on the ARC-AGI-2 benchmark (a very hard test of fluid intelligence):
- Standard AI models (just the "Giant Brain") got about 16% right.
- Their new "Architect + Foreman" team got 24.4% right.
- When they combined this team with another smart solver using a "Meta-Classifier" (a referee that picks the best answer between the two), they hit 30.8%.
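The referee idea can be sketched as a tiny decision rule. The feature and the rule here are illustrative assumptions on my part; the paper's actual meta-classifier is not spelled out in this summary:

```python
# Hypothetical sketch of a "meta-classifier" referee: given answers
# from two solvers plus a simple feature about the symbolic run,
# decide which answer to submit.
def referee(symbolic_answer, llm_answer, symbolic_found_consistent_rule):
    # Trust the neuro-symbolic solver when it verified a rule on all
    # training examples; otherwise fall back to the other solver.
    if symbolic_found_consistent_rule and symbolic_answer is not None:
        return symbolic_answer
    return llm_answer

print(referee([[1]], [[2]], True))    # -> [[1]]
print(referee(None, [[2]], False))    # -> [[2]]
```

The design intuition: a symbolic rule that passed the consensus filter comes with a proof-like guarantee on the training pairs, so the referee should prefer it whenever one exists.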
The Big Takeaway
The paper argues that to build truly intelligent machines, we shouldn't just make bigger, smarter "black boxes" (bigger AI models). Instead, we should build systems that separate seeing from thinking.
- See first: Cleanly identify the objects.
- Think second: Apply a strict, limited set of logical rules to those objects.
- Verify: Make sure the rule works everywhere, not just once.
It's the difference between a student who guesses the answer based on a hunch, and a detective who gathers evidence, checks every clue against the facts, and solves the case with a logical chain that holds up in court.