Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

This paper introduces Bongard-RWR+, a large-scale dataset of 5,400 fine-grained real-world Bongard Problems generated via a vision-language model pipeline, which reveals that while state-of-the-art vision-language models can recognize coarse visual concepts, they consistently struggle with the abstract reasoning required for fine-grained distinctions.

Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk

Published 2026-02-20

Imagine you are playing a game of "Spot the Difference," but instead of finding the one odd picture out, you have to figure out the secret rule that separates two groups of pictures.

This is the essence of Bongard Problems, a classic brain teaser devised by Russian computer scientist Mikhail Bongard in the late 1960s.

  • The Setup: You see two boxes, each containing six pictures.
  • The Goal: Figure out what the pictures in the left box have in common that the pictures in the right box don't.
  • The Catch: The rule is usually abstract. It's not "these are dogs and those are cats." It's something like "the left side has shapes with sharp corners, and the right side has shapes with smooth curves."
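The structure above can be captured in a few lines of code. This is a minimal sketch; the field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class BongardProblem:
    """One puzzle: two sides of six images, split by a hidden abstract rule."""
    left: list[str]    # six images that satisfy the hidden rule
    right: list[str]   # six images that violate it
    rule: str          # the abstract concept separating the sides

# Hypothetical example instance
bp = BongardProblem(
    left=[f"left_{i}.png" for i in range(6)],
    right=[f"right_{i}.png" for i in range(6)],
    rule="left: shapes with sharp corners; right: shapes with smooth curves",
)
```

The solver's job is to recover `rule` given only `left` and `right`.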

The Problem with Old Games

For a long time, these puzzles were drawn with simple black-and-white lines (like stick figures). While good for testing logic, they didn't feel like the real world. Later, researchers tried using real photos, but the rules were too easy (e.g., "Left: A person driving a car. Right: A person walking").

Then came Bongard-RWR, which tried to use real photos to represent those tricky, abstract rules. But there was a snag: it was built by hand, so it only had 60 puzzles. That's like trying to test a new car engine by only driving it for 10 minutes. It's not enough data to know if the engine is truly reliable.

The Solution: Bongard-RWR+ (The "AI Factory")

This paper introduces Bongard-RWR+, a massive upgrade. The researchers built a "factory" that uses AI to create 5,400 new puzzles.

Here is how their factory works, step-by-step:

  1. The Translator (Pixtral-12B): They take an old, simple puzzle and ask an AI to describe the pictures in plain English.
  2. The Creative Writer (Text-to-Text AI): They ask the AI to rewrite those descriptions in 15 different ways, keeping the rule the same but changing the scene.
    • Original: "A tall building."
    • Rewrite 1: "A skyscraper piercing the clouds."
    • Rewrite 2: "A modern glass tower in a city."
    • Rewrite 3: "A high-rise apartment block."
  3. The Painter (Flux.1-dev): They feed these new descriptions into an image generator to create brand-new, realistic photos that follow the rule.
  4. The Quality Control (Humans): Humans look at the new photos. If a photo accidentally breaks the rule (e.g., the "tall building" photo has a tiny house in the corner that confuses the rule), it gets thrown in the trash.

The result? A giant library of 5,400 puzzles where the rules are abstract, but the images look like real life.
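The four-stage factory can be sketched as a simple loop. Every function body here is a placeholder stub; in the paper these roles are played by Pixtral-12B, a text-to-text LLM, Flux.1-dev, and human reviewers, not by these toy implementations.

```python
def describe_image(image_id: str) -> str:
    """Stage 1 (Translator): caption a source panel in plain English."""
    return f"a tall building (source: {image_id})"  # placeholder caption

def rewrite_description(caption: str, n: int = 15) -> list[str]:
    """Stage 2 (Creative Writer): n rephrasings that keep the rule intact."""
    return [f"{caption} -- variation {i}" for i in range(n)]

def generate_image(prompt: str) -> str:
    """Stage 3 (Painter): text-to-image generation; returns an image handle."""
    return f"img:{prompt}"

def passes_human_review(image: str) -> bool:
    """Stage 4 (Quality Control): keep only images that still obey the rule."""
    return "variation" in image  # placeholder acceptance criterion

def build_panels(image_id: str) -> list[str]:
    """Run one source panel through all four stages."""
    caption = describe_image(image_id)
    prompts = rewrite_description(caption)
    images = [generate_image(p) for p in prompts]
    return [img for img in images if passes_human_review(img)]

panels = build_panels("bp_001_left_03")
print(len(panels))  # at most 15 candidate panels per source image
```

The human-review stage is the bottleneck in practice, which is why automating stages 1-3 is what makes scaling from 60 to 5,400 puzzles feasible.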

The Big Test: Can AI Solve It?

The researchers used this new library to test the smartest AI vision models available today (like InternVL, Qwen, and LLaVA). They asked the AIs three types of questions:

  1. The Guess: "Does this new photo belong on the Left or the Right?"
  2. The Multiple Choice: "Here are 16 possible rules. Which one is the secret rule?"
  3. The Explanation: "Tell me in your own words what the rule is."
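The three question formats can be sketched as prompt builders. The exact wording and the size of the option pool are assumptions for illustration, not the paper's verbatim protocol.

```python
def side_classification_prompt(query_image: str) -> str:
    """Task 1 (The Guess): assign a held-out image to Left or Right."""
    return f"Given the two panels, does {query_image} belong on the Left or Right?"

def multiple_choice_prompt(options: list[str]) -> str:
    """Task 2 (Multiple Choice): pick the secret rule from a list of candidates."""
    lines = [f"{i + 1}. {rule}" for i, rule in enumerate(options)]
    return "Which rule separates the two sides?\n" + "\n".join(lines)

def free_form_prompt() -> str:
    """Task 3 (The Explanation): state the rule in natural language."""
    return "Describe, in your own words, the rule separating Left from Right."

# Hypothetical usage with a two-option pool
print(multiple_choice_prompt(["sharp corners vs. smooth curves",
                              "big objects vs. small objects"]))
```

The three formats probe different depths: Task 1 can be passed by shallow pattern matching, while Task 3 requires the model to actually articulate the abstraction.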

The Shocking Results

The findings were a bit of a reality check for the AI world:

  • The "Big Picture" AI: The AIs are pretty good at spotting obvious things. If the rule is "Big things vs. Small things," they get it right most of the time.
  • The "Fine Detail" AI: When the rule gets subtle—like "The lines curve slightly to the left" vs. "The lines curve slightly to the right"—the AIs start to hallucinate. They get confused, guess randomly, or make up rules that don't exist.
  • The "Explanation" Failure: When asked to explain the rule in text, the AIs struggled miserably. They could sometimes guess the right side, but they couldn't articulate why. It's like a student who can pick the right answer on a multiple-choice test but can't explain the math behind it.

Why This Matters

Think of current AI models as very observant tourists. They can tell you, "Oh, that's a red car," or "That's a tree." But they struggle to be detectives. They can't look at a scene, ignore the noise, and deduce the hidden, abstract logic connecting the dots.

Bongard-RWR+ is a new, harder gym for AI to train in. It proves that while our AI models are getting huge and powerful, they still lack the deep, human-like ability to reason about abstract patterns in the real world. Until they can solve these puzzles, they aren't truly "thinking" the way we do; they are just pattern-matching.

In short: We built a massive, AI-made puzzle book to test our AI's brain. The test showed that while our AI is smart, it's still not as good at abstract reasoning as a human child.
