Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

This paper introduces Bongard-RWR+, a large-scale dataset of 5,400 fine-grained real-world Bongard Problems generated via a vision-language model pipeline, which reveals that while state-of-the-art vision-language models can recognize coarse visual concepts, they consistently struggle with the abstract reasoning required for fine-grained distinctions.

Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk

Published 2026-02-20

Imagine you are playing a game of "Spot the Difference," but instead of finding the one odd picture out, you have to figure out the secret rule that separates two groups of pictures.

This is the essence of Bongard Problems, a classic brain teaser devised by Russian computer scientist Mikhail Bongard in the late 1960s.

  • The Setup: You see two boxes, each containing six pictures.
  • The Goal: Figure out what the pictures in the left box have in common that the pictures in the right box don't.
  • The Catch: The rule is usually abstract. It's not "these are dogs and those are cats." It's something like "the left side has shapes with sharp corners, and the right side has shapes with smooth curves."
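The structure above can be captured in a few lines of code. This is a minimal sketch; the field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class BongardProblem:
    """One puzzle: two sides of six images, split by a hidden abstract rule."""
    left: list[str]    # six images that satisfy the hidden rule
    right: list[str]   # six images that violate it
    rule: str          # the abstract concept separating the sides

# Hypothetical example instance
bp = BongardProblem(
    left=[f"left_{i}.png" for i in range(6)],
    right=[f"right_{i}.png" for i in range(6)],
    rule="left: shapes with sharp corners; right: shapes with smooth curves",
)
```

The solver's job is to recover `rule` given only `left` and `right`.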

The Problem with Old Games

For a long time, these puzzles were drawn with simple black-and-white lines (like stick figures). While good for testing logic, they didn't feel like the real world. Later, researchers tried using real photos, but the rules were too easy (e.g., "Left: A person driving a car. Right: A person walking").

Then came Bongard-RWR, which tried to use real photos to represent those tricky, abstract rules. But there was a snag: it was built by hand, so it only had 60 puzzles. That's like trying to test a new car engine by only driving it for 10 minutes. It's not enough data to know if the engine is truly reliable.

The Solution: Bongard-RWR+ (The "AI Factory")

This paper introduces Bongard-RWR+, a massive upgrade. The researchers built a "factory" that uses AI to create 5,400 new puzzles.

Here is how their factory works, step-by-step:

  1. The Translator (Pixtral-12B): They take an old, simple puzzle and ask an AI to describe the pictures in plain English.
  2. The Creative Writer (Text-to-Text AI): They ask the AI to rewrite those descriptions in 15 different ways, keeping the rule the same but changing the scene.
    • Original: "A tall building."
    • Rewrite 1: "A skyscraper piercing the clouds."
    • Rewrite 2: "A modern glass tower in a city."
    • Rewrite 3: "A high-rise apartment block."
  3. The Painter (Flux.1-dev): They feed these new descriptions into an image generator to create brand-new, realistic photos that follow the rule.
  4. The Quality Control (Humans): Humans look at the new photos. If a photo accidentally breaks the rule (e.g., the "tall building" photo has a tiny house in the corner that confuses the rule), it gets thrown in the trash.

The result? A giant library of 5,400 puzzles where the rules are abstract, but the images look like real life.
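The four-stage factory can be sketched as a simple loop. Every function body here is a placeholder stub; in the paper these roles are played by Pixtral-12B, a text-to-text LLM, Flux.1-dev, and human reviewers, not by these toy implementations.

```python
def describe_image(image_id: str) -> str:
    """Stage 1 (Translator): caption a source panel in plain English."""
    return f"a tall building (source: {image_id})"  # placeholder caption

def rewrite_description(caption: str, n: int = 15) -> list[str]:
    """Stage 2 (Creative Writer): n rephrasings that keep the rule intact."""
    return [f"{caption} -- variation {i}" for i in range(n)]

def generate_image(prompt: str) -> str:
    """Stage 3 (Painter): text-to-image generation; returns an image handle."""
    return f"img:{prompt}"

def passes_human_review(image: str) -> bool:
    """Stage 4 (Quality Control): keep only images that still obey the rule."""
    return "variation" in image  # placeholder acceptance criterion

def build_panels(image_id: str) -> list[str]:
    """Run one source panel through all four stages."""
    caption = describe_image(image_id)
    prompts = rewrite_description(caption)
    images = [generate_image(p) for p in prompts]
    return [img for img in images if passes_human_review(img)]

panels = build_panels("bp_001_left_03")
print(len(panels))  # at most 15 candidate panels per source image
```

The human-review stage is the bottleneck in practice, which is why automating stages 1-3 is what makes scaling from 60 to 5,400 puzzles feasible.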

The Big Test: Can AI Solve It?

The researchers used this new library to test the smartest AI vision models available today (like InternVL, Qwen, and LLaVA). They asked the AIs three types of questions:

  1. The Guess: "Does this new photo belong on the Left or the Right?"
  2. The Multiple Choice: "Here are 16 possible rules. Which one is the secret rule?"
  3. The Explanation: "Tell me in your own words what the rule is."
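The three question formats can be sketched as prompt builders. The exact wording and the size of the option pool are assumptions for illustration, not the paper's verbatim protocol.

```python
def side_classification_prompt(query_image: str) -> str:
    """Task 1 (The Guess): assign a held-out image to Left or Right."""
    return f"Given the two panels, does {query_image} belong on the Left or Right?"

def multiple_choice_prompt(options: list[str]) -> str:
    """Task 2 (Multiple Choice): pick the secret rule from a list of candidates."""
    lines = [f"{i + 1}. {rule}" for i, rule in enumerate(options)]
    return "Which rule separates the two sides?\n" + "\n".join(lines)

def free_form_prompt() -> str:
    """Task 3 (The Explanation): state the rule in natural language."""
    return "Describe, in your own words, the rule separating Left from Right."

# Hypothetical usage with a two-option pool
print(multiple_choice_prompt(["sharp corners vs. smooth curves",
                              "big objects vs. small objects"]))
```

The three formats probe different depths: Task 1 can be passed by shallow pattern matching, while Task 3 requires the model to actually articulate the abstraction.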

The Shocking Results

The findings were a bit of a reality check for the AI world:

  • The "Big Picture" AI: The AIs are pretty good at spotting obvious things. If the rule is "Big things vs. Small things," they get it right most of the time.
  • The "Fine Detail" AI: When the rule gets subtle—like "The lines curve slightly to the left" vs. "The lines curve slightly to the right"—the AIs start to hallucinate. They get confused, guess randomly, or make up rules that don't exist.
  • The "Explanation" Failure: When asked to explain the rule in text, the AIs struggled miserably. They could sometimes guess the right side, but they couldn't articulate why. It's like a student who can pick the right answer on a multiple-choice test but can't explain the math behind it.

Why This Matters

Think of current AI models as very observant tourists. They can tell you, "Oh, that's a red car," or "That's a tree." But they struggle to be detectives. They can't look at a scene, ignore the noise, and deduce the hidden, abstract logic connecting the dots.

Bongard-RWR+ is a new, harder gym for AI to train in. It proves that while our AI models are getting huge and powerful, they still lack the deep, human-like ability to reason about abstract patterns in the real world. Until they can solve these puzzles, they aren't truly "thinking" the way we do; they are just pattern-matching.

In short: We built a massive, AI-made puzzle book to test our AI's brain. The test showed that while our AI is smart, it's still not as good at abstract reasoning as a human child.
