Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models

This paper introduces Spatial Colour Mixing, a stress test showing that Vision-Language Models suffer systematic perceptual failures under structured colour distortions where humans remain robust, and that simple perception-aware preprocessing can significantly improve model reliability.

Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi

Published 2026-03-09

The Big Idea: Tricking the AI's Eyes

Imagine you have a super-smart robot that can look at a picture and tell you exactly what it sees. It's great at recognizing cats, dogs, and famous landmarks. But, like a human who has never seen a specific optical illusion, this robot has a weird blind spot.

The researchers in this paper decided to stress-test these robots (called Vision-Language Models or VLMs) by showing them pictures that look like a "glitchy" TV screen. They didn't change the actual object in the photo; they just painted over it with a colorful, striped pattern.

The Analogy: Think of a photo of a cat. Now, imagine someone paints thin, vertical stripes of red, green, and blue over the cat's face.

  • If you stand close up: The cat looks like a mess of colorful lines.
  • If you step back and squint: The colors blend together, and suddenly, you can clearly see the cat again.

Humans can do this "stepping back" trick naturally. The AI, however, gets confused. It sees the stripes and says, "I don't see a cat! I see a dog!" or "I see a painting by Jackson Pollock!" even though the cat is right there.

What Did They Do?

The team created a new "test" called Spatial Colour Mixing. They took real photos and overlaid them with eight different types of colorful patterns (like a grid, stripes, or a checkerboard).
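The paper doesn't include its generation code, but an overlay of this kind is easy to sketch: each pattern is just a rule mapping a pixel coordinate to a colour from a small palette, which is then blended with the original photo. The pattern names, palette, and blend strength below are my own illustrative assumptions, not the paper's exact settings:

```python
# Illustrative sketch (not the paper's code) of colour-pattern overlays.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # red, green, blue

def pattern_index(x, y, kind, cell=4):
    """Pick a palette index for pixel (x, y) under a given pattern."""
    if kind == "vertical_stripes":
        return (x // cell) % len(PALETTE)
    if kind == "horizontal_stripes":
        return (y // cell) % len(PALETTE)
    if kind == "checkerboard":
        return ((x // cell) + (y // cell)) % len(PALETTE)
    raise ValueError(kind)

def blend(photo_px, overlay_px, alpha=0.5):
    """Tint the original pixel with the overlay colour.
    The object is still 'there' at every pixel, just recoloured."""
    return tuple(
        round((1 - alpha) * p + alpha * o)
        for p, o in zip(photo_px, overlay_px)
    )

# Paint a red stripe over a flat grey pixel:
px = blend((128, 128, 128), PALETTE[pattern_index(0, 0, "vertical_stripes")])
print(px)  # (192, 64, 64) — grey shifted toward red, object info preserved
```

The key point the sketch makes concrete: the blend never destroys the underlying pixel, it only tints it, so the object remains recoverable in principle.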

They tested nine different AI models (including popular ones like LLaVA and Qwen) on four types of images:

  1. Animals (Cats, dogs, bears)
  2. Art (Famous paintings)
  3. Landmarks (The Eiffel Tower, Pyramids)
  4. General Questions (A mix of everything)

The Shocking Results

  1. The AI Crumbled: As soon as they added the colorful stripes, the AI's accuracy dropped like a stone. Even a tiny bit of distortion made the AI guess wildly wrong.
  2. Bigger isn't Better: The researchers thought, "Maybe if we make the AI brain bigger, it will be smarter." They tested small, medium, and giant versions of the models. It didn't help. A giant AI was just as confused as a small one when faced with these color tricks.
  3. Humans vs. Robots: They asked 61 humans to look at the same tricked-up pictures. Humans did great! They could still identify the animals easily. The AI, however, was completely lost. This proves that human eyes and AI eyes work in totally different ways.

Why Does This Happen?

The paper suggests that human vision is like a smart editor. When we see a messy, striped image, our brain says, "Okay, ignore the messy lines; let's look at the big shape." We naturally filter out the noise.

The AI, on the other hand, is like a very literal accountant. It counts every single pixel. If the pixels are red and green stripes, it gets stuck on those stripes and forgets the shape of the animal underneath. It doesn't know how to "squint" or "step back."

Can We Fix It?

The researchers tried a clever workaround. Since humans fix this by "stepping back" (which blurs the image slightly), they tried blurring the image or shrinking it down and making it bigger again before showing it to the AI.

  • The Result: It worked! For some types of color tricks, this simple "blur" step helped the AI get the answer right again.
  • The Catch: They also tried giving the AI a "tool" (like a Python code interpreter) to fix the image itself. But the AI didn't know when to use the tool. It kept trying to solve the puzzle with its confused eyes, even when it was failing. It lacked the self-awareness to say, "Hey, this picture looks weird; I should try a different approach."
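The "blur" fix above can be made concrete with a toy example. This is my own minimal sketch, assuming the simplest possible version of the idea (1-D pixels, average pooling standing in for blur-and-downscale), not the paper's actual preprocessing code:

```python
# Toy sketch of the "step back and squint" fix: average pooling blends
# harsh stripe values back toward the underlying object colour.

def stripe_overlay(base, colours, period=1):
    """Replace each pixel with a repeating stripe colour."""
    return [colours[(i // period) % len(colours)] for i in range(len(base))]

def average_pool(pixels, window):
    """Downscale by averaging non-overlapping windows (the 'squint')."""
    return [
        sum(pixels[i:i + window]) / window
        for i in range(0, len(pixels) - window + 1, window)
    ]

base = [128] * 8                                  # a flat grey "object"
striped = stripe_overlay(base, colours=[255, 0])  # harsh 255/0 stripes
pooled = average_pool(striped, window=2)

print(striped)  # [255, 0, 255, 0, 255, 0, 255, 0]
print(pooled)   # [127.5, 127.5, 127.5, 127.5] — back near the original 128
```

Averaging over a window wider than the stripe period cancels the stripes while barely touching the broad shapes underneath, which is exactly why this cheap preprocessing step restores accuracy on some pattern types.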

The Takeaway

This paper isn't saying AI is useless. It's saying that AI is fragile. It can be easily fooled by visual tricks that a toddler would ignore.

To make AI safer and more reliable, we can't just make the models bigger. We need to:

  1. Teach them to "squint": Give them tools to preprocess images (blur them) before looking at them.
  2. Teach them self-awareness: Help them realize when they are confused so they can ask for help or try a different method.

In short: AI sees the pixels; humans see the picture. Until AI learns to see the picture, it will keep getting tripped up by these colorful illusions.