ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

The paper introduces ShapeCodeBench, a renewable synthetic benchmark for perception-to-program reconstruction: models must generate an executable drawing program from a rendered image. It finds that current multimodal models preserve foreground structure but rarely achieve an exact match, largely because of small parameter errors.

Original authors: Shivam Kumar

Published 2026-05-13. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery. You are handed a finished drawing—a black-and-white picture of circles and squares on a white background. Your job isn't just to describe the picture; you have to write the exact computer code that a robot would use to draw that picture from scratch.

This is the challenge of ShapeCodeBench, a new "test" created by researcher Shivam Kumar to see how good modern AI models are at this specific task.

Here is a breakdown of how it works, why it's special, and what the results tell us, using simple analogies.

1. The Game: "Reverse-Engineering a Drawing"

Think of the AI models as students taking a very strict exam.

  • The Input: The student sees a picture (a "raster") of black shapes on a white canvas.
  • The Task: The student must type out a program using a tiny, specific language (a "DSL") that tells a computer how to draw those exact shapes.
  • The Rules: The language only has four moves: draw a filled circle, an outline circle, a filled square, or an outline square. The canvas is always 512x512 pixels.
  • The Grading: A computer doesn't just read the student's code; it runs it. It draws the picture again based on the code and compares the new drawing to the original. If even one pixel is in the wrong place, the answer isn't "perfect."
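To make the grading rule concrete, here is a minimal Python sketch of render-and-compare scoring. The paper's actual DSL syntax and renderer are not reproduced in this summary, so the Shape record, its size field (radius, or half the side length), the 2-pixel outline stroke, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

CANVAS = 512  # canvas side in pixels, as stated in the text


@dataclass(frozen=True)
class Shape:
    # Hypothetical encoding of one DSL move; the real syntax may differ.
    kind: str     # "circle" or "square"
    filled: bool  # True = filled, False = outline
    cx: int       # center x
    cy: int       # center y
    size: int     # radius, or half the side length (assumption)


def covers(shape, x, y, stroke=2):
    """Return True if the shape paints pixel (x, y). The stroke width is assumed."""
    dx, dy = abs(x - shape.cx), abs(y - shape.cy)
    inner = max(shape.size - stroke, 0)
    if shape.kind == "circle":
        d2 = dx * dx + dy * dy
        inside, in_hole = d2 <= shape.size ** 2, d2 <= inner ** 2
    else:
        d = max(dx, dy)
        inside, in_hole = d <= shape.size, d <= inner
    return inside and (shape.filled or not in_hole)


def render(program, n=CANVAS):
    """Execute a program (a list of Shape) into an n x n grid of booleans."""
    canvas = [[False] * n for _ in range(n)]
    for s in program:
        # Only visit the shape's bounding box, clipped to the canvas edges.
        for y in range(max(s.cy - s.size, 0), min(s.cy + s.size, n - 1) + 1):
            for x in range(max(s.cx - s.size, 0), min(s.cx + s.size, n - 1) + 1):
                if covers(s, x, y):
                    canvas[y][x] = True
    return canvas


def exact_match(target_program, predicted_program):
    """Strict grading: every pixel of the two renders must agree."""
    return render(target_program) == render(predicted_program)


target = [Shape("circle", True, 200, 200, 50), Shape("square", False, 350, 300, 40)]
near_miss = [Shape("circle", True, 200, 200, 51), Shape("square", False, 350, 300, 40)]
print(exact_match(target, target))     # True
print(exact_match(target, near_miss))  # False: the radius is off by one pixel
```

Because the comparison is made on rendered pixels rather than on the program text, two differently written programs that draw the same picture would both pass, while a one-pixel radius error fails.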

2. Why This Test is Different: The "Infinite Fresh Paper"

Most AI tests use a fixed set of questions (like a standard math test). Once those questions leak into a model's training data, the model can simply memorize the answers, and the score stops telling us much. This leakage is called "contamination."

ShapeCodeBench is like a magic drawing machine.

  • Every time you want a new test, you turn a crank (a "seed").
  • The machine instantly generates a brand-new, unique set of shapes with different sizes, overlaps, and positions.
  • Because researchers can generate a fresh held-out set from a new seed whenever they want, this reduces exact-instance contamination — the risk that the model has already seen the specific test questions during training.
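The "crank" idea can be sketched in a few lines of Python. This is not the paper's generator: the move names, the tuple layout, and the size range are assumptions chosen for illustration. The only point it demonstrates is that a seed fully determines a scene, so a new seed yields a new held-out test.

```python
import random

# Hypothetical names for the four moves described above.
MOVES = ("filled_circle", "outline_circle", "filled_square", "outline_square")


def generate_scene(seed, n_shapes=5, canvas=512):
    """Return a list of (move, cx, cy, size) tuples determined entirely by the seed."""
    rng = random.Random(seed)  # private generator, independent of global random state
    scene = []
    for _ in range(n_shapes):
        size = rng.randint(10, 60)  # size range is an illustrative assumption
        scene.append((rng.choice(MOVES), rng.randrange(canvas), rng.randrange(canvas), size))
    return scene


old_test = generate_scene(seed=2025)
same_again = generate_scene(seed=2025)
fresh_test = generate_scene(seed=2026)
print(old_test == same_again)  # True: a seed reproduces its scene exactly
print(old_test == fresh_test)  # False: a new seed gives a new scene
```

Reproducibility matters as much as freshness here: anyone who knows the seed can regenerate the same test and check a reported score.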

3. The Difficulty Levels

The test has three levels, like a video game:

  • Easy: A few shapes, far apart, not touching.
  • Medium: More shapes, some getting close or slightly overlapping.
  • Hard: Many shapes, all crammed together, overlapping heavily, and some cut off by the edge of the page.
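The three ingredients that separate the levels (how many shapes, how much they overlap, and whether any are cut off by the canvas edge) can each be measured directly. The Python sketch below is illustrative only: it does not use the paper's thresholds, which this summary does not state, and it approximates overlap with bounding boxes. The (cx, cy, half) tuple layout is an assumption.

```python
from itertools import combinations

CANVAS = 512  # canvas side in pixels, as stated in the text


def bbox(shape):
    """Bounding box of a shape given as (cx, cy, half), half = radius or half side."""
    cx, cy, half = shape
    return cx - half, cy - half, cx + half, cy + half


def boxes_overlap(a, b):
    """True if the two bounding boxes touch or intersect (an approximation of overlap)."""
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def is_clipped(shape, canvas=CANVAS):
    """True if part of the shape falls outside the canvas."""
    x0, y0, x1, y1 = bbox(shape)
    return x0 < 0 or y0 < 0 or x1 >= canvas or y1 >= canvas


def scene_stats(shapes):
    """Count the properties that the difficulty descriptions refer to."""
    return {
        "shapes": len(shapes),
        "overlapping_pairs": sum(boxes_overlap(a, b) for a, b in combinations(shapes, 2)),
        "clipped": sum(is_clipped(s) for s in shapes),
    }


easy_like = [(100, 100, 30), (400, 400, 30)]
hard_like = [(100, 100, 50), (130, 120, 50), (500, 250, 40)]
print(scene_stats(easy_like))  # {'shapes': 2, 'overlapping_pairs': 0, 'clipped': 0}
print(scene_stats(hard_like))  # {'shapes': 3, 'overlapping_pairs': 1, 'clipped': 1}
```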

4. The Contestants

The paper tested two types of "students":

  1. The Old-School Robot (Heuristic): A traditional computer program that looks at the picture, finds blobs of black ink, and guesses, "That's a circle," "That's a square." It's fast and good at simple things but gets confused when shapes overlap.
  2. The Super-AI (Multimodal Models): Two frontier multimodal models (Claude Opus 4.7 and GPT-5.5) were asked to look at the picture and write the code. Each was tested at several levels of "thinking effort" (like asking a student to "think harder" or "take more time").
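For a feel of how a blob-finding approach might work, and why overlap hurts it, here is a small Python sketch. It is not the paper's heuristic: the flood-fill grouping, the 0.9 fill-ratio threshold, and all function names are assumptions, and it only handles filled shapes. It relies on the fact that a filled square covers its whole bounding box, while a filled circle covers about 79% of it (π/4).

```python
from collections import deque


def find_blobs(img):
    """Return one pixel list per 4-connected group of foreground pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, blob = deque([(x, y)]), []
                while queue:
                    px, py = queue.popleft()
                    blob.append((px, py))
                    for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                        if 0 <= nx < w and 0 <= ny < h and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                blobs.append(blob)
    return blobs


def guess_shape(blob):
    """Guess (kind, cx, cy, half_size) from how fully the blob fills its bounding box."""
    xs, ys = [p[0] for p in blob], [p[1] for p in blob]
    bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
    fill = len(blob) / (bw * bh)
    kind = "filled_square" if fill > 0.9 else "filled_circle"  # threshold is assumed
    return kind, (min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2, max(bw, bh) // 2


def blank(n):
    return [[False] * n for _ in range(n)]


def paint_square(img, x0, y0, side):
    for y in range(y0, y0 + side):
        for x in range(x0, x0 + side):
            img[y][x] = True


def paint_disc(img, cx, cy, r):
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                img[y][x] = True


apart = blank(64)
paint_square(apart, 5, 5, 11)
paint_disc(apart, 40, 40, 10)
print([guess_shape(b) for b in find_blobs(apart)])
# [('filled_square', 10, 10, 5), ('filled_circle', 40, 40, 10)]

piled = blank(64)
paint_square(piled, 5, 5, 20)
paint_square(piled, 15, 15, 20)
print(len(find_blobs(piled)))  # 1: two overlapping squares merge into a single blob
```

The second scene shows the weakness in miniature: once two shapes touch, a purely blob-based method sees one object and has no direct way to recover the two programs that produced it.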

5. The Results: A Tale of Two Strengths

The results were surprising and showed that neither side is perfect yet.

  • On Easy Levels: The Old-School Robot actually won! It was better at getting the exact code right for simple, non-overlapping shapes. The Super-AIs often got the shapes right but messed up the tiny details (like the radius being off by a few pixels).

    • Analogy: The robot is like a carpenter who can perfectly measure a single, isolated board. The AI is like a creative artist who knows what a chair looks like but struggles to measure the legs to the millimeter.
  • On Hard Levels: When shapes were piled on top of each other, the Old-School Robot got confused and often saw one big blob instead of separate shapes. The Super-AIs retained more of the spatial structure in these harder scenes — especially as measured by foreground IoU (how much the painted regions of the two pictures overlap) — and produced code that captured the general layout of the pile. But neither side mastered hard scenes: even the Super-AIs still struggled to reconstruct the exact pixel-level details.

    • Analogy: The robot sees a pile of laundry and says, "That's a pile." The AI sees the pile and says, "That's a shirt, a sock, and a hat all tangled together," but still struggles to separate them perfectly.
  • The "Perfect Score" Problem: Even the best AI models rarely got a 100% perfect score (where the re-drawn picture matches the original pixel-for-pixel). They usually got the structure right (the right shapes in the right places) but failed on the precision (the exact numbers for size and position).
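The gap between "structure right" and "perfect score" is easy to see with numbers. The Python sketch below computes foreground IoU, the number of pixels painted in both pictures divided by the number painted in either, for a circle whose radius is guessed 2 pixels too small. The helper names are hypothetical, and scoring two blank pictures as 1.0 is a convention assumed here, not taken from the paper.

```python
def disc(cx, cy, r):
    """Set of pixels covered by a filled circle (stands in for a rendered picture)."""
    return {(x, y)
            for x in range(cx - r, cx + r + 1)
            for y in range(cy - r, cy + r + 1)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r}


def foreground_iou(target, predicted):
    """Intersection over union of the two painted pixel sets."""
    union = target | predicted
    # Treating two blank pictures as a perfect match is an assumed convention.
    return len(target & predicted) / len(union) if union else 1.0


target = disc(256, 256, 40)
guess = disc(256, 256, 38)  # right shape, right place, radius off by 2 pixels

print(round(foreground_iou(target, guess), 2))  # high, but below 1.0
print(target == guess)                          # False: not an exact match
```

A guess like this scores about 0.9 on IoU (roughly the ratio of the two areas, (38/40)²) yet still counts as a failure under exact match, which is the pattern the results describe.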

6. What This Means

The paper concludes that we are not "done" with this problem.

  • The test is not saturated: no contestant is close to a perfect score, so there is still room to measure improvement.
  • Current AI models are great at understanding the big picture (spatial structure) but still struggle with the tiny details (exact parameters).
  • The test provides a clear way to measure progress: as AI gets better, it should start beating the Old-School Robot on easy levels while maintaining its lead on hard levels.

In short, ShapeCodeBench is a fresh, renewable playground where we can see exactly where AI is strong (understanding complex scenes) and where it is still clumsy (measuring precise details).
