Imagine you are a detective trying to solve a mystery. You are handed a finished drawing—a black-and-white picture of circles and squares on a white background. Your job isn't just to describe the picture; you have to write the exact computer code that a robot would use to draw that picture from scratch.

This is the challenge of ShapeCodeBench, a new "test" created by researcher Shivam Kumar to see how good modern AI models are at this specific task.

Here is a breakdown of how it works, why it's special, and what the results tell us, using simple analogies.

1. The Game: "Reverse-Engineering a Drawing"

Think of the AI models as students taking a very strict exam.

The Input: The student sees a picture (a "raster") of black shapes on a white canvas.
The Task: The student must type out a program using a tiny, specific language (a "DSL") that tells a computer how to draw those exact shapes.
The Rules: The language only has four moves: draw a filled circle, an outline circle, a filled square, or an outline square. The canvas is always 512x512 pixels.
The Grading: A computer doesn't just read the student's code; it runs it. It draws the picture again based on the code and compares the new drawing to the original. If even one pixel differs, the answer does not count as a perfect match, though it can still earn partial credit for how closely the two pictures overlap.

2. Why This Test is Different: The "Infinite Fresh Paper"

Most AI tests use a fixed set of questions (like a standard math test). If those questions leak into the material an AI is trained on, the AI can simply memorize the answers, and the test stops measuring anything. This is called "contamination."

ShapeCodeBench is like a magic drawing machine.

Every time you want a new test, you turn a crank by picking a new starting number (a "seed").
The machine instantly generates a brand-new, unique set of shapes with different sizes, overlaps, and positions.
Because researchers can generate a fresh held-out set from a new seed whenever they want, this reduces exact-instance contamination — the risk that the model has already seen the specific test questions during training.

3. The Difficulty Levels

The test has three levels, like a video game:

Easy: A few shapes, far apart, not touching.
Medium: More shapes, some getting close or slightly overlapping.
Hard: Many shapes, all crammed together, overlapping heavily, and some cut off by the edge of the page.

4. The Contestants

The paper tested two types of "students":

The Old-School Robot (Heuristic): A traditional computer program that looks at the picture, finds blobs of black ink, and guesses, "That's a circle," "That's a square." It's fast and good at simple things but gets confused when shapes overlap.
The Super-AI (Multimodal Models): Two frontier AI models that can read images (Claude Opus 4.7 and GPT-5.5) were asked to look at the picture and write the code. They were tested with different levels of "thinking effort" (like asking them to "think harder" or "take more time").

5. The Results: A Tale of Two Strengths

Neither side won everywhere: each type of student was better at a different part of the exam.

On Easy Levels: The Old-School Robot actually won! It was better at getting the exact code right for simple, non-overlapping shapes. The Super-AIs often got the shapes right but messed up the tiny details (like the radius being off by a few pixels).
- Analogy: The robot is like a carpenter who can perfectly measure a single, isolated board. The AI is like a creative artist who knows what a chair looks like but struggles to measure the legs to the millimeter.
On Hard Levels: When shapes were piled on top of each other, the Old-School Robot got confused and often saw one big blob instead of separate shapes. The Super-AIs retained more of the spatial structure in these harder scenes — especially as measured by foreground IoU (how much the painted regions of the two pictures overlap) — and produced code that captured the general layout of the pile. But neither side mastered hard scenes: even the Super-AIs still struggled to reconstruct the exact pixel-level details.
- Analogy: The robot sees a pile of laundry and says, "That's a pile." The AI sees the pile and says, "That's a shirt, a sock, and a hat all tangled together," but still struggles to separate them perfectly.
The "Perfect Score" Problem: Even the best AI model rarely got a 100% perfect score (where the re-drawn picture matches the original pixel-for-pixel). They usually got the structure right (the right shapes in the right places) but failed on the precision (the exact numbers for size and position).

6. What This Means

The paper concludes that we are not "done" with this problem.

The test is not saturated: the best AI configuration produced a pixel-perfect drawing on fewer than 3% of the scenes.
Current AI models are great at understanding the big picture (spatial structure) but still struggle with the tiny details (exact parameters).
The test provides a clear way to measure progress: as AI gets better, it should start beating the Old-School Robot on easy levels while maintaining its lead on hard levels.

In short, ShapeCodeBench is a fresh, renewable playground where we can see exactly where AI is strong (understanding complex scenes) and where it is still clumsy (measuring precise details).

Technical Summary: ShapeCodeBench

Problem Statement

The paper addresses the challenge of perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that, when re-rendered by a deterministic evaluator, produces an identical or near-identical image. While modern multimodal models are increasingly evaluated on image-to-code tasks (e.g., screenshot-to-HTML, structure extraction), existing benchmarks rarely combine deterministic execution, render-based scoring, and renewability. Most benchmarks satisfy only one or two of these criteria, and few allow for the regeneration of fresh, uncontaminated test sets without manual annotation. ShapeCodeBench is designed to fill this gap by providing a synthetic, renewable benchmark for inverse graphics over a constrained domain-specific language (DSL).

Methodology

1. Benchmark Design

ShapeCodeBench consists of four coupled components:

DSL (Domain-Specific Language): A minimal set of four primitives operating on a fixed $512 \times 512$ black-on-white canvas: filled_circle, circle, filled_square, and square. The language supports integer parameters for coordinates, size/radius, and stroke width. The parser is a strict whitelist implementation based on Python's ast module, rejecting imports, loops, and non-integer literals.
Scene Generator: A seeded random number generator (RNG) creates scenes by rejection-sampling candidate shapes. It enforces specific constraints based on three difficulty tiers (Easy, Medium, Hard) regarding shape count, extent (radius/size), stroke width, canvas clipping probability, and bounding-box overlap.
Renderer: Uses the Pillow library to deterministically render the DSL program into an 8-bit grayscale image. Shapes are drawn in program order, but because every primitive only adds black pixels to a white canvas, the final raster does not depend on that order (later shapes cannot erase earlier ones).
Evaluator: Parses the model's predicted program, re-renders it, and compares the resulting raster against the ground truth.
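
The whitelist parsing described for the DSL can be illustrated with a minimal sketch built on Python's ast module. This is not the benchmark's actual parser: the function name parse_program and the argument counts in ARITY are assumptions made for illustration, since the summary names the four primitives but not their exact signatures.

```python
import ast

# Assumed arities: (x, y, extent) for filled shapes, plus a stroke width for outlines.
ARITY = {"filled_circle": 3, "circle": 4, "filled_square": 3, "square": 4}


def parse_program(source):
    """Return a list of (primitive, args) pairs, or raise ValueError."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        raise ValueError(f"syntax error: {exc.msg}") from exc

    commands = []
    for stmt in tree.body:
        # Only bare call statements pass: imports, loops, and assignments are rejected.
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            raise ValueError(f"line {stmt.lineno}: only primitive calls are allowed")
        call = stmt.value
        if not (isinstance(call.func, ast.Name) and call.func.id in ARITY):
            raise ValueError(f"line {stmt.lineno}: unknown primitive")
        if call.keywords or len(call.args) != ARITY[call.func.id]:
            raise ValueError(f"line {stmt.lineno}: wrong arguments for {call.func.id}")
        args = []
        for node in call.args:
            # "type(...) is int" rejects floats, strings, and bools (bool subclasses int).
            # Negative literals parse as UnaryOp nodes, so this sketch rejects them too.
            if not (isinstance(node, ast.Constant) and type(node.value) is int):
                raise ValueError(f"line {stmt.lineno}: arguments must be integer literals")
            args.append(node.value)
        commands.append((call.func.id, tuple(args)))
    return commands
```

Because the source is only ever parsed and inspected, never passed to eval or exec, a hostile program cannot run code; it can only fail validation. Range checks on coordinates and stroke widths would sit on top of this step and are omitted here.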

2. Evaluation Metrics

The system reports five primary metrics:

Exact Match: Pixel-perfect equality between target and re-rendered image.
Pixel Accuracy: Fraction of matching pixels.
Foreground IoU: Intersection-over-Union of black pixels.
Parse Success: Whether the program is syntactically valid.
Execution Success: Whether the program renders without error.
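
The three raster-comparison metrics above can be sketched in a few lines. This is an illustrative version, not the benchmark's evaluator: the function name score, the list-of-rows image representation, and the convention that two empty canvases have an IoU of 1.0 are all assumptions.

```python
BLACK = 0  # foreground value on the black-on-white canvas


def score(target, predicted):
    """Compare two equally sized grayscale rasters given as lists of pixel rows."""
    if len(target) != len(predicted) or any(
        len(t_row) != len(p_row) for t_row, p_row in zip(target, predicted)
    ):
        raise ValueError("rasters must have the same shape")
    pairs = [
        (t, p)
        for t_row, p_row in zip(target, predicted)
        for t, p in zip(t_row, p_row)
    ]
    if not pairs:
        raise ValueError("rasters must be non-empty")

    matching = sum(t == p for t, p in pairs)
    intersection = sum(t == BLACK and p == BLACK for t, p in pairs)
    union = sum(t == BLACK or p == BLACK for t, p in pairs)
    return {
        "exact_match": matching == len(pairs),
        "pixel_accuracy": matching / len(pairs),
        # Assumed convention: two canvases with no black pixels count as IoU 1.0.
        "foreground_iou": intersection / union if union else 1.0,
    }


target = [[0, 255], [255, 255]]
predicted = [[0, 0], [255, 255]]
print(score(target, predicted))
```

On this 2x2 example one extra black pixel leaves pixel accuracy at 0.75 but drops foreground IoU to 0.5. On a mostly white canvas, pixel accuracy is dominated by background agreement, which is one reason to report foreground IoU alongside it. Parse success and execution success are determined earlier in the pipeline and are not computed here.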

3. Experimental Setup

The authors evaluated six systems on a frozen split (eval_v1) of 150 samples (50 per difficulty tier):

Baselines: An "Empty-Program" floor and a "Heuristic-CV" baseline (classical computer vision using connected components, morphological erosion, and area/perimeter ratios to estimate shape parameters).
Multimodal Models:
- Claude Opus 4.7 (1M context): Tested at "high" and "max" reasoning effort.
- GPT-5.5: Tested at "medium" and "extra_high" reasoning effort.
Protocol: All models used zero-shot prompting with strict formatting constraints. The prompts contained no chain-of-thought instructions and no few-shot examples.

Key Contributions

ShapeCodeBench Release: A complete benchmark suite including the DSL, a safe restricted parser, a seeded scene generator with three difficulty tiers, and a render-based evaluator.
Frozen Evaluation Split (eval_v1): A deterministic set of 150 samples with published SHA-256 hashes for exact reproducibility across platforms.
Renewable Workflow: A mechanism to generate fresh held-out splits from new seeds and score them automatically, mitigating exact-instance contamination without requiring human annotation.
Provider-Agnostic Runner: A tool to record prompts, configurations, raw outputs, and metrics, making evaluations auditable.
Baseline Results: Comprehensive reporting of four multimodal configurations against non-LLM baselines, revealing distinct failure modes and performance gaps.
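
The frozen-split and renewable-workflow ideas above can be sketched as a seeded generator plus a digest over its output. Everything here is a toy stand-in: the function names, the size ranges, the center-based coordinates, the non-overlap rule, and hashing the program text (rather than whatever artifacts the published SHA-256 hashes actually cover) are assumptions for illustration.

```python
import hashlib
import random

CANVAS = 512
PRIMITIVES = ("filled_circle", "circle", "filled_square", "square")


def boxes_overlap(a, b):
    """Axis-aligned boxes given as (left, top, right, bottom), inclusive."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def generate_program(rng, n_shapes, max_attempts=1000):
    """Rejection-sample shapes whose bounding boxes do not overlap."""
    lines, boxes = [], []
    for _ in range(max_attempts):
        if len(lines) == n_shapes:
            break
        name = rng.choice(PRIMITIVES)
        half = rng.randint(10, 40)  # radius or half side length (assumed range)
        cx = rng.randint(half, CANVAS - 1 - half)
        cy = rng.randint(half, CANVAS - 1 - half)
        box = (cx - half, cy - half, cx + half, cy + half)
        if any(boxes_overlap(box, other) for other in boxes):
            continue  # reject this candidate and sample again
        args = [cx, cy, half]
        if not name.startswith("filled_"):
            args.append(rng.randint(1, 5))  # stroke width for outline shapes
        boxes.append(box)
        lines.append(f"{name}({', '.join(map(str, args))})")
    return "\n".join(lines)


def generate_split(seed, n_samples, n_shapes=3):
    """The same seed always yields the same list of programs."""
    rng = random.Random(seed)
    return [generate_program(rng, n_shapes) for _ in range(n_samples)]


def split_digest(programs):
    """SHA-256 over the program texts, with a separator between programs."""
    digest = hashlib.sha256()
    for program in programs:
        digest.update(program.encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()
```

Calling generate_split with a fixed seed reproduces a frozen split whose digest can be published and checked, while a new seed produces a fresh held-out split that can be scored automatically because every target image comes with its ground-truth program.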

Results

Overall Performance

Exact Match: The benchmark is far from saturated. The best exact match rate achieved by any multimodal model is 0.027 (GPT-5.5 medium), or about 4 of the 150 samples, while the classical heuristic baseline achieves 0.087, or about 13 of 150.
Foreground IoU: Multimodal models substantially outperform the heuristic on this metric. GPT-5.5 (extra_high) achieves a mean foreground IoU of 0.87, retaining most spatial structure.
Parse Success: LLMs achieve high parse success rates (0.97–1.00), with failures primarily due to out-of-range parameters or invalid stroke widths.

Tier-Dependent Crossover

A critical finding is the tier-dependent crossover between the heuristic and LLMs:

Easy Tier: The classical heuristic leads in exact match (0.26) because scenes consist of separated, non-overlapping shapes that connected components can individuate perfectly. Multimodal models struggle here, often missing an exact match because of small parameter errors (values off by a few pixels).
Medium/Hard Tiers: The heuristic collapses as overlapping shapes fuse into single connected components, preventing individuation. Multimodal models retain spatial structure (high IoU) and can enumerate overlapping shapes, though they still fail to achieve pixel-perfect exact matches due to parameter precision issues under occlusion.
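
The individuation failure described above is easy to reproduce. The sketch below is not the paper's Heuristic-CV baseline; it only shows why any method that starts from connected components sees two overlapping discs as one blob. The 64-pixel canvas, the function names, and the disc coordinates are assumptions chosen to keep the example small.

```python
from collections import deque


def rasterize_discs(discs, size=64):
    """Return a size x size boolean grid, True where any filled disc covers the pixel."""
    return [
        [any((x - cx) ** 2 + (y - cy) ** 2 <= r * r for cx, cy, r in discs)
         for x in range(size)]
        for y in range(size)
    ]


def count_components(mask):
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            queue = deque([(x, y)])
            while queue:
                px, py = queue.popleft()
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
    return count


separated = rasterize_discs([(16, 32, 8), (48, 32, 8)])
overlapping = rasterize_discs([(26, 32, 8), (36, 32, 8)])
print(count_components(separated), count_components(overlapping))  # prints "2 1"
```

With the discs apart, each component maps to exactly one shape, so its parameters can be estimated in isolation. Once the discs overlap, the component count no longer equals the shape count, and any per-component estimate describes the fused blob rather than either circle.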

Failure Modes

LLMs: Rejected programs are dominated by "out_of_range" (coordinates/size outside valid bounds) and "invalid_stroke" errors. Among programs that do render, the models struggle with precise parameter estimation (e.g., exact radius or stroke width) and with distinguishing hollow from filled shapes when strokes are thin.
Heuristic: Fails to individuate overlapping or clipped shapes, leading to a sharp drop in IoU on harder tiers.

Significance and Claims

The paper positions ShapeCodeBench not as a replacement for existing benchmarks like TurtleBench or Image2Struct, but as a complementary tool that prioritizes control and reproducibility over realism.

Diagnostic Value: The benchmark successfully exposes distinct failure modes: LLMs struggle with precise parameter emission even when they understand the scene structure, while classical CV fails at individuation in complex scenes.
Renewability: By allowing the generation of fresh splits, the benchmark offers a sustainable feedback loop for model development and reduces the exact-instance contamination risk common in static datasets.
Modest Scope: The authors explicitly state that the current version is a "v1" with deliberate limitations (monochrome palette, four primitives, zero-shot only). They do not claim to have solved the problem of visual program induction but rather provide a rigorous, renewable environment to measure progress and identify specific gaps in perception versus structured code emission.

The paper concludes that current frontier models show promise in spatial reasoning (high IoU) but remain far from the precision required for exact program reconstruction. The gap between heuristic performance on simple tasks and LLM performance on complex tasks highlights the need for further research in parameter estimation and occlusion reasoning.

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes