Teaching an Agent to Sketch One Part at a Time

Imagine you want to teach a robot to draw a picture of a chair, but instead of handing it a blank canvas and saying "Draw a chair," you want to guide it like a human teacher: "First, draw the legs. Good. Now, draw the seat. Perfect. Finally, add the backrest."

That is exactly what this paper, "Teaching an Agent to Sketch One Part at a Time," is about. The researchers built an AI agent that doesn't just spit out a finished drawing all at once; it learns to build a sketch piece by piece, just like a human artist.

Here is the breakdown of how they did it, using some everyday analogies:

1. The Problem: The "All-or-Nothing" Artist

Most AI drawing tools today work like a magic 8-ball: you ask for a "chair," and poof, a full image appears. If the legs look weird, you have to ask for the whole thing again. It's like ordering a pizza and finding the pepperoni in the wrong place: you can't ask the chef to move it, so you have to order a whole new pizza.

Existing methods that try to draw step-by-step are often limited. They might draw simple stick figures or icons, but they struggle to create detailed, professional-looking vector art (the kind of crisp, scalable graphics used in design software).

2. The Solution: The "Part-by-Part" Apprentice

The authors created a new AI agent that acts like a skilled apprentice. Instead of guessing the whole picture, it follows a recipe:

- Read the instruction: "Draw a chair."
- Break it down: "Okay, I need legs, a seat, and a back."
- Draw one part: It draws the legs.
- Look at the canvas: It sees the legs it just drew.
- Draw the next part: It draws the seat, making sure it connects to the legs.
- Repeat: It keeps going until the chair is done.

This allows for local editing. If the seat looks wrong, you can tell the AI, "Erase the seat and try a different shape," without destroying the legs you already liked.
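
To make the loop concrete, here is a minimal Python sketch of the recipe above. It is an illustration, not the authors' code: `toy_policy` is a hypothetical stand-in for the trained agent, and storing the canvas as a dictionary from part name to SVG path strings is an assumption made for this example.

```python
from typing import Callable, Dict, List

# Assumed representation: part name -> SVG path strings that draw that part.
Canvas = Dict[str, List[str]]
Policy = Callable[[str, Canvas], List[str]]


def toy_policy(part: str, canvas: Canvas) -> List[str]:
    """Hypothetical stand-in for the agent: one placeholder line per part.

    It "looks at the canvas" only by counting the parts already drawn.
    """
    y = 10 * len(canvas)
    return [f"M 0 {y} L 100 {y}"]


def sketch(parts: List[str], policy: Policy) -> Canvas:
    """Draw the parts in order; each step sees everything drawn so far."""
    canvas: Canvas = {}
    for part in parts:
        canvas[part] = policy(part, canvas)
    return canvas


def redraw_part(canvas: Canvas, part: str, policy: Policy) -> Canvas:
    """Local edit: regenerate one part and leave every other part untouched."""
    others = {name: paths for name, paths in canvas.items() if name != part}
    edited = dict(canvas)
    edited[part] = policy(part, others)
    return edited


chair = sketch(["legs", "seat", "backrest"], toy_policy)
rounder = redraw_part(chair, "seat", lambda part, canvas: ["M 0 10 Q 50 0 100 10"])
```

Because each part lives under its own key, `redraw_part` can swap out the seat while the legs stay exactly as they were, which is the local editing idea in miniature.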

3. The Secret Sauce: A New "Textbook" (ControlSketch-Part)

To teach the AI this skill, they needed a massive textbook. But there was a problem: no one had a dataset of professional sketches broken down into labeled parts (e.g., "these lines are the legs," "these lines are the seat").

The Analogy: Imagine trying to teach a student to write an essay, but you only have a pile of finished essays with no outlines or chapter breaks. It's hard to learn how to write.

The Fix: The team built a smart annotation pipeline. They used a powerful AI (a Vision-Language Model) to look at existing sketches and automatically:

- Deconstruct the image into parts (like separating a car into wheels, doors, and windows).
- Label the lines (paths) that make up each part.
- Critique its own work to make sure the labels are accurate.

They did this for thousands of sketches, creating a new dataset called ControlSketch-Part. It's like turning a pile of finished essays into a library of essays with detailed outlines and highlighted sections.
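
The propose-then-critique idea can be sketched in a few lines of Python. Everything named here is hypothetical: `toy_propose` stands in for the Vision-Language Model, and the critique is a simple rule check (every path must belong to exactly one part) swapped in for the model's own review. That rule is an assumption for this example, not something the summary above states.

```python
from typing import Callable, Dict, List, Optional

# Assumed label format: part name -> indices of the paths that draw that part.
Labels = Dict[str, List[int]]


def find_problems(labels: Labels, num_paths: int) -> List[str]:
    """Rule-based critique: each path index must appear in exactly one part."""
    owner: Dict[int, str] = {}
    problems: List[str] = []
    for part, indices in labels.items():
        for i in indices:
            if not 0 <= i < num_paths:
                problems.append(f"path {i} in '{part}' does not exist")
            elif i in owner:
                problems.append(f"path {i} is in both '{owner[i]}' and '{part}'")
            else:
                owner[i] = part
    missing = [i for i in range(num_paths) if i not in owner]
    if missing:
        problems.append(f"unlabeled paths: {missing}")
    return problems


def annotate(num_paths: int,
             propose: Callable[[List[str]], Labels],
             max_rounds: int = 3) -> Optional[Labels]:
    """Ask for labels, critique them, and retry with the feedback."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        labels = propose(feedback)
        feedback = find_problems(labels, num_paths)
        if not feedback:
            return labels
    return None  # give up: this sketch is left out of the dataset


def toy_propose(feedback: List[str]) -> Labels:
    """Hypothetical annotator: forgets path 3 until the critique points it out."""
    if not feedback:
        return {"legs": [0, 1], "seat": [2]}
    return {"legs": [0, 1], "seat": [2], "backrest": [3]}


labels = annotate(4, toy_propose)
```

Returning `None` after a few failed rounds is one possible design choice: it keeps doubtful annotations out of the "textbook" rather than letting them in.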

4. The Training: "Practice" and "Coaching"

Training the AI happened in two distinct phases, similar to how a human learns a sport:

Phase 1: Supervised Fine-Tuning (The Drill Sergeant)
The AI is shown the "textbook" (the ControlSketch-Part dataset). It learns the rules: "When I see the word 'legs', I must output these specific lines." It learns the format and how to draw one part at a time. This is like memorizing the rules of the game.
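
As a rough picture of what one "textbook page" might look like, the sketch below turns a single annotated drawing into one training example per part. The field names (`instruction`, `part`, `canvas`, `target`) are assumptions for illustration; the paper's actual data format is not described here.

```python
from typing import Dict, List


def sft_examples(instruction: str, parts: Dict[str, List[str]]) -> List[dict]:
    """Build one example per part: given the canvas so far, predict the next part.

    `parts` maps part names to SVG path strings, in drawing order.
    """
    examples: List[dict] = []
    canvas: List[str] = []  # paths copied from the dataset, never from the model
    for name, paths in parts.items():
        examples.append({
            "instruction": instruction,
            "part": name,
            "canvas": list(canvas),  # what the model sees
            "target": list(paths),   # what the model should draw next
        })
        canvas.extend(paths)
    return examples


examples = sft_examples(
    "a chair",
    {"legs": ["M 0 0 L 0 40", "M 30 0 L 30 40"], "seat": ["M 0 0 L 30 0"]},
)
```

Note that the `canvas` in every example comes straight from the dataset, so the model only ever practices on flawless earlier steps.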

Phase 2: Reinforcement Learning with Process Rewards (The Coach)
This is the innovative part. In the first phase, the AI only saw "perfect" intermediate steps (like looking at a finished puzzle and seeing the pieces already placed correctly). But in real life, the AI has to draw the first piece, then the second, and it might make a mistake early on.

To fix this, the researchers used a technique called GRPO (Group Relative Policy Optimization).
- The Analogy: Imagine a coach watching a player practice. Instead of only grading the player at the end of the game, the coach gives feedback during the game. "Good pass! But your footwork on that second step was sloppy."
- How it works: The AI draws a sketch in multiple steps. After every single step (every part added), the system checks: "Does this partial drawing look like the real thing so far?" If yes, it gets a reward. If the drawing starts to look weird, it gets a penalty. This "dense feedback" teaches the AI to correct itself as it goes, rather than waiting until the end to realize it failed.
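
The "group relative" part of GRPO means each attempt is graded against the other attempts at the same task, not against a fixed bar. Here is a minimal sketch of that idea applied to per-step rewards. Normalizing each step separately across the group is an assumption made for this example, and the reward numbers are made up; the paper's exact formula may differ.

```python
from statistics import mean, pstdev
from typing import List


def step_advantages(rewards: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """rewards[i][t] is the score of attempt i after it adds part t.

    Returns, for every step, how much better or worse each attempt did than
    the group average at that step, scaled by the group's spread.
    Assumes every attempt has the same number of steps.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for t in range(num_steps):
        scores = [attempt[t] for attempt in rewards]
        mu, sigma = mean(scores), pstdev(scores)
        for i, score in enumerate(scores):
            advantages[i][t] = (score - mu) / (sigma + eps)
    return advantages


# Two attempts at a two-part sketch (made-up scores).
# Attempt 0 draws much better legs; both draw an equally good seat.
group = [[0.9, 0.8],
         [0.1, 0.8]]
adv = step_advantages(group)
```

In this toy group, the first attempt gets a positive signal for its legs and the second a negative one, while the seat step gives no signal because both attempts scored the same there. That is the coach's mid-game feedback: credit lands on the step that made the difference.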

5. The Result: A Controllable, Creative Artist

The authors report three improvements. The new agent:

- Draws better: It creates complex, smooth, and realistic vector sketches that look more professional than those from previous methods.
- Listens better: It follows text instructions more accurately.
- Is editable: You can ask it to "change the backrest to be round" or "remove the arms," and it changes that part while leaving the rest of the drawing alone.

Summary

In short, the authors realized that to teach an AI to draw like a human, you can't just show it the final picture. You have to teach it how to build the picture, one brick at a time. They built a new "textbook" of broken-down sketches and a "coaching system" that gives feedback at every single step. The result is an AI that doesn't just generate art; it constructs it, giving users part-by-part control that one-shot generators do not offer.
