GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

GoT-R1 is a framework that uses reinforcement learning with a dual-stage, multi-dimensional reward system to strengthen the semantic-spatial reasoning of multimodal large language models, significantly improving their ability to generate images from complex prompts involving precise object relationships and attributes.

Original authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are an artist who has just been hired to paint a very specific scene based on a client's description. The client says, "Paint a blue butterfly sitting on the left side of a red candle."

In the past, AI artists (like the ones we have today) were like improvisational jazz musicians. They heard the words and immediately started painting. They were great at making things look realistic, but if the prompt was complex, they often got the details wrong. They might paint the butterfly on the right side, or make it green instead of blue, because they were just guessing the layout as they went along.

Then, researchers introduced a system called GoT (Generation Chain-of-Thought). This was like giving the artist a strict script to follow before painting. The AI would first write down a plan: "Okay, I need a red candle at coordinates (100, 100) and a blue butterfly at (50, 50)." Then, it would paint based on that script. This helped, but the script was written by humans using fixed templates. The AI couldn't think outside the box; if the script was slightly off, the painting was still wrong.
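To make the "script" idea concrete, such a plan can be pictured as a small structured object the model writes before it paints. Here is an illustrative sketch in Python; the field names and format are placeholders, not the paper's exact schema.

```python
# An illustrative GoT-style "script": a grounded plan written before
# any painting happens. Field names are placeholders for illustration,
# not the exact format used in the paper.
plan = {
    "description": "A blue butterfly rests on the left side of a red candle.",
    "objects": [
        {"name": "red candle",     "center": (100, 100)},
        {"name": "blue butterfly", "center": (50, 50)},
    ],
}
```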

Enter GoT-R1.

Think of GoT-R1 as taking that artist and putting them through a high-intensity "Trial and Error" boot camp using a very smart, critical art critic (an AI called an MLLM).

Here is how it works, step-by-step:

1. The "Brainstorming" Phase (Reinforcement Learning)

Instead of just following a script, the AI is told: "Here is the prompt. Now, try to write 16 different plans (reasoning chains) for how to paint this."

Some plans will be terrible (butterfly on the wrong side). Some will be okay. One might be perfect.
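Here is a minimal sketch of that sampling step, assuming a Hugging Face-style text model; the model name and prompt are placeholders, not the paper's actual setup.

```python
# "Brainstorming": sample a group of candidate plans for one prompt.
# Assumes a Hugging Face-style causal LM; names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

GROUP_SIZE = 16  # number of candidate plans sampled per prompt

tokenizer = AutoTokenizer.from_pretrained("placeholder/mllm-policy")
model = AutoModelForCausalLM.from_pretrained("placeholder/mllm-policy")

prompt = "Plan an image: a blue butterfly on the left side of a red candle."
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling with temperature > 0 keeps the group diverse: some plans
# will be wrong, some close, and maybe one near-perfect.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    max_new_tokens=256,
    num_return_sequences=GROUP_SIZE,
)
plans = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```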

2. The "Critic" Phase (The Reward System)

This is where the magic happens. In previous systems, the AI only got a grade at the very end: "Did the final painting look good?"

GoT-R1 introduces a Dual-Stage Critic. The smart AI critic looks at the process in two ways:

  • The Plan Check: Does the written plan actually match what the client asked for? (e.g., Did the plan say "left" when the client said "left"?)
  • The Execution Check: Did the final painting match the plan? (e.g., If the plan said "left," did the paint actually go on the left?)

The critic gives a score for every single step. If the AI writes a plan that puts the butterfly on the right, it gets a low score immediately, even before it paints a single pixel. If the plan is good but the painting is messy, it gets a low score there too.
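To make the two checks concrete, here is a minimal sketch of a dual-stage reward. The `judge` function is a stand-in for a call to the MLLM critic, and the equal weighting is an assumption for illustration; the paper's actual reward combines several semantic and spatial dimensions.

```python
# Illustrative dual-stage reward: grade the plan against the prompt,
# the image against the plan, and the image against the prompt.
# judge() is a stand-in for an MLLM critic; weighting is an assumption.
def judge(question: str) -> float:
    """Placeholder: ask the MLLM critic, get back a score in [0, 1]."""
    raise NotImplementedError

def dual_stage_reward(prompt: str, plan: str, image_path: str) -> float:
    plan_vs_prompt = judge(f"Does this plan satisfy the prompt?\nPrompt: {prompt}\nPlan: {plan}")
    image_vs_plan = judge(f"Does the image at {image_path} follow this plan?\nPlan: {plan}")
    image_vs_prompt = judge(f"Does the image at {image_path} match the prompt?\nPrompt: {prompt}")
    # A bad plan is penalized before a single pixel is judged; a good
    # plan executed badly is penalized at the second stage instead.
    return (plan_vs_prompt + image_vs_plan + image_vs_prompt) / 3
```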

3. The "Self-Improvement" Phase

The AI looks at all 16 attempts. It sees which ones got the highest scores from the critic. It then learns: "Ah! When I put the butterfly on the left and describe the candle first, I get a high score. When I mix them up, I get a low score."

It uses a method called GRPO (Group Relative Policy Optimization). Imagine a group of students taking a test. Instead of grading each student against a fixed answer key, the teacher compares the students to one another. The AI learns the same way: it reinforces the strategies that scored better than the rest of the group, rather than chasing an absolute perfect score. A minimal sketch of this group-relative grading follows.
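Normalizing each reward against the group's mean and spread is the standard GRPO recipe; the surrounding policy-gradient update is omitted here.

```python
# Group-relative advantages: each attempt is graded relative to its
# own group rather than against an absolute target score.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one critic score per sampled plan."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.2, 0.9, 0.4, 0.4])  # toy group of 4 attempts
advantages = group_relative_advantages(rewards)
# The 0.9 attempt gets a strongly positive advantage (reinforced);
# below-average attempts get negative advantages (discouraged).
```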

Why is this a big deal?

  • From Scripted to Creative: Before, the AI was like an actor reading a script word-for-word. Now, it's like a method actor who understands the intent of the scene and figures out the best way to act it out on its own.
  • The "Visual" Advantage: One cool trick the researchers used is that the AI critic is bad at reading numbers (like "x=100, y=200"). So, they turned the numbers into visual boxes on a blank canvas. It's like showing the critic a drawing of where the objects should be, rather than a list of numbers. The critic can then "see" if the plan makes sense spatially.

The Result

The paper shows that GoT-R1 is much better at following complex instructions. If you ask for "a cat on a dog's head, next to a tree," the old AI might put the cat next to the dog. GoT-R1 understands the spatial relationships and the attributes (colors, shapes) much better because it learned to think before it draws, and it learned how to think by being graded on both its thinking and its drawing.

In short: GoT-R1 teaches AI to stop guessing and start planning, using a smart critic to grade both the plan and the final artwork, resulting in images that actually look like what you asked for.
