DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

DeepEyes is a Large Vision-Language Model that uses reinforcement learning to natively learn "thinking with images" through active perception. By strategically grounding its reasoning in visual evidence, it achieves significant improvements on perception, reasoning, and grounding tasks without requiring any pre-collected reasoning data.

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu

Published 2026-03-03

DeepEyes: Teaching AI to "Look Before It Leaps"

Imagine you are taking a difficult math test, but instead of just reading the question, you are allowed to use a magnifying glass to inspect the diagram, zoom in on tiny numbers, or even rotate the paper to see it from a different angle. Most current AI models are like students who are forced to answer only by reading the text, even if the answer is hidden in a tiny detail of a picture. They often guess based on what they think the text should say, rather than what the image actually shows.

DeepEyes is a new AI model that learns to stop guessing and start looking. It learns to "think with images" by actively zooming in, cropping, and examining parts of a picture just like a human would.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Blind" Genius

Current AI models (Vision-Language Models) are incredibly smart at reading and talking. They can describe a picture perfectly. But when it comes to reasoning about the picture, they often act like a person reading a map in the dark. They rely on their memory of what maps usually look like rather than actually looking at the specific details in front of them.

  • The Analogy: Imagine a detective trying to solve a crime. A standard AI is like a detective who reads the police report and guesses who the culprit is based on the description. DeepEyes is the detective who puts on a magnifying glass, walks over to the window, and actually looks for the fingerprints.

2. The Solution: Active Perception (The "Magnifying Glass" Skill)

DeepEyes doesn't just stare at the whole image. It has a special ability called Active Perception.

  • How it works: As the AI thinks through a problem, it can decide, "Wait, I can't see the clock clearly from here. I need to zoom in on that corner of the image."
  • The Magic: It generates a "crop" (a zoomed-in picture) of that specific area, adds it to its memory, and continues thinking. It can do this multiple times, building a chain of thought that mixes text and zoomed-in images.
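The "crop, remember, keep thinking" loop above can be sketched in code. This is a minimal illustration, not the paper's actual implementation: the `model_step` interface, the `Crop` coordinates, and the 2D-grid "image" are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class Crop:
    """A hypothetical crop region: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: int
    y0: int
    x1: int
    y1: int

def crop_image(image, region):
    """Return the zoomed-in sub-image (here, a slice of a 2D grid)."""
    return [row[region.x0:region.x1] for row in image[region.y0:region.y1]]

def active_perception_loop(image, model_step, max_steps=8):
    """Alternate between text reasoning and tool-call crops.

    `model_step(context)` is a stand-in for the model's next action and
    returns one of: ("think", text), ("crop", Crop), ("answer", text).
    Each crop is appended to the context, so later reasoning steps can
    attend to the zoomed-in view alongside the original image.
    """
    context = [("image", image)]
    for _ in range(max_steps):
        kind, payload = model_step(context)
        if kind == "crop":
            # Zoom in and add the close-up to the running chain of thought.
            context.append(("image", crop_image(image, payload)))
        elif kind == "answer":
            return payload, context
        else:
            context.append(("text", payload))
    return None, context
```

A scripted policy makes the flow concrete: the model "thinks", requests one crop, then answers using the zoomed-in detail.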

3. How It Learned: The "Video Game" Method

Usually, to teach an AI to do something complex, humans have to write out thousands of examples of "Step 1: Look here, Step 2: Say this." This is slow and expensive.

DeepEyes learned differently. The researchers used Reinforcement Learning, which is like training a dog or a video game character:

  • No Cheat Sheets: They didn't give the AI a pre-written script.
  • The Reward System: They let the AI try to solve problems on its own.
    • If it guessed the answer without looking, it got a low score.
    • If it guessed correctly after zooming in and checking the details, it got a bonus reward.
  • The Result: Over time, the AI realized that "looking closer" was the key to winning. It learned to use its "magnifying glass" instinctively, just like a human learns to squint when reading small print.
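The reward scheme above can be sketched as a tiny function. The weights and the exact conditions here are illustrative assumptions, not the paper's precise values; the key idea it tries to capture is that the bonus for zooming in is only paid when the final answer is also correct, so "looking closer" is rewarded only when it actually serves the task.

```python
def deepeyes_style_reward(answer_correct: bool, num_crops: int,
                          tool_bonus: float = 0.5) -> float:
    """Hedged sketch of an outcome-based reward with a conditional tool bonus.

    answer_correct: did the final answer match the ground truth?
    num_crops: how many zoom-in tool calls the model made.
    tool_bonus: illustrative bonus weight (not the paper's value).
    """
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and num_crops > 0:
        # The bonus is conditional: guessing right without looking earns
        # less than checking the details and then answering correctly.
        reward += tool_bonus
    return reward
```

Under this shaping, a correct answer reached by zooming in outscores a correct blind guess, and a wrong answer earns nothing regardless of how many crops were made, which is exactly the incentive gradient described above.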

4. The Evolution: From Clumsy to Master

The paper describes a fascinating journey of how the AI learned:

  1. Stage 1 (The Clumsy Explorer): At first, the AI was confused. It would zoom in randomly, like a toddler pressing every button on a remote control. It didn't know where to look.
  2. Stage 2 (The Over-Enthusiast): Then, it realized zooming helped! But it went too far. It would zoom in on everything, asking for too many pictures, wasting time.
  3. Stage 3 (The Master Detective): Finally, it became efficient. It learned to look at the whole picture, realize exactly which tiny detail was missing, zoom in once on that specific spot, and solve the problem. It learned to be strategic.

5. Why This Matters

This isn't just about solving math problems or finding clocks in living rooms. It changes how AI interacts with the world:

  • Fewer Hallucinations: AI often "hallucinates" (makes things up) because it relies on text patterns. Because DeepEyes grounds its answers in visual evidence it has actually inspected, it makes things up far less often.
  • Better Reasoning: It can handle complex charts, tiny text in high-resolution photos, and tricky logic puzzles that require comparing different parts of an image.
  • Human-Like Thinking: It mimics how humans solve visual problems: we don't just stare; we scan, we focus, and we verify.

In a Nutshell

DeepEyes is an AI that learned to stop guessing and start investigating. Instead of being a passive reader of images, it became an active explorer, using a "magnifying glass" to find the truth hidden in the details. And the best part? It taught itself how to do this through trial and error, without needing humans to write out a manual for every single step.