Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

The paper introduces Agent Banana, a hierarchical agentic framework featuring Context Folding and Image Layer Decomposition to enable high-fidelity, multi-turn image editing at native 4K resolution, validated by the new HDD-Bench benchmark.

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu

Published 2026-02-24

Imagine you are a professional photo editor working on a massive, ultra-high-definition 4K poster for a movie. You need to make tiny, precise changes: "Change the blue bottle to red," then "Add a cat sitting on the table," then "Make the sky sunset-colored but keep the cat exactly where it is."

If you use a standard AI photo editor today, it's like hiring a very enthusiastic but clumsy intern. They might change the bottle to red, but in the process, they accidentally repaint the cat, blur the background, or shrink the whole image so the details look fuzzy. If you ask for a second change, they might forget what you asked for in the first step, or make the whole picture look "off" because they repaint the entire image from scratch every time.

Agent Banana is designed to solve this problem. It's a new AI system built to act like a master craftsman with a perfect memory and a set of surgical tools, rather than a clumsy intern.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Clumsy Intern" vs. The "Master"

Current AI editors suffer from three main issues:

  • Over-editing: They change things you didn't ask for (like changing the cat's fur color when you only wanted to change the bottle).
  • Memory Loss: They forget the history of your conversation. If you say "undo that," they might get confused about what "that" was.
  • Resolution Drop: To save time, they often shrink your 4K masterpiece down to a small thumbnail, edit it, and blow it back up. This ruins the fine details (like the texture of wood or fabric).
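
To see why the shrink-then-enlarge shortcut hurts, here is a minimal sketch on a single row of pixel brightness values. The block averaging and nearest-neighbour enlargement used here are assumptions chosen for simplicity, not the resampling any particular editor uses.

```python
from typing import List


def downsample(row: List[int], factor: int) -> List[int]:
    """Average each block of `factor` neighbouring pixels into one value.

    Assumes len(row) is a multiple of `factor`.
    """
    return [sum(row[i:i + factor]) // factor for i in range(0, len(row), factor)]


def upsample(row: List[int], factor: int) -> List[int]:
    """Nearest-neighbour enlargement: repeat every pixel `factor` times."""
    return [value for value in row for _ in range(factor)]


# One row of a fine black/white texture (think fabric weave).
original = [0, 255, 0, 255, 0, 255, 0, 255]

small = downsample(original, 2)   # the "thumbnail" the editor works on
restored = upsample(small, 2)     # blown back up to the original width

print(restored)  # [127, 127, 127, 127, 127, 127, 127, 127]
```

The alternating texture collapses into flat gray: once the detail is averaged away, enlarging the image cannot bring it back.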

2. The Solution: Agent Banana's "Brain" and "Hands"

Agent Banana splits the work into two distinct roles, like a Project Manager and a Specialized Technician.

  • The Planner (The Project Manager): This part of the AI reads your vague request ("Make the scene look more like a movie") and breaks it down into a logical checklist. It doesn't touch the image; it just makes the plan.
  • The Executor (The Technician): This part actually does the work. It picks the right tools and applies them.
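
The split between the two roles can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the names `EditStep`, `plan`, `execute`, and the verb-to-tool table are all hypothetical, and a real planner would be a language model rather than a keyword lookup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EditStep:
    tool: str          # hypothetical tool name chosen by the planner
    instruction: str   # the sub-task, in plain language


# Hypothetical verb-to-tool table standing in for the planner's reasoning.
TOOL_FOR_VERB = {"change": "recolor", "add": "insert_object", "make": "restyle"}


def plan(request: str) -> List[EditStep]:
    """Planner: turn a request into an ordered checklist. Touches no pixels."""
    steps = []
    for clause in request.split(", then "):
        clause = clause.strip()
        verb = clause.split()[0].lower()
        steps.append(EditStep(TOOL_FOR_VERB.get(verb, "generic_edit"), clause))
    return steps


def execute(steps: List[EditStep], tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Executor: run each step with the tool the plan names, in order."""
    return [tools[step.tool](step.instruction) for step in steps]


# Stub tools that only report what they would do (no real image editing).
tools = {
    name: (lambda text, name=name: f"{name}: {text}")
    for name in ("recolor", "insert_object", "restyle", "generic_edit")
}

steps = plan("Change the blue bottle to red, then add a cat sitting on the table")
log = execute(steps, tools)
for line in log:
    print(line)
```

Keeping the plan as plain data means it can be inspected or corrected before any pixel is changed.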

3. The Secret Sauce: Two Magic Tricks

Trick A: "Context Folding" (The Perfect Memory)

Imagine you are writing a story. If you keep writing the whole story from page 1 every time you add a new sentence, you'll run out of paper (or computer memory) very quickly.

  • How it works: Agent Banana doesn't re-read the whole history. Instead, it "folds" the past into a neat, structured summary. It remembers, "Okay, we changed the bottle to red, and the cat is now on the table," without needing to re-process the entire image history every time. This allows it to handle long, complex editing sessions without getting confused or forgetting earlier steps.
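
One way to picture folding is a memory that keeps only the last few turns word for word and compresses everything older into a small table of "object: current state" facts. The sketch below is an assumption about the general idea; the class `FoldedContext`, its fields, and the `keep_recent` parameter are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FoldedContext:
    keep_recent: int = 2
    # Folded history: object name -> latest known state.
    summary: Dict[str, str] = field(default_factory=dict)
    # Recent turns kept verbatim: (instruction, object, resulting state).
    recent: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_turn(self, instruction: str, obj: str, state: str) -> None:
        """Record a turn, folding the oldest turns once the window is full."""
        self.recent.append((instruction, obj, state))
        while len(self.recent) > self.keep_recent:
            _, old_obj, old_state = self.recent.pop(0)
            self.summary[old_obj] = old_state  # newer folds overwrite older ones

    def prompt(self) -> str:
        """Compact context: folded facts plus the recent turns verbatim."""
        folded = "; ".join(f"{o}: {s}" for o, s in self.summary.items())
        recent = " | ".join(instruction for instruction, _, _ in self.recent)
        return f"State so far: {folded or 'none'}. Recent turns: {recent}"


ctx = FoldedContext(keep_recent=2)
ctx.add_turn("Change the blue bottle to red", "bottle", "red")
ctx.add_turn("Add a cat sitting on the table", "cat", "on the table")
ctx.add_turn("Make the sky sunset-colored", "sky", "sunset-colored")

print(ctx.prompt())
```

After the third turn, the bottle edit no longer appears as a full instruction, but the fact that the bottle is red survives in the summary, so the context stays short as the session grows.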

Trick B: "Image Layer Decomposition" (The Surgical Scalpel)

This is the most important part for high-quality editing.

  • The Old Way: Imagine you want to change the color of a shirt on a person. The old AI takes the entire photo, blurs it slightly, paints the shirt, and puts the whole photo back together. The background gets slightly blurry every time you do this.
  • The Agent Banana Way: It uses a "surgical scalpel." It identifies only the shirt, cuts it out onto a separate transparent sheet (a "layer"), changes the color on that sheet, and then glues it back perfectly. The background, the person's face, and the table remain untouched and crisp.
  • The Result: You can edit a 4K image at its full, native resolution, and the pixels outside the edited layer stay exactly as they were.
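
The cut, edit, and glue steps can be sketched on a tiny image made of plain Python lists. This is a simplified illustration under stated assumptions: the mask here is a hard yes/no mask found by matching a color, whereas a real system would get it from a segmentation model, and the function names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]
Layer = List[List[Optional[Pixel]]]  # None marks a transparent pixel


def extract_layer(image: Image, mask: Mask) -> Layer:
    """Cut the masked object out onto its own transparent sheet."""
    return [[px if m else None for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def edit_layer(layer: Layer, fn: Callable[[Pixel], Pixel]) -> Layer:
    """Apply the edit only to the opaque pixels of the layer."""
    return [[fn(px) if px is not None else None for px in row] for row in layer]


def composite(base: Image, layer: Layer) -> Image:
    """Glue the layer back: layer pixels win, everything else is the base."""
    return [[lp if lp is not None else bp for bp, lp in zip(brow, lrow)]
            for brow, lrow in zip(base, layer)]


BLUE, RED, GRAY = (0, 0, 255), (255, 0, 0), (128, 128, 128)

image = [[GRAY, BLUE, GRAY],
         [GRAY, BLUE, GRAY]]

# Assumption: the "bottle" is found by color; stands in for a real segmenter.
mask = [[px == BLUE for px in row] for row in image]

layer = extract_layer(image, mask)
edited = composite(image, edit_layer(layer, lambda px: RED))

print(edited[0])  # [(128, 128, 128), (255, 0, 0), (128, 128, 128)]
```

Because the base image is never resized or regenerated, every pixel outside the mask in `edited` is the same value as in `image`.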

4. The New Test: HDD-Bench

To prove this works, the creators built a new test called HDD-Bench.

  • Think of previous tests as asking a student to solve a single math problem.
  • HDD-Bench is like giving a student a complex, multi-step engineering project where they have to build a bridge, then add a road, then paint it, all while keeping the foundation intact. It tests if the AI can handle long chains of commands without messing up the parts it wasn't supposed to touch.
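
The idea of "not messing up the parts it wasn't supposed to touch" can be turned into a number. The sketch below is a hypothetical check, not HDD-Bench's actual metric: it counts the fraction of pixels outside the target region that changed after an edit.

```python
from typing import List

Grid = List[List[int]]
Mask = List[List[bool]]


def changed_outside_mask(before: Grid, after: Grid, mask: Mask) -> float:
    """Fraction of pixels outside the target mask that differ after the edit.

    0.0 means the untouched region was perfectly preserved.
    """
    outside = 0
    changed = 0
    for brow, arow, mrow in zip(before, after, mask):
        for b, a, m in zip(brow, arow, mrow):
            if not m:
                outside += 1
                changed += int(a != b)
    return changed / outside if outside else 0.0


before = [[1, 2],
          [3, 4]]
mask = [[True, False],     # only the top-left pixel was meant to change
        [False, False]]

careful = [[9, 2],
           [3, 4]]         # edits the target and nothing else
clumsy = [[9, 2],
          [3, 5]]          # also alters a pixel it should have left alone

print(changed_outside_mask(before, careful, mask))  # 0.0
print(changed_outside_mask(before, clumsy, mask))
```

Running a check like this after every step of a long editing chain would reveal whether small unwanted changes build up over time.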

Why This Matters

In the real world, professional photographers and designers don't just want "good enough" edits. They need precision.

  • If you are editing a movie scene, you can't have the background drift or blur after 10 different edits.
  • If you are fixing a product photo, you can't have the texture of the product disappear.

Agent Banana aims to bridge the gap between "fun AI toys" and "professional tools." It lets you have a conversation with an AI, make complex changes step by step, and get back a result that looks like it was retouched by a human expert in Photoshop, without doing the manual work yourself.

In short: Agent Banana is the AI editor that finally understands that sometimes you just want to change the color of a bottle without accidentally turning the whole world into a painting.
