FireRed-Image-Edit-1.0 Technical Report

Imagine you have a magical photo editor that doesn't just use filters, but actually understands your requests like a human artist. You can say, "Make the dog wear a tuxedo and turn the background into a snowy mountain," and it does exactly that without messing up the rest of the picture.

This paper introduces FireRed-Image-Edit, a new AI model designed to be the ultimate "digital art assistant." The team behind it (from Xiaohongshu) didn't just throw more computing power at the problem; they built a smarter, more efficient system.

Here is the story of how they built it, explained in everyday terms:

1. The Ingredients: Cooking a 100-Million-Recipe Feast

To teach an AI how to edit photos, you need a massive library of examples.

The Problem: Most AI models are trained on messy data, like a library where books are torn, pages are missing, or the stories don't make sense.
The FireRed Solution: They gathered 1.6 billion raw image examples (like buying a massive warehouse of ingredients). But instead of cooking with everything, they acted like a super-chef.
- They washed the ingredients (removed duplicates and bad photos).
- They chopped them perfectly (labeled them with precise instructions).
- They selected only the top 100 million high-quality recipes.
- The Secret Sauce: They made sure the library was balanced. They didn't just teach the AI how to create pictures from scratch; they taught it how to change existing pictures, ensuring it knows when to add, remove, or swap things without ruining the original vibe.

2. The Classroom: Teaching the AI to Listen

Once they had the data, they had to teach the model. They used a three-step training process, like a student going from elementary school to a PhD.

Step 1: Pre-training (The General Knowledge Phase): The AI studies an enormous collection of images and their descriptions to understand the visual world. It learns what a "cat" looks like, what "sunset" means, and how light works.
Step 2: Fine-Tuning (The Specialized Internship): Now, the AI learns specific tasks. It practices following instructions like "make the sky blue" or "remove the trash can." They used a clever trick called "Stochastic Instruction Alignment."
- Analogy: Imagine a teacher handing a student three reference photos labelled "Photo 1," "Photo 2," and "Photo 3," along with instructions that mention them by number. Then the teacher shuffles the photos and renumbers the instructions to match. The student must still produce the same result. This forces the AI to tie each instruction to what a reference image actually shows, not to the slot it happens to occupy.
Step 3: Reinforcement Learning (The Critic's Review): This is where the AI gets a "taste test." They show the AI two versions of an edited photo: one good, one bad. The AI learns to prefer the good one.
- The "Anti-Hack" Trick: Sometimes, AI tries to cheat. For example, if asked to write text on a sign, it might write giant, blurry letters just to trick the system into thinking it did the job. The team invented a "Layout-Aware Reward" to catch this. It's like a strict editor who checks not just what is written, but where it is written and if it fits the picture.

3. The Safety Net: Keeping the Face (and Identity) Intact

One of the hardest things in photo editing is changing a person's clothes without changing their face.

The Problem: Old AI models often turned a specific person into a generic "person" or gave them a weird, plastic look.
The FireRed Solution: They added a "Consistency Loss" (a safety net).
- Analogy: Imagine you are painting a portrait. You are allowed to change the hat and the jacket, but you must keep the face exactly the same. The AI has a "security guard" that constantly checks: "Is this still the same person?" If the AI starts drifting, the guard pulls it back. This ensures that when you change a model's outfit, it's still that model.

4. The Exam: REDEdit-Bench

How do you know if the AI is actually good? You can't just ask, "Did you do it?" You need a real test.

The team built a new exam called REDEdit-Bench.
It has 1,673 real-world challenges, from "make this old photo look new" to "put this specific text on this poster."
It tests the AI on things that matter to real people: Did it follow the instructions? Did it keep the background safe? Does it look realistic?
The Result: FireRed-Image-Edit scored highest among open-source models and held its own against expensive, closed-source commercial giants.

5. Why This Matters

Usually, to get a better AI, companies just build bigger, heavier models that cost millions of dollars to run.

FireRed's Philosophy: Instead of building a bigger truck, they built a smarter engine. By cleaning the data better and teaching the model more efficiently, they achieved top-tier results without needing a supercomputer the size of a house.

In a nutshell: FireRed-Image-Edit is a highly trained digital artist that has been fed a massive, perfectly organized library of examples, taught to listen carefully to instructions, and trained to never lose the identity of the subject. It's a tool that brings professional-grade photo editing to everyone, powered by smart engineering rather than just brute force.

1. Problem Statement

The current landscape of instruction-based image editing faces two critical challenges:

The "Black Box" vs. "Brute Force" Dichotomy: Proprietary commercial models (e.g., Nano Banana Pro, Seedream) offer high fidelity but lack transparency and reproducibility. Conversely, open-source models often rely on massive parameter scaling (tens of billions of parameters), creating unsustainable computational burdens for training and deployment.
Data and Evaluation Gaps: There is a lack of systematically curated, high-quality datasets specifically for image editing that balance generation and editing tasks. Furthermore, existing benchmarks often fail to capture the nuances required for production-ready applications, relying on theoretical metrics rather than practical user experience (e.g., instruction alignment, identity preservation, and text fidelity).

2. Methodology

FireRed-Image-Edit is a Diffusion Transformer framework that achieves state-of-the-art (SOTA) performance through a holistic optimization of data curation, training efficiency, and reinforcement learning strategies.

A. Data Engineering (1.6B $\to$ 100M+ High-Quality Samples)

The team constructed a massive 1.6 billion sample corpus (900M Text-to-Image, 700M Image-to-Image) and distilled it into a balanced 100M+ training set via a rigorous pipeline:

Multi-Stage Filtering: Includes deduplication (global, pair-level, and multi-metric), photometric/statistical filtering, artifact removal, and AIGC detection to ensure data authenticity.
Data Production Engine: To address data scarcity in specialized tasks, they developed a forward construction pipeline using:
- Instructional Control: Synthesizing expert models via VLMs and edit-target lexicons.
- Structured Control: Using masks and pose keypoints (SAM, DWpose) for precise spatial control.
- Model-Free Templates: Using 3D parametric templates and algorithmic filters for deterministic edits.
Captioning Engine: Generates three types of instructions: Detailed (ground truth), Concise (brevity), and User-Like (natural language), ensuring the model generalizes from technical commands to colloquial requests.
Post-Filtering: A specialized multimodal evaluation model (based on Qwen3-VL) performs automated quality assessment and hard-negative mining to filter out semantic misalignments.
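The filtering stages above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `EditSample` schema, the `quality_score` field (standing in for the output of an automated judge), and the `min_quality` threshold are all assumptions made for illustration, and exact hashing is swapped in for the report's global and multi-metric deduplication.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """One candidate editing pair (hypothetical schema, not from the report)."""
    source_bytes: bytes   # encoded source image
    target_bytes: bytes   # encoded edited image
    instruction: str
    quality_score: float  # ASSUMED score from an automated judge, in [0, 1]


def pair_key(sample: EditSample) -> str:
    """Exact-match fingerprint of the (source, target) pair."""
    source_hash = hashlib.sha256(sample.source_bytes).hexdigest()
    target_hash = hashlib.sha256(sample.target_bytes).hexdigest()
    return source_hash + target_hash


def filter_samples(samples, min_quality=0.8):
    """Drop no-op edits, low-scoring samples, and pair-level duplicates."""
    seen = set()
    kept = []
    for sample in samples:
        if sample.source_bytes == sample.target_bytes:
            continue  # target identical to source: nothing was edited
        if sample.quality_score < min_quality:
            continue  # rejected by the (assumed) quality judge
        key = pair_key(sample)
        if key in seen:
            continue  # pair-level duplicate of a sample already kept
        seen.add(key)
        kept.append(sample)
    return kept
```

A production pipeline would replace exact hashing with perceptual or embedding-based near-duplicate detection; the sketch only shows how the stages compose.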

B. Model Architecture & Training Efficiency

Architecture: Built on a Double-Stream Multi-Modal Diffusion Transformer (MM-DiT). It unifies text embeddings, VAE latent tokens, and reference image features into a single token sequence, utilizing 3D Unified RoPE to handle variable input counts and resolutions.
Efficiency Optimizations:
- Multi-Condition Aware Bucket Sampler: Groups batches by aspect ratio and input image count ($N$) to minimize padding and computational waste.
- Stochastic Instruction Alignment: Randomly permutes or drops reference images during collation and dynamically re-indexes text prompts (e.g., "Fig 1" $\to$ "Fig 2") to force the model to decouple spatial order from content.
- System-Level: Uses FSDP/HSDP, gradient checkpointing, and pre-computed VLM embeddings to maximize throughput.
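The first two optimizations above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the sample dictionaries with `aspect_ratio` and `references` keys, the function names, and the literal "Fig k" prompt convention are hypothetical, and only permutation (not dropping) of reference images is shown.

```python
import random
import re
from collections import defaultdict


def bucket_by_condition(samples, batch_size):
    """Group samples so each batch shares aspect ratio and reference count N."""
    buckets = defaultdict(list)
    for sample in samples:
        key = (sample["aspect_ratio"], len(sample["references"]))
        buckets[key].append(sample)
    batches = []
    for group in buckets.values():
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


def shuffle_references(references, prompt, rng):
    """Permute reference images and rewrite "Fig k" mentions to match."""
    order = list(range(len(references)))
    rng.shuffle(order)
    # order[new_pos] == old_pos; invert it to map old 1-based index -> new one.
    new_index = {old + 1: new + 1 for new, old in enumerate(order)}
    shuffled = [references[old] for old in order]

    def renumber(match):
        old = int(match.group(1))
        return f"Fig {new_index.get(old, old)}"

    # A single regex pass avoids chained replacements clobbering each other.
    return shuffled, re.sub(r"Fig (\d+)", renumber, prompt)
```

Because every batch holds samples with the same aspect ratio and the same $N$, tensors in a batch have matching shapes and need no padding. After shuffling, each "Fig k" mention still points at the same image content, only under a different index.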

C. Multi-Stage Training Pipeline

Pre-training & Continued Pre-training (CT): Establishes a robust visual vocabulary and handles arbitrary resolutions using curriculum learning (progressive timestep sampling).
Supervised Fine-Tuning (SFT): Aligns the model with high-fidelity, instruction-following data using a smaller learning rate and EMA (Exponential Moving Average) for stability.
Reinforcement Learning (RLHF):
- Asymmetric Gradient Optimization (DPO): Introduces a weighting coefficient ($\omega > 1$) to prioritize Positive Sample Reinforcement (PSR), preventing the "double degradation" where the model loses capability on chosen samples while avoiding rejected ones.
- Diffusion NFT: An online RL method using a layout-aware OCR reward for text editing. It penalizes layout inconsistencies (e.g., oversized text) and uses semi-hard sample mining to focus on the model's capability boundary.
- Consistency Loss: A differentiable loss that preserves identity (especially for faces) by applying dynamic weighting based on noise levels, ensuring identity is locked during the early semantic formation stage.
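To make the asymmetric weighting concrete, here is a hedged Python sketch of a per-pair loss in the style of Diffusion-DPO, where each input is a denoising error (lower is better) for the trained policy or a frozen reference model. How $\omega$ enters the objective is an assumption: the summary only states that a coefficient $\omega > 1$ prioritizes the chosen sample, so the function name, the exact form, and the defaults below are illustrative rather than the report's formula.

```python
import math


def asymmetric_dpo_loss(err_chosen, err_chosen_ref, err_rejected,
                        err_rejected_ref, beta=1.0, omega=2.0):
    """Per-pair preference loss on denoising errors (lower error is better).

    ASSUMPTION: omega scales only the chosen-sample term; the exact form
    used in the report is not given in this summary.
    """
    chosen_gap = err_chosen - err_chosen_ref        # < 0: policy beats reference
    rejected_gap = err_rejected - err_rejected_ref  # > 0: policy moved away
    margin = -beta * (omega * chosen_gap - rejected_gap)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With `omega=1.0` this reduces to the symmetric objective, which can be lowered simply by getting worse on the rejected sample faster than on the chosen one. With `omega > 1`, degrading the chosen sample costs more than the same degradation on the rejected sample gains, which is one way to discourage the "double degradation" described above.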

3. Key Contributions

FireRed-Image-Edit Model: A highly efficient, open-source diffusion transformer that rivals proprietary systems without requiring massive parameter scaling.
Comprehensive Data Pipeline: A novel engine for generating and filtering 100M+ high-quality, balanced editing pairs, addressing long-tail data gaps through "check and fill" strategies.
Advanced Training Techniques:
- Asymmetric Gradient Optimization for stable DPO.
- Layout-Aware OCR Rewards for precise text editing.
- Stochastic Instruction Alignment for robust multi-reference handling.
REDEdit-Bench: A new, comprehensive benchmark featuring 1,673 bilingual (Chinese-English) edit pairs across 15 categories (including beautification and low-level enhancement). It introduces VLM Judge and OCR metrics to evaluate text fidelity and visual coherence more rigorously than previous benchmarks.
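The idea behind the layout-aware OCR reward listed among the training techniques can be sketched as follows. This is a toy illustration, not the authors' reward: the function names, the `(x0, y0, x1, y1)` box format, the use of `difflib` for text similarity, and the area-ratio penalty are all assumptions chosen to show how oversized text can be penalized even when the characters are correct.

```python
from difflib import SequenceMatcher


def box_area(box):
    """Area of an (x0, y0, x1, y1) box; zero for degenerate boxes."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def layout_aware_reward(ocr_text, target_text, ocr_box, target_box):
    """Text similarity scaled by how well the rendered box matches the target size."""
    text_score = SequenceMatcher(None, ocr_text, target_text).ratio()
    detected, expected = box_area(ocr_box), box_area(target_box)
    if detected == 0 or expected == 0:
        return 0.0  # no usable detection or target region
    size_ratio = detected / expected
    # Equals 1.0 only when the areas match; shrinks for oversized or tiny text.
    layout_score = min(size_ratio, 1.0 / size_ratio)
    return text_score * layout_score
```

Under this sketch, correctly spelled text rendered far larger than the intended region scores lower than the same text rendered at the intended size, so enlarging the letters to make OCR easier no longer pays off.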

4. Results

Extensive evaluations on REDEdit-Bench, ImgEdit, and GEdit demonstrate:

Performance: FireRed-Image-Edit achieves SOTA performance among open-source models and is competitive with top-tier proprietary systems (e.g., Nano Banana Pro, Seedream 4.5).
Human Evaluation: In blind human evaluations, it scored highest in Consistency Preservation (maintaining non-edited regions) and led in Prompt Following against most competitors.
Specific Capabilities:
- Text Editing: Superior character accuracy and layout preservation compared to baselines.
- Virtual Try-on: Produces more coherent garment geometry and better alignment with styling instructions.
- Creative Editing: Successfully handles complex structural changes and abstract concepts.

5. Significance

This work demonstrates that systematic engineering in data curation, training efficiency, and evaluation design can bridge the gap between open-source and proprietary models. By moving away from the "scale-only" paradigm, FireRed-Image-Edit shows that a well-optimized, smaller-scale model can achieve superior controllability, identity preservation, and instruction alignment. The release of the model, code, and REDEdit-Bench provides a crucial foundation for future research in controllable and high-fidelity image editing.
