Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

This paper introduces Diffusion Blend, a novel inference-time framework that enables diffusion models to dynamically align with any user-specified linear combination of multiple reward functions and regularization constraints. By blending the backward diffusion processes of fine-tuned models at generation time, it eliminates the need for repeated fine-tuning while outperforming existing baselines.

Min Cheng, Fatemeh Doudi, Dileep Kalathil, Mohammad Ghavamzadeh, Panganamala R. Kumar

Published 2026-03-13

Imagine you have a master chef (the Diffusion Model) who is incredibly talented at cooking any dish you ask for, from a simple sandwich to a complex soufflé. This chef was trained on millions of recipes and knows how to cook everything generally well.

However, sometimes you want something specific:

  • "Make this sandwich look artistic and beautiful."
  • "Make sure this sandwich looks exactly like the photo I sent you."
  • "Make it healthy (low calories) but still tasty."

The problem is that the chef usually has to pick one style. If you train the chef to be a "Beauty Expert," they might forget how to follow your photo instructions. If you train them to be a "Photo-Follower," the food might look ugly.

Traditionally, if you wanted a different mix (e.g., 50% beauty, 50% accuracy), you'd have to send the chef back to culinary school for weeks to learn that specific combination. If you wanted a new mix tomorrow, you'd have to send them back to school again. This is slow, expensive, and impractical.

The Solution: "Diffusion Blend"

This paper introduces a clever new trick called Diffusion Blend. Instead of sending the chef back to school, they create a "Mixing Station" at the moment you order the food (inference time).

Here is how it works, using a few analogies:

1. The "Specialist Chefs" (The Training Phase)

Before you ever order, the researchers train a few "Specialist Chefs" once and for all:

  • Chef A is an expert at making food look Beautiful (Aesthetics).
  • Chef B is an expert at making food look Exactly Like the Photo (Text-Image Alignment).
  • Chef C is an expert at making food Healthy (Human Preference).

These chefs are "fine-tuned" models. They are ready to go.

2. The "Magic Mixer" (The Inference Phase)

Now, you walk up to the counter and say: "I want a sandwich that is 70% Beautiful and 30% Photo-Accurate."

In the old days, the chef would have to stop, re-learn, and start over.
With Diffusion Blend, the system instantly takes the "thought process" (the mathematical recipe) of Chef A and Chef B and blends them together in real-time.

  • It doesn't just average their final pictures.
  • It blends their step-by-step cooking instructions as the food is being created.
  • It's like having two chefs whispering instructions simultaneously, with the system listening to each in exactly the ratio you asked for (70% Chef A, 30% Chef B).

The Result: You get a sandwich that is perfectly balanced between beauty and accuracy, created instantly, without the chef ever needing to go back to school.
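In diffusion-model terms, the "whispered instructions" are the noise predictions each fine-tuned model makes at every denoising step, and the blend is a weighted average of those predictions. Here is a minimal sketch of that idea. The `predict_noise` interface, the model objects, and the simplified update rule are all illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def blended_denoise_step(x_t, t, models, weights):
    """One backward-diffusion step using a convex blend of the
    per-model noise predictions (the 'whispered instructions')."""
    # Each fine-tuned specialist proposes its own noise estimate
    # for the current noisy image x_t at timestep t.
    eps = [m.predict_noise(x_t, t) for m in models]
    # Blend the predictions in the user-specified ratio, e.g. 70% / 30%.
    eps_blend = sum(w * e for w, e in zip(weights, eps))
    # Plug the blended estimate into a (deliberately simplified)
    # denoising update; alpha_t is a placeholder schedule value.
    alpha_t = 0.99
    return (x_t - (1 - alpha_t) * eps_blend) / np.sqrt(alpha_t)
```

The key design point is that the blending happens inside every step of the backward process, not on the finished images, so the specialists steer the same trajectory together rather than producing separate results that get averaged afterwards.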

The Three "Magic Tools"

The paper proposes three specific ways to use this mixer:

  1. DB-MPA (The Multi-Flavor Mixer):
    This is the main tool. It lets you mix any number of specialists. Want 40% Beauty, 40% Accuracy, and 20% Health? Just dial it in. The system blends the "whispers" of all three chefs instantly.

  2. DB-KLA (The "Strictness" Dial):
    Sometimes, you want the food to be creative, but you don't want the chef to get too wild and forget the original recipe entirely. This tool lets you control how much the chef can "drift" from their original training.

    • Low Dial: The chef stays very close to their original, safe style.
    • High Dial: The chef is allowed to be very creative and bold.

    You can turn this knob up or down instantly without retraining.

  3. DB-MPA-LS (The "Lightweight" Mixer):
    Blending three chefs at once can be computationally heavy (like trying to listen to three people talking at the same time). This version is a smart shortcut. Instead of listening to all chefs at every single step, it randomly picks one chef to listen to at each step, based on your percentages.

    • Analogy: Instead of a choir singing together, it's like a conductor pointing to different singers one by one so fast that your ear hears a perfect blend.
    • Benefit: It runs just as fast as the original chef, but still gives you the blended result.
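The "conductor pointing at one singer per step" idea can be sketched as a simple sampling routine: instead of calling every model at every step, pick one model index per denoising step with probability proportional to the user's weights. Function and parameter names here are illustrative assumptions, not the paper's implementation:

```python
import random

def lightweight_blend_schedule(weights, num_steps, seed=0):
    """Pick one specialist model index per denoising step, with
    probability proportional to the user-specified weights."""
    rng = random.Random(seed)
    indices = list(range(len(weights)))
    # Over many steps the mix of chosen specialists matches the
    # requested ratios, while each step only pays the cost of a
    # single model call -- same speed as running one chef alone.
    return [rng.choices(indices, weights=weights)[0]
            for _ in range(num_steps)]
```

For example, `lightweight_blend_schedule([0.4, 0.4, 0.2], 50)` yields 50 indices drawn from {0, 1, 2}, so across a full denoising run the third specialist is consulted roughly 20% of the time.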

Why is this a Big Deal?

  • No More Waiting: You don't need to wait days for a new model to be trained for your specific taste.
  • Infinite Customization: You can tweak your preferences on the fly. "Make it a bit more blue," "Make it less realistic," "Make it more artistic."
  • Solves Conflicts: It handles situations where goals fight each other (e.g., "Make it look like a photo" vs. "Make it look like a painting") much better than previous methods, finding the perfect middle ground.

In summary: Diffusion Blend turns the rigid, "one-size-fits-all" AI image generator into a flexible, user-controlled tool. It allows you to be the director, mixing and matching different "expert" styles instantly to get exactly the image you imagine, right now.
