Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction

Imagine you are trying to recreate a beautiful, complex painting, but you only have a few brushstrokes to do it.

In the world of AI art (specifically Diffusion Models), the computer starts with a canvas full of random static (like TV snow) and slowly "denoises" it step-by-step until a clear image appears. The problem is that to get a perfect picture, the computer usually needs to take hundreds of tiny steps. This is slow and expensive, like trying to walk across a room by taking one-inch steps.

To fix this, researchers have tried to teach the computer to take bigger, smarter steps, so that fewer of them are needed. The step count is usually reported as "NFEs," the number of function evaluations. However, existing methods are like rigid rulebooks: they force the computer to take steps in a specific way (e.g., "always look at the noise," or "always look at the data"). If the rulebook doesn't match the specific painting style, the result looks blurry or weird.

Enter Dual-Solver, the new "smart navigator" introduced in this paper.

The Core Idea: The "Swiss Army Knife" Step

Think of the old methods as a hammer. It's great for nails, but terrible for screws. Dual-Solver is a Swiss Army Knife. It doesn't just have one way to move; it has a set of adjustable tools that change depending on the situation.

The paper introduces three "knobs" (learnable parameters) that the AI learns to turn automatically:

The "Prediction" Knob ( $\gamma$ ):
- The Problem: Sometimes the AI should guess what the "noise" looks like, sometimes what the "final image" looks like, and sometimes how fast the image is changing (velocity). Old solvers had to pick one and stick with it.
- The Dual-Solver Fix: This knob lets the AI smoothly blend between these three guesses. It's like a chef who doesn't just use salt or sugar, but knows exactly how much of each to mix for the perfect flavor at every moment.
The "Map" Knob ( $\tau$ ):
- The Problem: Imagine trying to walk across a field. Sometimes walking in a straight line (linear) is best. Other times, walking in a spiral or following a winding path (logarithmic) gets you there faster. Old solvers were stuck on one type of map.
- The Dual-Solver Fix: This knob changes the "geometry" of the path. It allows the AI to switch between a straight road and a winding trail, choosing the most efficient route for that specific step.
The "Correction" Knob ( $\kappa$ ):
- The Problem: Even with a good map, you might still take a wrong turn. You need a way to fix small errors without starting over.
- The Dual-Solver Fix: This knob adds a tiny "safety net" or a fine-tuning adjustment to the step. It's like a tightrope walker using a balancing pole to make micro-adjustments so they don't fall, ensuring the step stays accurate even when taken quickly.

How Does It Learn? (The "Teacher" vs. The "Judge")

Usually, to teach a student (the solver) to walk faster, you show them a video of a master walker (a high-quality, slow solver) and say, "Copy my steps exactly." This is called Regression.

The Issue: This is hard. The student gets confused trying to mimic the exact path, especially when they are only allowed to take 3 or 5 steps.

Dual-Solver uses a clever trick called Classification.

The Analogy: Instead of asking the student to copy the master's steps, we give them a Judge (a pre-trained image classifier, like a robot that knows what a "cat" looks like).
The AI takes a few steps, generates an image, and asks the Judge: "Does this look like a cat?"
If the Judge says "No," the AI knows it went off-track and adjusts its knobs (the Swiss Army Knife tools) to try again.
Why it's better: The AI doesn't need to memorize the exact path of a master. It just needs to learn to stay on the "right side of the line" where the Judge says, "Yes, that's a cat!" This allows it to find its own unique, efficient path to a high-quality image.

The Results: Fast and Furious

The researchers tested this on various AI art models (like DiT, SANA, and PixArt).

The Old Way: To get a good picture, you might need 20–50 steps.
Dual-Solver: Can get a picture that is just as good (or better) in only 3 to 9 steps.

Summary

Dual-Solver is like upgrading from a car with a fixed gear ratio to one with a continuously variable transmission (CVT) and a GPS that learns from a traffic judge.

It doesn't force the AI to follow a rigid path.
It lets the AI adjust its strategy (prediction type), its map (integration domain), and its corrections (residuals) on the fly.
It learns by asking "Is this a good image?" rather than "Did you copy my steps?"

The result? You get high-quality AI art in a fraction of the time, making it much faster and cheaper to generate images.

1. Problem Statement

Diffusion models have achieved state-of-the-art image quality but suffer from high inference costs due to the large number of function evaluations (NFEs) required for sampling. While Ordinary Differential Equation (ODE) solvers have been adopted to accelerate sampling, existing methods face several limitations:

Rigid Prediction Types: Classical solvers typically commit to a single prediction type (noise, velocity, or data) during training and sampling. However, in discrete time, these prediction types yield different update steps, leading to suboptimal trajectories.
Fixed Integration Domains: Solvers often use fixed integration domains (e.g., linear or logarithmic time steps), which may not be optimal for all stages of the sampling process.
Training Overhead: "Learned solvers" (which optimize solver parameters) usually rely on regression-based objectives. These require a "teacher" solver running at high NFEs to generate target trajectories or samples, incurring massive computational overhead and often failing to generalize well in the very low-NFE regime (e.g., $NFE \le 5$ ).
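For context on the three prediction types named above, the following is a minimal sketch of how they relate in continuous time. It assumes a forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and, for the velocity, a linear schedule $\alpha_t = 1 - t$, $\sigma_t = t$. Both are assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
def data_from_noise(x_t, eps, alpha, sigma):
    """Recover the data prediction x0 from a noise prediction eps."""
    return (x_t - sigma * eps) / alpha


def noise_from_data(x_t, x0, alpha, sigma):
    """Recover the noise prediction eps from a data prediction x0."""
    return (x_t - alpha * x0) / sigma


def velocity_from_pair(x0, eps):
    """Velocity d x_t / dt under the ASSUMED linear schedule alpha=1-t, sigma=t."""
    return eps - x0


# Toy scalar example with a known clean value and a known noise draw.
t = 0.25
alpha, sigma = 1.0 - t, t
x0_true, eps_true = 2.0, -1.0
x_t = alpha * x0_true + sigma * eps_true

x0_hat = data_from_noise(x_t, eps_true, alpha, sigma)
eps_hat = noise_from_data(x_t, x0_true, alpha, sigma)
v_hat = velocity_from_pair(x0_hat, eps_hat)
```

Given $x_t$ and the schedule, any one of the three quantities determines the other two exactly, so they are interchangeable in continuous time. The limitation described above arises only once a solver discretizes the integral around one of them.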

2. Methodology: Dual-Solver

The authors propose Dual-Solver, a generalized predictor-corrector sampler that unifies and extends existing ODE solvers through three types of learnable parameters. It retains a standard predictor-corrector structure while preserving second-order local accuracy.

A. Generalized Integral Formulation

Dual-Solver introduces a unified integral formulation parameterized by three learnable variables per step:

Prediction Interpolation ( $\gamma$ ):
- Instead of choosing between noise ( $\epsilon_\theta$ ), velocity ( $v_\theta$ ), or data ( $x_\theta$ ) predictions, Dual-Solver uses a parameter $\gamma$ to continuously interpolate between them.
- $\gamma = -1$ corresponds to noise prediction, $\gamma = 0$ to velocity prediction, and $\gamma = 1$ to data prediction.
- This allows the solver to dynamically select the optimal prediction type for each step.
Domain Change ( $\tau$ ):
- The solver applies a log-linear transformation $L(y; \tau) = \frac{\log(1+\tau y)}{\tau}$ to the integration domain.
- As $\tau \to 0$ , it reduces to the linear (identity) transform $L(y; \tau) \to y$ ; at $\tau = 1$ , it becomes the logarithmic transform $\log(1+y)$ .
- This allows the solver to learn the optimal integration domain (linear vs. log) to minimize discretization errors.
Residual Adjustment ( $\kappa$ ):
- A parameter $\kappa$ is introduced to adjust the residual term in the Taylor expansion approximation.
- It adds flexibility to the residual term ( $O((\Delta t)^2)$ ) while maintaining second-order local accuracy, allowing the solver to correct for higher-order errors without increasing the order of the method.
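The two limits of the domain-change transform $L(y; \tau)$ can be checked numerically. The sketch below implements only the formula stated above. The function names, the small-$\tau$ cutoff, and the inverse map are illustrative additions, not part of the paper.

```python
import math


def log_linear(y, tau):
    """L(y; tau) = log(1 + tau * y) / tau, using the limit L = y as tau -> 0.

    Requires 1 + tau * y > 0.
    """
    if abs(tau) < 1e-8:  # illustrative cutoff to avoid dividing by ~0
        return y
    return math.log1p(tau * y) / tau


def log_linear_inverse(z, tau):
    """Inverse map: y = (exp(tau * z) - 1) / tau, with y = z as tau -> 0."""
    if abs(tau) < 1e-8:
        return z
    return math.expm1(tau * z) / tau


y = 2.0
near_linear = log_linear(y, 1e-6)   # close to y itself
logarithmic = log_linear(y, 1.0)    # equals log(1 + y)
round_trip = log_linear_inverse(log_linear(y, 0.5), 0.5)
```

Intermediate values of $\tau$ give a domain between the two extremes, which is what makes the choice learnable per step rather than fixed in advance.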

B. Sampling Scheme

Dual-Solver employs a Predictor-Corrector scheme:

Predictor: A first-order step using the current state and model evaluations.
Corrector: A second-order step that refines the prediction using fresh model evaluations at the new time step.
The parameters ( $\gamma, \tau, \kappa$ ) are learned separately for the predictor and the corrector at each step.
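The predictor-corrector pattern itself can be illustrated on a toy ODE. The sketch below pairs a first-order Euler predictor with a second-order trapezoidal corrector (Heun's method). It is a generic textbook instance of the pattern, not Dual-Solver's actual update rule, and it has no learnable parameters. The test problem $dx/dt = -x$ is an assumption chosen because its exact solution is known.

```python
import math


def predictor_corrector_step(f, t, x, h):
    """One step: Euler predictor, then trapezoidal corrector."""
    f_now = f(t, x)
    x_pred = x + h * f_now                    # first-order predictor
    f_new = f(t + h, x_pred)                  # fresh evaluation at the new time
    x_corr = x + 0.5 * h * (f_now + f_new)    # second-order corrector
    return x_pred, x_corr


def integrate(f, x0, t0, t1, n_steps, use_corrector=True):
    """Integrate from t0 to t1 in n_steps equal steps."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        x_pred, x_corr = predictor_corrector_step(f, t, x, h)
        x = x_corr if use_corrector else x_pred
        t += h
    return x


def decay(t, x):
    """Toy ODE dx/dt = -x, whose exact solution is x(t) = exp(-t)."""
    return -x


exact = math.exp(-1.0)
err_predictor_only = abs(integrate(decay, 1.0, 0.0, 1.0, 5, use_corrector=False) - exact)
err_with_corrector = abs(integrate(decay, 1.0, 0.0, 1.0, 5) - exact)
```

With the same five steps, the corrected trajectory lands much closer to the exact value than the predictor alone, which is the benefit the corrector stage is meant to provide when the step budget is small.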

C. Classification-Based Parameter Learning

A key innovation is the learning strategy. Instead of regression against high-NFE teacher trajectories (which is expensive and error-prone at low NFEs), Dual-Solver uses Hard-Label Classification:

Mechanism: The solver generates a sample $x_0$ . This sample is passed through a frozen, pretrained classifier (e.g., MobileNet for ImageNet, CLIP for text-to-image).
Objective: The solver parameters are optimized to minimize the cross-entropy loss between the classifier's predicted class probabilities and the ground-truth class label (or text prompt).
Advantage: This approach does not require generating high-quality teacher samples. It focuses on ensuring the generated sample lies on the correct side of the classifier's decision boundary, which correlates strongly with visual fidelity.
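The objective described above can be sketched with a toy setup. Everything here is a hypothetical stand-in: a one-parameter "sampler" replaces the few-step solver, a fixed linear map replaces the frozen pretrained classifier, and finite-difference gradient descent replaces backpropagation through the sampler. Only the loss, hard-label cross-entropy on the classifier's logits, follows the description.

```python
import math


def cross_entropy(logits, label):
    """Hard-label cross-entropy: -log softmax(logits)[label], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def toy_sampler(theta):
    """Hypothetical stand-in for a few-step solver: one knob -> a 2-D 'sample'."""
    return [theta, 1.0 - theta]


def frozen_classifier(x):
    """Hypothetical frozen classifier: fixed linear logits, never updated."""
    return [2.0 * x[0], 2.0 * x[1]]


def loss(theta, label=0):
    """Cross-entropy of the classifier's output on the generated sample."""
    return cross_entropy(frozen_classifier(toy_sampler(theta)), label)


theta = 0.2
initial_loss = loss(theta)
for _ in range(50):
    step = 1e-4
    # Central finite difference; a real setup would backpropagate instead.
    grad = (loss(theta + step) - loss(theta - step)) / (2 * step)
    theta -= 0.5 * grad
final_loss = loss(theta)
```

Only the sampler's parameter is updated while the classifier stays fixed. No teacher trajectory appears anywhere in the loop: the only supervision is the target label.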

3. Key Contributions

Unified Solver Framework: Dual-Solver generalizes multistep samplers by learning to interpolate between prediction types, integration domains, and residual terms, effectively covering the space of existing solvers (like DPM-Solver++) as special cases.
Efficient Learning Strategy: The introduction of hard-label classification for solver parameter learning eliminates the need for expensive teacher trajectories, making it feasible to train high-performance solvers for low-NFE regimes.
Second-Order Accuracy: Despite the learnable parameters, the authors prove that the method retains second-order local accuracy.
Broad Applicability: The method is validated across diverse backbones, including Diffusion Transformers (DiT), Flow Matching models (GM-DiT), and text-to-image models (SANA, PixArt-α).

4. Experimental Results

The authors evaluated Dual-Solver on ImageNet (DiT, GM-DiT) and text-to-image tasks (SANA, PixArt-α) using FID (Fréchet Inception Distance) and CLIP scores.

Low-NFE Performance: In the critical low-NFE regime ( $3 \le NFE \le 9$ ), Dual-Solver consistently outperforms state-of-the-art baselines (DDIM, DPM-Solver++, BNS-Solver, DS-Solver).
- ImageNet (DiT): At $NFE=5$ , Dual-Solver achieves an FID of 3.52, significantly better than DPM-Solver++ (22.19) and BNS-Solver (14.53).
- Text-to-Image (SANA): At $NFE=3$ , Dual-Solver achieves an FID of 21.79, outperforming all baselines which range from 45.05 to 48.65.
Ablation Studies:
- Predictor-Corrector: The combination of a first-order predictor and second-order corrector ( $p1c2$ ) yielded the best results.
- Learnability: Allowing all parameters ( $\gamma, \tau, \kappa$ ) to be learned yielded superior performance compared to fixing them to specific values (e.g., fixing $\gamma$ to 1 for data prediction).
- Classifier Selection: The study revealed a "V-shaped" relationship between classifier accuracy and FID (lower is better). Moderately accurate classifiers (e.g., MobileNetV3) often yielded better FID scores than highly accurate ones (e.g., ViT-H), suggesting that overly strict decision boundaries can hinder generative diversity.
Interpolation: The learned parameters for a specific NFE can be linearly interpolated to work effectively for intermediate NFEs, demonstrating the robustness of the learned trajectory.

5. Significance

Dual-Solver represents a significant step forward in efficient diffusion sampling. By treating the solver itself as a learnable component with a flexible, unified mathematical structure, it overcomes the rigidity of classical numerical methods. The shift from regression-based to classification-based learning is particularly impactful, as it drastically reduces the computational cost of training specialized solvers while delivering superior sample quality in the low-NFE regime where inference speed is most critical. This makes high-fidelity generative modeling more accessible for real-time applications.
