Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

The Big Picture: Fixing "Blurry" Medical Photos

Imagine you are a doctor trying to look at a patient's skull to plan surgery. You have two types of X-ray machines:

The "Quick & Dirty" Scanner (CBCT): It's fast and cheap, but the pictures often look like they were taken through a foggy window or have weird dark shadows (called "shade artifacts") that hide important details.
The "Gold Standard" Scanner (MDCT): It takes perfect, crystal-clear pictures, but it's expensive, slow, and exposes the patient to more radiation.

The Goal: The researchers wanted to build a "magic translator" that takes the blurry, shadowy Quick Scanner photos and turns them into Gold Standard-quality photos, without needing the Gold Standard machine.

The Problem: The "Overconfident Artist"

In the past, scientists used AI models (like GANs) to do this translation. Think of these models as overconfident artists.

They are great at guessing what the picture should look like.
But sometimes, they get too confident and start painting things that aren't there, or they leave the weird shadows from the original photo because they are "lazy" and just copy the bad parts.
If you ask them to fix a shadow, they might accidentally erase a real bone or create a fake tumor. Part of the reason is the "quality vs. speed vs. diversity" trilemma: these models struggle to be accurate, diverse, and fast all at once.

The Solution: A "Human-Guided GPS"

The authors created a new system that combines three powerful ideas to fix this. Here is how they did it, using a travel analogy:

1. The Schrödinger Bridge (The Perfect Road Map)

Instead of just guessing the destination, this method builds a perfect bridge between the "Blurry Start" (CBCT) and the "Clear End" (MDCT).

Old Way: Imagine trying to walk from your house to a friend's house in the dark, hoping you don't trip.
New Way: The Schrödinger Bridge is like a GPS with a live traffic feed. It knows exactly where you start and exactly where you need to end up. It calculates the smoothest, safest path in between, ensuring you don't wander off into a field (creating fake anatomy) or get stuck in a puddle (leaving artifacts).

2. The "Human Referee" (Binary Feedback)

This is the most creative part. Usually, AI needs a massive textbook of "Good vs. Bad" examples to learn. That's hard to get in medicine.

The Trick: The researchers didn't write a textbook. Instead, they acted like a sports referee.
They let the AI generate a few different versions of the photo. Then, a human expert looked at them and simply said, "Good" (thumbs up) or "Bad" (thumbs down).
The AI didn't need to know why it was bad; it just needed to know the score. This is like training a dog: you don't explain the rules of fetch; you just say "Good boy" when they bring the ball back.

3. The "Magic Remote Control" (Classifier-Free Guidance)

Once the AI knows what a "Good" photo looks like, how do we make it do it?

The researchers gave the AI a remote control with a volume knob.
If the knob is turned up high, the AI listens very closely to the "Good" signal and aggressively removes the shadows.
If the knob is low, it's more relaxed.
This allows the doctor to say, "Hey, make the shadows disappear, but don't change the shape of the bone!" The AI obeys instantly.

How It Works in Practice

The Setup: The AI starts with a blurry photo. It has a "rough draft" of what the clear photo might look like (generated by an older AI).
The Journey: The AI starts walking the "Bridge" from the blurry photo to the clear one.
The Check-in: At every step, the AI consults what it learned from the "Human Referee" (the binary feedback it was trained on): "Is this looking good?"
The Correction: If the path looks like it's drifting toward a shadow, the "Remote Control" steers it back toward the "Good" path.
The Result: In just 10 steps (which is incredibly fast for AI), the blurry photo becomes a sharp, shadow-free medical image.

Why This Matters (The "So What?")

Speed: Old methods took hundreds of steps to generate an image. This one takes 10. It's like going from a slow, winding dirt road to a high-speed bullet train.
Safety: Because it uses the "Bridge" method, it doesn't invent fake bones or erase real ones. It keeps the anatomy honest.
Human Control: You don't need a PhD in AI to use it. A doctor just gives a thumbs up or down, and the AI is fine-tuned on that feedback.
The "Negative" Test: The researchers even tested if they could tell the AI to add shadows on purpose (to simulate bad scans for training). The AI could do that too! This proves the AI truly understands the concept of "shadows" and isn't just memorizing pictures.

Summary Analogy

Imagine you are trying to restore an old, scratched-up painting.

Old AI: A painter who guesses what the painting looked like but often paints over the original face with a new one.
This New AI: A restoration team with a laser-guided brush. They have a map of the original painting (the Bridge). A master art critic stands next to them, pointing and saying "No, that's a scratch, fix it" or "Yes, that's the original color." The team moves incredibly fast (10 steps) and fixes the scratches without changing the face of the person in the painting.

This paper shows that by combining a smart mathematical map with simple human feedback, we can make medical imaging safer, faster, and more reliable.

1. Problem Statement

The conversion of Cone-Beam Computed Tomography (CBCT) to Multi-Detector Computed Tomography (MDCT) is critical for clinical applications but faces significant challenges:

Artifacts: CBCT images often suffer from severe "shade artifacts" (shading inconsistencies) and noise due to the physics of the acquisition, which degrade diagnostic quality.
GAN Limitations: While Generative Adversarial Networks (GANs) have been used for image translation, they suffer from the quality–sampling–diversity trilemma. They often exhibit mode collapse, leading to biased outputs, residual artifacts, and a lack of diversity. Fine-tuning GANs often results in overfitting to specific datasets, reducing generalizability.
Human Feedback Integration: Existing methods that incorporate human feedback typically rely on Reinforcement Learning from Human Feedback (RLHF), which requires training a separate, complex reward model. This adds computational overhead and architectural complexity, making it difficult to scale for clinical deployment.
Ground Truth Scarcity: Acquiring pixel-aligned, high-quality MDCT ground truth for CBCT inputs is often ethically and technically impossible, necessitating robust methods that can work with pseudo-targets and expert preference data.

2. Methodology

The authors propose a novel framework combining Schrödinger Bridge (SB) dynamics, Conditional Diffusion, and Human-Guided Feedback without a reward model.

A. Core Framework: Schrödinger Bridge (SB)

Instead of standard diffusion models that denoise from pure Gaussian noise, the authors utilize the Image-to-Image Schrödinger Bridge (I2SB) formulation.

Boundary Conditions: The process defines a stochastic path between two empirical distributions: the source CBCT image ($z_0$) and a pseudo-target MDCT-like image ($z_1$) generated by a pretrained unpaired CycleGAN.
Bidirectional Flow: The SB formulation constructs an optimal probabilistic path connecting $z_0$ and $z_1$, ensuring boundary consistency. This mitigates the "drift" often seen in standard diffusion and improves anatomical fidelity.
Closed-Form Intermediates: Intermediate latent states ($z_t$) are sampled from a closed-form Gaussian distribution conditioned on both endpoints, enabling stable and efficient generation.
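
The closed-form intermediate sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a constant noise rate `beta` (the paper's actual schedule is not given here) and uses the standard Gaussian bridge posterior, whose mean interpolates between the endpoints and whose variance vanishes at $t=0$ and $t=1$. The function name `bridge_sample` and its parameters are hypothetical.

```python
import numpy as np

def bridge_sample(z0, z1, t, beta=1.0, rng=None):
    """Draw z_t ~ q(z_t | z0, z1) for a Gaussian bridge on t in [0, 1].

    Assumption: constant noise rate `beta`, so the accumulated variance
    from 0 to t is beta * t and from t to 1 is beta * (1 - t).
    """
    rng = np.random.default_rng() if rng is None else rng
    var_fwd = beta * t            # variance accumulated from time 0 to t
    var_bwd = beta * (1.0 - t)    # variance accumulated from time t to 1
    total = var_fwd + var_bwd
    # The mean leans toward z0 near t=0 and toward z1 near t=1.
    mean = (var_bwd / total) * z0 + (var_fwd / total) * z1
    # The spread is largest mid-bridge and exactly zero at both endpoints.
    std = np.sqrt(var_fwd * var_bwd / total)
    return mean + std * rng.standard_normal(np.shape(z0))

# Toy usage: random arrays stand in for a CBCT slice and its pseudo-target.
rng = np.random.default_rng(0)
cbct_slice = rng.random((8, 8))
pseudo_target = rng.random((8, 8))
z_mid = bridge_sample(cbct_slice, pseudo_target, t=0.5, rng=rng)
```

Because the sample at either endpoint is deterministic, the path is pinned to the CBCT input at one end and the pseudo-target at the other, which is the boundary consistency the SB formulation relies on.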

B. Human-Guided Conditional Diffusion

To incorporate human preference without a reward model, the authors adapt Direct Preference Optimization (DPO) principles to the diffusion domain using Classifier-Free Guidance (CFG).

Binary Feedback Signal: Human experts label generated slices as "Good" ($r=0$) or "Bad" ($r=1$) based on artifact severity.
Conditioning Mechanism: The score network $s_\theta$ is conditioned on the binary feedback $r$ alongside the timestep $t$ and source image $z_0$.
- Architecture: A UNet takes concatenated embeddings of $z_t$, $z_0$, time $t$, and reward $r$.
- Embedding Strategy: The time embedding is added as a residual bias to all layers to preserve diffusion dynamics. The reward embedding is applied multiplicatively only to the decoder layers, allowing the model to modulate high-level semantic features (like artifact suppression) without disrupting low-level anatomical structures.
Sampling Strategy (CFG): The generation is guided by a linear combination of conditional and unconditional scores:
$\tilde{s}(z_t) = (1 + w) \cdot s_\theta(z_t | z_0, t, r) - w \cdot s_\theta(z_t | \emptyset)$
where $w$ is the guidance scale. This steers the generation toward the "Good" class.
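
The guidance rule above is a simple linear extrapolation, which the following sketch makes concrete. The arrays stand in for the outputs of the score network; `guided_score` and its argument names are hypothetical and only illustrate the arithmetic, not the authors' code.

```python
import numpy as np

def guided_score(score_cond, score_uncond, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * score_cond - w * score_uncond

# Toy usage with made-up score values (placeholders for network outputs).
s_cond = np.array([0.2, -0.1, 0.4])    # score conditioned on the feedback label r
s_uncond = np.array([0.1, 0.0, 0.1])   # score with the condition dropped
for w in (0.0, 1.0, 3.0):
    print(w, guided_score(s_cond, s_uncond, w))
```

With $w = 0$ the rule reduces to the plain conditional score. Larger $w$ moves the result further along the direction from the unconditional to the conditional score, which is why a high guidance scale suppresses artifacts more aggressively.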

C. Incremental Learning via Tournament Selection

To refine the model without a separate reward model, the authors employ an iterative feedback loop:

Exploration: Generate multiple candidate reconstructions for a "Bad" input using varying guidance scales ($w$) and model checkpoints.
Tournament Selection: Human evaluators perform pairwise comparisons (1-vs-1) in a tournament format to select the most clinically preferred image for each slice.
Augmentation: The winning images form a curated dataset ($Z_{pref}$) used to incrementally fine-tune the score network, progressively aligning the model with human preferences.
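
The selection step can be sketched as a small pairwise tournament. The summary does not specify the bracket format, so a single-elimination bracket with byes is assumed here; `run_tournament` and the `prefer` callback (which stands in for the human evaluator's 1-vs-1 choice) are hypothetical names.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def run_tournament(candidates: Sequence[T], prefer: Callable[[T, T], T]) -> T:
    """Return the overall winner of a single-elimination tournament.

    `prefer(a, b)` must return whichever of the two candidates is preferred.
    Assumption: single elimination; an unpaired candidate advances with a bye.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[T] = list(candidates)
    while len(pool) > 1:
        next_round: List[T] = []
        # Compare neighbours pairwise; each comparison is one judgment.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out gets a bye
        pool = next_round
    return pool[0]

# Toy usage: candidates are labelled by guidance scale, and a stand-in
# "evaluator" prefers the candidate whose scale is closest to 2.0.
scales = [0.5, 1.0, 2.0, 4.0, 8.0]
winner = run_tournament(scales, lambda a, b: a if abs(a - 2.0) <= abs(b - 2.0) else b)
```

For $n$ candidates this needs $n - 1$ pairwise judgments per slice, far fewer than comparing every pair. The winner of each slice would then be added to the curated set used for fine-tuning.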

3. Key Contributions

Reward-Model-Free Preference Alignment: The paper introduces a method to integrate human feedback into diffusion models using CFG and binary signals, eliminating the need for a separate, complex reward model training phase.
SB-Based Artifact Suppression: By leveraging the Schrödinger Bridge formulation, the method enforces strict boundary consistency between the noisy CBCT input and the pseudo-target, effectively suppressing shade artifacts while preserving anatomical details.
Efficient Sampling: The framework achieves high-quality results with only 10 sampling steps, measured as the number of function evaluations (NFE = 10). This is significantly fewer than the hundreds of steps traditional diffusion models require, making it suitable for real-time clinical use.
Controllability and Interpretability: The model demonstrates the ability to not only suppress artifacts but also to synthetically generate artifacts when guided by "Bad" feedback, proving it has learned a controllable semantic space rather than just memorizing patterns.

4. Experimental Results

The method was evaluated on a dataset of 20 CBCT and 28 MDCT volumes (17 training, 3 testing subjects).

Artifact Suppression:
- ARR (Artifact Reduction Rate): 98.25% (Training) and 96.98% (Test).
- ARSR (Artifact Reduction Success Rate): 99.35% (Training) and 96.23% (Test).
- The proposed method outperformed state-of-the-art baselines (Park2022, Park2025) which relied on GANs and fine-tuning.
Structural Fidelity:
- Achieved superior SSIM (0.9971) and Dice Coefficient (83.95%) compared to baselines.
- LPIPS (Learned Perceptual Image Patch Similarity) was significantly lower (0.0015), indicating better perceptual similarity to expert-approved references.
Sampling Efficiency:
- Performance remained robust even at 10 NFE. Interestingly, increasing NFE beyond 10 led to slight metric degradation due to over-smoothing and distributional drift away from the artifact-prone pseudo-targets, validating the efficiency of the SB approach.
Ablation Studies:
- Removing the conditioning on the input CBCT ($z_0$) caused significant anatomical distortions, confirming $z_0$ is essential as a spatial anchor.
- The model successfully responded to "negative" requests (generating artifacts on clean images), demonstrating semantic controllability.
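
For readers unfamiliar with the Dice coefficient reported under Structural Fidelity, the sketch below shows the standard definition, $2|A \cap B| / (|A| + |B|)$, on binary masks. How the paper derives its masks (for example, by thresholding bone) is not stated in this summary, so the inputs here are illustrative and `dice_coefficient` is a hypothetical helper.

```python
import numpy as np

def dice_coefficient(pred_mask, ref_mask):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)

# Toy usage: two small masks that share one foreground pixel.
pred = np.array([[1, 1], [0, 0]])
ref = np.array([[1, 0], [0, 0]])
score = dice_coefficient(pred, ref)
```

A value of 1 means the masks coincide exactly and 0 means they do not overlap at all, so the reported 83.95% indicates substantial but not complete overlap.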

5. Significance

This work represents a significant advancement in medical image translation by bridging the gap between generative AI efficiency and clinical safety requirements.

Clinical Scalability: By removing the need for a reward model and reducing sampling steps to 10, the framework is computationally feasible for real-time clinical deployment.
Human-in-the-Loop: It provides a practical mechanism for radiologists to guide AI generation directly, ensuring outputs align with clinical preferences without complex retraining pipelines.
Robustness: The ability to handle pseudo-targets and suppress specific artifacts (shade) while maintaining anatomical integrity makes it highly relevant for dental and maxillofacial imaging where CBCT is common but MDCT is the gold standard.

In summary, the paper proposes a Human-Guided Schrödinger Bridge framework that effectively solves the CBCT-to-MDCT translation problem by combining the stability of optimal transport with the controllability of classifier-free guidance, achieving state-of-the-art artifact suppression with high efficiency.