RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation

The Big Problem: The "Blurry" vs. The "Crystal Clear"

Imagine you are a doctor planning radiation therapy for a cancer patient. You need a perfect, high-definition map of their insides to aim the radiation beams accurately. This map is usually a CT scan. It's like a crystal-clear, high-resolution photograph where every bone and organ is perfectly defined, and the numbers (called Hounsfield Units) tell you exactly how dense the tissue is.

However, when the patient is actually lying on the treatment table getting ready for radiation, the machine takes a different kind of picture called a CBCT (cone-beam CT).

The CBCT is like a grainy, low-light security camera photo. It's taken quickly, often has "artifacts" (glitches, streaks, or blurriness), and its density numbers (the Hounsfield Units) are unreliable.
The Goal: You want to turn that grainy security photo into a crystal-clear, high-definition photo so you can calculate the radiation dose safely.

The Old Way: Trying to Match Strangers

To teach a computer how to fix the grainy photo, you usually need a "teacher." The teacher needs to see a grainy photo and the exact same clear photo of the same person at the same time to learn the difference.

The Problem: In the real world, getting these "matched pairs" is a nightmare.

Patients move between scans.
Their organs shift (like a balloon deflating slightly).
Time passes between the two scans.
Sometimes, you just don't have the clear photo for that specific patient.

So, researchers have to teach the computer using unpaired data. Imagine trying to teach someone how to translate French to English, but you only have a pile of French books and a separate pile of English books, with no way to know which French sentence corresponds to which English sentence.

The New Solution: RAFM (The "Smart Librarian" Approach)

The authors of this paper introduced a new method called RAFM. They use a concept called Flow Matching, which is like a river flowing from a messy state (CBCT) to a clean state (CT).

Here is the tricky part: To teach the river how to flow, you need to pick a "start point" (grainy photo) and an "end point" (clear photo).

The Old Mistake: If you just grab a random grainy photo and pair it with a random clear photo, the computer gets confused. It's like trying to teach a French student to translate a sentence about "apples" by showing them an English sentence about "cars." The computer learns the wrong lessons, and the final image looks weird or distorted.
The RAFM Fix: The authors realized that even if the photos aren't from the same patient, they might look similar. A grainy photo of a hip bone should be paired with a clear photo of a hip bone, not a clear photo of a skull.

The "Smart Librarian" Analogy

Think of the computer as a student trying to learn.

The Memory Bank: The computer has a giant library of all the clear CT scans it has ever seen.
The Search Tool (DINOv3): The computer uses a super-smart "librarian" (an AI encoder) that can look at a grainy CBCT photo and instantly understand its "vibe" or "content."
The Retrieval: When the computer sees a grainy photo of a pelvis, the librarian doesn't just pick a random clear photo. It searches the library and finds the most similar clear photo of a pelvis.
The Lesson: Now, the computer learns: "Okay, to turn this specific grainy pelvis into a clear one, I should look at that specific clear pelvis."

This is the "Retrieval-Augmented" part of the name. The computer "retrieves" the best available teacher for every single lesson, even though it doesn't have the exact original pair.

How It Flows (The "River" Metaphor)

Once the computer has these "smartly matched" pairs, it uses Rectified Flow.

Imagine a river flowing from a muddy swamp (the grainy CBCT) to a pristine lake (the clear CT).
The computer learns the exact path the water should take to get from the mud to the lake without getting stuck or swirling in circles.
Because the "teachers" (the retrieved pairs) are now semantically similar (pelvis-to-pelvis), the river flows smoothly and directly. It doesn't waste energy trying to turn a pelvis into a skull.

Why This Matters

The paper tested this on a strict challenge where the computer was never allowed to see a matched pair of the same patient during training.

The Result: RAFM outperformed every method it was compared against on all reported metrics. It produced clearer images with fewer glitches, and it preserved the patient's anatomy (body structure) better than the previous AI models tested.
The Benefit: AI-generated "synthetic CTs" like these could help doctors plan radiation therapy more accurately, even when they only have the grainy, low-quality scans from the treatment machine.

Summary in One Sentence

RAFM is an AI method that learns to fix blurry medical scans by acting like a librarian: instead of guessing random matches, it searches a library of clear images to find the most similar one for every blurry image, allowing it to learn the transformation without needing the original "before and after" photos.

1. Problem Statement

Context: In radiotherapy, Cone-Beam CT (CBCT) is routinely acquired for image guidance but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, making it unsuitable for direct dose calculation. Generating a synthetic CT (sCT) from the CBCT is the approach taken here to address this.
Challenge: Supervised CBCT-to-CT translation requires paired data (CBCT and CT of the same patient at the same time). However, such pairs are difficult to obtain because of temporal gaps, anatomical variations, and registration errors. Consequently, the task must often be treated as unpaired translation.
Limitations of Existing Methods:

GANs: Prone to training instability and sensitivity to hyperparameters.
Diffusion/Schrödinger-Bridge Models: Often complex, computationally heavy, and may still rely on adversarial components.
Rectified Flow (RF): A promising non-adversarial alternative that models translation as deterministic transport between distributions. However, RF theoretically requires "endpoint couplings" (pairing a source sample $x_0$ with a target sample $x_1$). In small medical datasets with limited batch sizes, random pairing or batch-local pairing leads to semantically mismatched endpoints (e.g., pairing a CBCT of a hip with a CT of a knee), resulting in noisy transport targets and poor anatomical preservation.

2. Methodology: Retrieval-Augmented Flow Matching (RAFM)

The authors propose RAFM, a framework that integrates Rectified Flow (RF) with a retrieval-augmented strategy to construct high-quality pseudo-pairs without requiring ground-truth paired data.

Core Components:

Rectified Flow Formulation:
- The translation is modeled as a deterministic Ordinary Differential Equation (ODE): $\frac{dx_t}{dt} = v_\theta(x_t, t)$.
- The path is a straight-line interpolation between a source CBCT slice ($x_0$) and a target CT slice ($x_1$): $x_t = (1-t)x_0 + tx_1$.
- The model $v_\theta$ (a time-conditioned U-Net) is trained to predict the constant velocity vector $x_1 - x_0$.
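The straight-line path and its constant velocity target can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: slices are represented as flat lists of floats, and the names `interpolate`, `velocity_target`, and `training_example` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]  # a flattened image slice (assumed representation)


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0: Vector, x1: Vector) -> Vector:
    """Constant velocity x1 - x0 that the network is trained to predict."""
    return [b - a for a, b in zip(x0, x1)]


def training_example(x0: Vector, x1: Vector, t: float) -> Tuple[Vector, float, Vector]:
    """Network input (x_t, t) together with its regression target."""
    return interpolate(x0, x1, t), t, velocity_target(x0, x1)


# Toy three-pixel "slices" standing in for a CBCT source and a CT target.
cbct = [0.0, 2.0, 4.0]
ct = [1.0, 2.0, 0.0]
x_half, t_half, target = training_example(cbct, ct, 0.5)
print(x_half)   # [0.5, 2.0, 2.0]
print(target)   # [1.0, 0.0, -4.0]
```

Note that the target does not depend on $t$: every point along the path shares the same velocity, which is what makes the learned transport straight.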
Retrieval-Augmented Coupling (The Key Innovation):
- Instead of random or batch-local pairing, RAFM constructs semantic pseudo-pairs using a global memory bank.
- Feature Extraction: A frozen DINOv3 encoder is used to extract feature embeddings for all CT slices.
- Memory Bank: A rolling (FIFO) global memory bank stores CT feature-slice pairs. This bank is larger than the mini-batch but smaller than the full dataset to balance efficiency and coverage.
- Retrieval Process: For every CBCT slice in the current mini-batch, the model computes its feature embedding and retrieves the most similar CT slice from the global memory bank based on cosine similarity.
- Strict Unpaired Nature: The retrieval relies only on feature similarity, not on subject identity or temporal alignment. Thus, the framework remains strictly unpaired.
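The retrieval step above can be sketched as a fixed-capacity FIFO store queried by cosine similarity. This is a hedged illustration only: the class `MemoryBank`, its methods, and the two-dimensional toy features are assumptions made for the example, and real features would come from the frozen encoder rather than being written by hand.

```python
import math
from collections import deque
from typing import Any, Deque, Sequence, Tuple

Feature = Sequence[float]


def cosine_similarity(a: Feature, b: Feature) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryBank:
    """Rolling FIFO store of (CT feature, CT slice) entries (hypothetical class)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest entry once capacity is exceeded.
        self._entries: Deque[Tuple[Feature, Any]] = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._entries)

    def push(self, feature: Feature, ct_slice: Any) -> None:
        self._entries.append((feature, ct_slice))

    def retrieve(self, query: Feature) -> Any:
        """Return the stored CT slice whose feature is most similar to the query."""
        if not self._entries:
            raise ValueError("memory bank is empty")
        best = max(self._entries, key=lambda e: cosine_similarity(query, e[0]))
        return best[1]


bank = MemoryBank(capacity=3)
for feature, label in [([1.0, 0.0], "ct_a"), ([0.0, 1.0], "ct_b"),
                       ([1.0, 1.0], "ct_c"), ([-1.0, 0.0], "ct_d")]:
    bank.push(feature, label)  # the fourth push evicts "ct_a"

print(bank.retrieve([0.9, 0.1]))  # ct_c
```

In the toy run the closest entry, `ct_a`, has already been evicted, so the query falls back to `ct_c`. This shows the trade-off of a rolling bank: coverage is bounded by capacity, while lookup uses only feature similarity and never subject identity.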
Training Objective:
- The loss function minimizes the difference between the predicted velocity and the velocity of the retrieved pseudo-pair:
  $L_{RAFM} = \mathbb{E}_{(x_0, x_1) \sim \rho_{retr}, t \sim U(0,1)} \left[ \| v_\theta((1-t)x_0 + tx_1, t) - (x_1 - x_0) \|_2^2 \right]$
- Here, $\rho_{retr}$ is the retrieval-induced empirical coupling.
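As a rough sketch of how this objective could be estimated, the function below averages the squared velocity error over a set of pseudo-pairs and randomly sampled times. It is a Monte Carlo illustration under stated assumptions: `flow_matching_loss`, the `samples_per_pair` parameter, and the two stand-in "models" are hypothetical, and a real model would be a trained network rather than a fixed function.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
VelocityModel = Callable[[Vector, float], Vector]


def flow_matching_loss(
    model: VelocityModel,
    pairs: Sequence[Tuple[Vector, Vector]],
    rng: random.Random,
    samples_per_pair: int = 8,
) -> float:
    """Monte Carlo estimate of E[ || v(x_t, t) - (x1 - x0) ||^2 ] over the pairs."""
    if not pairs:
        raise ValueError("need at least one (x0, x1) pair")
    total = 0.0
    count = 0
    for x0, x1 in pairs:
        target = [b - a for a, b in zip(x0, x1)]
        for _ in range(samples_per_pair):
            t = rng.random()  # t ~ U(0, 1)
            x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
            pred = model(x_t, t)
            total += sum((p - v) ** 2 for p, v in zip(pred, target))
            count += 1
    return total / count


# One retrieved pseudo-pair with velocity target (3, 4).
pairs = [([0.0, 0.0], [3.0, 4.0])]


def zero_model(x: Vector, t: float) -> Vector:
    return [0.0, 0.0]  # predicts no motion at all


def oracle_model(x: Vector, t: float) -> Vector:
    return [3.0, 4.0]  # happens to know this pair's velocity


rng = random.Random(0)
print(flow_matching_loss(zero_model, pairs, rng))    # 25.0
print(flow_matching_loss(oracle_model, pairs, rng))  # 0.0
```

The loss value depends entirely on which pairs are fed in, which is why the quality of the coupling matters: mismatched pairs produce conflicting velocity targets for similar inputs.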
Inference:
- Given a CBCT input, the model solves the learned ODE from $t=0$ to $t=1$ (using 10-step Euler integration) to generate the synthetic CT.
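The 10-step Euler integration can be illustrated with a short loop. This sketch uses a hand-written constant velocity field as a stand-in for the trained network; `euler_integrate` and `constant_field` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
VelocityModel = Callable[[Vector, float], Vector]


def euler_integrate(velocity: VelocityModel, x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def constant_field(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: the same velocity everywhere."""
    return [1.0, -2.0]


synthetic = euler_integrate(constant_field, [0.0, 0.0], num_steps=10)
print(synthetic)  # close to [1.0, -2.0], up to floating-point rounding
```

When the velocity really is constant along the path, Euler integration is exact up to rounding regardless of step count. A learned field is only approximately straight, so the step count trades accuracy against speed.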

3. Key Contributions

Novel Framework: The authors present RAFM as the first application of Rectified Flow to unpaired medical image translation, addressing the specific challenge of small datasets and small batch sizes.
Retrieval-Augmented Strategy: Introduces a global memory bank and frozen DINOv3 encoder to construct semantically consistent pseudo-pairs, significantly improving coupling quality over random or batch-local methods.
Non-Adversarial Stability: Provides a fully non-adversarial alternative to GANs and to diffusion or Schrödinger-bridge models that retain adversarial components, which supports stable optimization and reliable structure preservation.
Strict Evaluation Protocol: Validates the method under a rigorous "subject-level true-unpaired" protocol where no CBCT-CT correspondence exists in the training set.

4. Experimental Results

Dataset: SynthRAD2023 (Pelvis), split into disjoint training sets for CBCT and CT (63 subjects each).
Baselines: Compared against GANs (CycleGAN, GcGAN, CUT) and Diffusion/SB methods (SynDiff, UNSB).

Quantitative Performance (RAFM vs. Best Baseline):

MAE (HU): 101.2 (vs. 104.2 for SynDiff) – Lower is better.
SSIM: 80.96% (vs. 80.27% for CUT) – Higher is better.
PSNR: 25.15 dB (vs. 24.94 dB for SynDiff) – Higher is better.
FID: 53.29 (vs. 62.91 for UNSB) – Lower is better, indicating superior distributional realism.
SegScore: 75.77% (vs. 72.07% for UNSB) – Higher is better, indicating better anatomical consistency for organ segmentation.
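For readers unfamiliar with the intensity metrics, MAE and PSNR can be computed as in the sketch below, which uses their standard definitions. The `data_range` passed to PSNR and any masking or clipping of HU values are assumptions here, since this summary does not state the evaluation settings, so the numbers are not comparable to those reported above.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute difference, in the same units as the inputs (e.g. HU)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def psnr(pred: Sequence[float], ref: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy HU values for four voxels of a synthetic CT and its reference CT.
synthetic_ct = [0.0, 100.0, -50.0, 200.0]
reference_ct = [10.0, 90.0, -50.0, 220.0]

print(mean_absolute_error(synthetic_ct, reference_ct))  # 10.0
print(psnr(synthetic_ct, reference_ct, data_range=1000.0))
```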

Qualitative Results:

RAFM produces images with cleaner artifact suppression and more stable anatomical structures compared to baselines.
Error maps show reduced structural distortion.

Ablation Studies:

Coupling Quality: Here K is the number of candidate CT slices available for matching. Random coupling (K=0) performs worst, batch-wise matching (K=4) improves on it, and retrieval-augmented coupling (K=512) yields the best performance.
Memory Bank Size: Performance peaks at K=512, with diminishing returns for larger banks.
Upper Bound: RAFM does not reach the performance of a fully paired RF model, which benefits from voxel-aligned supervision, but it comes close on anatomical consistency (SegScore 75.77% vs. 76.87% for paired).
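One intuition behind the coupling ablation is that searching a larger candidate pool can only improve the best available match. The toy sketch below shows this with random Gaussian vectors. These are not encoder features, the pool sizes simply mirror the K values above, and `best_match_similarity` is a hypothetical helper.

```python
import math
import random
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def best_match_similarity(query: Sequence[float], pool: Sequence[Sequence[float]]) -> float:
    """Highest cosine similarity between the query and any candidate in the pool."""
    return max(cosine(query, candidate) for candidate in pool)


rng = random.Random(0)
dim = 16


def random_feature() -> List[float]:
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


query = random_feature()
candidates = [random_feature() for _ in range(512)]

random_pair_sim = cosine(query, candidates[0])                # no search (random pairing)
batch_sim = best_match_similarity(query, candidates[:4])      # batch-local pool
bank_sim = best_match_similarity(query, candidates[:512])     # memory-bank pool

print(random_pair_sim <= batch_sim <= bank_sim)  # True
```

The ordering holds by construction because each pool contains the previous one. Whether a closer feature match translates into better translation quality is an empirical question, which the ablation results above address.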

5. Significance and Impact

Clinical Relevance: Aims to generate high-quality CT images suitable for dose calculation from routine CBCT scans without requiring difficult-to-obtain paired data, which would facilitate adaptive radiotherapy.
Methodological Advancement: Demonstrates that distribution-level coupling can be effectively approximated in small-data regimes through retrieval mechanisms, addressing a major bottleneck in applying flow matching to medical imaging.
Efficiency: Inference requires only 10 Euler ODE steps, fewer than the sampling steps typically needed by standard diffusion models, and the method avoids the dual-generator architecture of CycleGAN.
Open Source: The code is publicly available, promoting reproducibility and further research in unpaired medical image translation.