Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

Imagine you are a doctor looking at an X-ray of a patient's chest. You want to answer a "What if?" question: "What would this patient's X-ray look like if they had a specific type of pneumonia, but kept their exact same rib cage and heart shape?"

This is called Counterfactual Generation. It's like asking an artist to paint a new version of a photo where only the disease changes, but the person's body stays exactly the same.

The problem is that current AI artists (called Diffusion Models) are a bit clumsy. When you ask them to add a disease, they often get so excited about the new "story" that they accidentally redraw the patient's ribs, shift their heart, or blur out healthy lungs. It's like asking a chef to add salt to a soup and watching them replace the whole pot with a new soup instead.

This paper introduces a new set of "rules" for the AI to follow while it's painting, ensuring the anatomy stays perfect while the disease is added precisely. Here is how it works, using simple analogies:

The Two Big Problems

The "Global Drift" (Structural Instability):
Imagine the AI is a group of painters working on a giant mural. If you tell them to "paint a storm," they might start painting storm clouds over the whole mural, even the parts that were supposed to be a sunny meadow. In medical terms, the AI spreads the "disease" idea to the whole body, distorting the healthy parts.
The "Whisper" Problem (Pathological Instability):
Diseases in X-rays are often tiny and subtle (like a small shadow). The AI is like a radio that only hears loud voices. If the "disease signal" is too quiet, the AI ignores it or makes it too big and messy, failing to put it in the right spot.

The Solution: The "Traffic Cop" and the "Spotlight"

The authors created a system that acts like a Traffic Cop and a Spotlight during the AI's drawing process.

1. The Traffic Cop: Anatomy-Aware Attention

The Analogy: Imagine the AI is a delivery driver trying to drop off packages (disease details). Without rules, the driver might drop packages in the living room, the kitchen, and the bedroom, even if the order was only for the bedroom.
The Fix: The authors give the AI an Organ Mask (a digital stencil of the lungs and heart). They put up "Do Not Cross" signs (gates) around the healthy areas.
How it works: When the AI tries to move information around, the Traffic Cop stops it from spreading the "disease" idea into the healthy ribs or heart. It forces the AI to keep the structural parts (bones, heart shape) locked in place, only allowing changes inside the specific "Lung Zone."

2. The Spotlight: Pathology-Guided Attention

The Analogy: Now, imagine the AI is trying to find a tiny needle in a haystack (the disease). It's struggling to see it.
The Fix: The authors turn on a Spotlight. They tell the AI, "Hey, look right here in the lung. That's where the disease goes."
How it works:
- Amplification: They boost the signal for the disease tokens (the "needle") so the AI pays extra attention to them.
- The "Energy" Check: The system constantly checks: "Is the disease actually staying in the lung, or is it leaking out?" If the disease starts to wander into the wrong area, the system gently pushes the AI back on track, like a gardener pruning a plant to keep it growing in the right direction.

The Result

By using these two tricks while the AI is working (without needing to retrain the whole AI from scratch), the system can:

Keep the skeleton: The ribs and heart look exactly like the original patient.
Add the disease: The pneumonia or fluid appears exactly where the doctor asked, looking realistic and contained.

Why Does This Matter?

This matters for two reasons:

Medical Training: Doctors can practice on "What if" scenarios. They can see how a specific disease would look on a specific patient's unique body without needing a real patient with that disease.
Data Boosting: AI models need thousands of examples to learn. This method can create many realistic variations of X-rays to help train better diagnostic computers, all while keeping the patient's unique anatomy safe and sound.

In short, this paper teaches the AI to work like a scalpel rather than a sledgehammer, ensuring that when we simulate a disease, we don't accidentally break the patient's body in the process.

1. Problem Definition

The paper addresses the challenge of Counterfactual Medical Image Generation for Chest X-rays (CXRs). The goal is to simulate plausible pathological changes (e.g., adding pleural effusion or cardiomegaly) to a patient's X-ray while strictly preserving the patient's unique, stable anatomical structures (lung shape, ribs, cardiac contours).

Key Challenges Identified:

Structural Instability (Global Drift): In standard diffusion models, global anatomical semantics stabilize early and propagate through self-attention. When a pathology prompt is introduced, this global propagation often causes "structural drift," where changes unintentionally spread to non-target regions, distorting healthy anatomy.
Pathological Expression Instability: Medical lesions are often subtle, spatially confined, and heterogeneous. This leads to weak attention responses during generation, causing lesions to be suppressed, diffused, or localized inaccurately.
Training Overhead: Existing solutions often require domain-specific retraining or learnable control branches, which are costly and difficult to deploy across different institutions due to data variability.

2. Methodology

The authors propose an inference-time attention regulation framework. This approach imposes constraints on the diffusion sampling process without requiring additional model training, thereby improving generalizability and reducing maintenance costs.

The framework operates on a conditional diffusion model (based on Stable Diffusion v1.5) and introduces two core modules:

A. Anatomy-Aware Attention Regularization

Goal: Prevent structural drift and preserve anatomical consistency.
Mechanism:
- Uses an organ mask ($M_{anat}$) derived from the input image (e.g., lung/heart segmentation).
- Self-Attention Gating: At each denoising step, the self-attention map ($S_t$) is gated by the downsampled organ mask.
- Formula: $S^{anat}_t = S_t \odot (M_{anat} \downarrow q)$, where $\odot$ is element-wise multiplication and $M_{anat} \downarrow q$ is the organ mask downsampled to the resolution of the attention map.
- Effect: This restricts anatomy-driven interactions to valid anatomical regions, suppressing the propagation of structural semantics into pathology-sensitive areas and preventing unintended distortions in non-target regions.
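The gating step above can be illustrated with a small, dependency-free sketch. It is only one possible reading of the formula: the helper names (`downsample_mask`, `gate_self_attention`), the use of average pooling for the downsampling, the choice to apply the mask along the key axis, and the optional renormalization are all assumptions made for illustration, not details confirmed by the paper.

```python
def downsample_mask(mask, q):
    """Average-pool a square binary mask (list of rows) down to q x q.

    Assumption: average pooling; the paper's downsampling may differ.
    """
    size = len(mask)
    block = size // q  # assumes size is divisible by q
    pooled = []
    for i in range(q):
        row = []
        for j in range(q):
            cells = [mask[i * block + di][j * block + dj]
                     for di in range(block) for dj in range(block)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def gate_self_attention(attn, mask_q, renormalize=False):
    """Gate an N x N self-attention map (N = q*q) by a q x q mask.

    Assumption: the flattened mask weights the key axis, so every query
    row is multiplied element-wise by the same mask values.
    """
    flat = [v for row in mask_q for v in row]
    gated = []
    for row in attn:
        weighted = [a * m for a, m in zip(row, flat)]
        total = sum(weighted)
        if renormalize and total > 0:
            weighted = [w / total for w in weighted]
        gated.append(weighted)
    return gated


# 4x4 organ mask whose left half is inside the anatomy, pooled to 2x2.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0]]
mask_q = downsample_mask(mask, 2)          # [[1.0, 0.0], [1.0, 0.0]]
uniform = [[0.25] * 4 for _ in range(4)]   # 4 tokens attending uniformly
gated = gate_self_attention(uniform, mask_q)
print(gated[0])                            # [0.25, 0.0, 0.25, 0.0]
```

In this toy case, attention to the two positions outside the mask is zeroed while the weights inside it are left untouched. Passing `renormalize=True` rescales each row to sum to one again, which is a design choice the formula itself does not specify.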

B. Pathology-Guided Attention Regulation

Goal: Enhance the localization and extent control of pathological edits.
Mechanism:
- Spatial Prior: Constructs a sample-specific spatial prior map ($\Omega$) based on the text prompt (e.g., "right lung") and organ masks, downsampled to the attention resolution.
- Cross-Attention Reweighting: During early denoising steps ($t < \mu T$), the cross-attention weights for pathology-related tokens are enhanced within the target region using a soft multiplier $(1 + \eta \Omega)$.
- Latent Correction: Introduces a differentiable "concentration metric" ($score_{t,k}$) to measure how well pathology tokens align with the target ROI. A lightweight gradient-based correction is applied to the intermediate latent ($z_t$) to minimize a "pathology energy" loss ($L_{path}$), steering the denoising trajectory toward better lesion localization.
- Formula: $\hat{z}_t \leftarrow z_t - \alpha_t \nabla_{z_t} L_{path}(t)$.
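The reweighting and the concentration check described above can be sketched on a toy attention map. This is a minimal illustration under stated assumptions: the function names are hypothetical, the concentration score is a simple stand-in (the fraction of a token's attention mass that falls inside $\Omega$) rather than the paper's exact metric, and the step condition $t < \mu T$ is omitted for brevity.

```python
def reweight_cross_attention(attn, prior, pathology_tokens, eta):
    """Scale pathology-token attention by (1 + eta * prior) per position.

    attn:  list over spatial positions, each a list of per-token weights.
    prior: list over spatial positions with values in [0, 1] (the map Omega).
    """
    out = []
    for weights, omega in zip(attn, prior):
        out.append([w * (1.0 + eta * omega) if k in pathology_tokens else w
                    for k, w in enumerate(weights)])
    return out


def concentration_score(attn, prior, token):
    """Fraction of a token's attention mass that lies inside the prior.

    Assumption: an illustrative stand-in for the paper's metric.
    """
    inside = sum(w[token] * omega for w, omega in zip(attn, prior))
    total = sum(w[token] for w in attn)
    return inside / total if total > 0 else 0.0


# Four spatial positions, two prompt tokens; token 1 is the pathology word.
prior = [1.0, 1.0, 0.0, 0.0]   # target region = first two positions
attn = [[0.5, 0.2], [0.5, 0.2], [0.5, 0.3], [0.5, 0.3]]

before = concentration_score(attn, prior, token=1)
boosted = reweight_cross_attention(attn, prior, pathology_tokens={1}, eta=1.0)
after = concentration_score(boosted, prior, token=1)
print(round(before, 3), round(after, 3))   # 0.4 0.571
```

With $\eta = 1$, the pathology token's weights inside the target region double while everything else is unchanged, so the share of its attention inside the region rises from 0.4 to about 0.571.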

3. Key Contributions

Inference-Time Framework: A novel method that achieves controllable counterfactual generation without recurring retraining, making it adaptable to cross-device and cross-domain shifts.
Dual-Attention Regulation: A joint strategy that regularizes self-attention (to preserve structure) and cross-attention (to inject pathology), effectively disentangling anatomical consistency from pathological variation.
Lightweight Latent Correction: A gradient-based update mechanism that refines the denoising trajectory to ensure lesions are accurately localized and confined to the intended regions.
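The latent correction update, $\hat{z}_t \leftarrow z_t - \alpha_t \nabla_{z_t} L_{path}(t)$, can be illustrated with a toy example. Everything here is an assumption made for the sketch: the "latent" is a short list of logits, attention is taken to be `softmax(z)` rather than the output of a denoiser's cross-attention layers, the energy is one minus the attention mass inside the target region, and central finite differences stand in for automatic differentiation.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pathology_energy(z, prior):
    """Toy L_path: 1 minus the attention mass inside the target region."""
    attn = softmax(z)
    return 1.0 - sum(a * omega for a, omega in zip(attn, prior))


def numerical_grad(fn, z, eps=1e-5):
    """Central finite differences, standing in for autodiff."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def correct_latent(z, prior, alpha):
    """One update step: z_hat = z - alpha * grad_z L_path."""
    grad = numerical_grad(lambda v: pathology_energy(v, prior), z)
    return [zi - alpha * gi for zi, gi in zip(z, grad)]


prior = [1.0, 1.0, 0.0, 0.0]   # target region = first two positions
z = [0.0, 0.0, 0.0, 0.0]       # uniform attention to start
z_hat = correct_latent(z, prior, alpha=1.0)

energy_before = pathology_energy(z, prior)
energy_after = pathology_energy(z_hat, prior)
print(energy_before, energy_after)  # energy decreases after the step
```

In this toy setting a single step moves attention mass toward the target region, lowering the energy from 0.5 to roughly 0.44. A real implementation would backpropagate through the denoiser to obtain the gradient with respect to $z_t$.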

4. Experimental Results

The method was evaluated on the MIMIC-CXR-JPG and CheXpert Plus datasets and compared against state-of-the-art baselines such as SD-inpainting, PIE, BiomedJourney, and ProgEmu.

Quantitative Performance:
- Pathological Accuracy: Achieved the highest Confidence (Conf) score (0.709), indicating better alignment with the target prompt.
- Image Quality: Achieved the best CLIP-I score (0.870) and competitive FID (29.0) and LPIPS (0.18) scores, demonstrating high realism and distributional fidelity.
Qualitative Analysis:
- Visual comparisons show that the proposed method maintains background stability better than instruction-based editing methods.
- Pathological changes are more accurate and tightly confined to relevant regions compared to inpainting baselines, which often suffer from leakage or structural distortion.
Ablation Study:
- Removing Anatomy Self-Attention Gating reduced structural consistency (SSIM fell from 0.80 to 0.76).
- Removing Pathology Cross-Attention Regulation reduced pathological accuracy (Conf fell from 0.71 to 0.66).
- Latent Correction provided a consistent, albeit smaller, improvement in final outcome stability.
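To put the ablation numbers in proportion, the relative changes can be computed directly from the values reported above. The helper name `relative_drop` is introduced here purely for illustration.

```python
def relative_drop(full, ablated):
    """Relative decrease of a metric when a module is removed."""
    return (full - ablated) / full


ssim_drop = relative_drop(0.80, 0.76)   # anatomy self-attention gating removed
conf_drop = relative_drop(0.71, 0.66)   # pathology cross-attention removed
print(f"SSIM: {ssim_drop:.1%}, Conf: {conf_drop:.1%}")  # SSIM: 5.0%, Conf: 7.0%
```

On these figures, removing either attention module costs roughly 5 to 7 percent of the corresponding metric.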

5. Significance

This work provides a robust solution for clinically meaningful "what-if" scenarios in medical imaging. By ensuring that counterfactual edits are both anatomically consistent and pathologically precise, the framework supports:

Model Interpretability: Helping clinicians understand how specific diseases alter imaging features without confounding anatomical variations.
Data Augmentation: Generating high-quality, localized synthetic data for training downstream diagnostic models, particularly for rare or underrepresented conditions.
Disease Progression Modeling: Simulating the evolution of diseases in a controlled, patient-specific manner.

The paper demonstrates that careful regulation of attention mechanisms at inference time can overcome the structural instability inherent in diffusion models, offering a scalable alternative to heavy retraining approaches.