Towards Controllable Video Synthesis of Routine and Rare OR Events

Imagine you are trying to teach a robot how to be a perfect operating room (OR) assistant. The robot needs to know how to spot dangerous mistakes, like a team member accidentally breaching the sterile field, and how to help coordinate a complex surgery.

To teach this robot, you need a massive library of videos showing every possible scenario: the boring routine stuff, the rare accidents, and the "what-if" disasters.

Here is the problem: You can't just film these things.

Routine stuff is easy to film, but it's boring.
Rare disasters (like a sterile breach) are incredibly hard to catch on camera because they happen so rarely.
Deliberately causing accidents to film them is unethical and dangerous. You can't tell a surgeon, "Hey, please drop a dirty tool on the patient's open wound so we can film it."

So, researchers are stuck. They have a "data bottleneck": they can't build smart AI because they don't have enough training videos of the scary, rare stuff.

The Solution: The "Geometric LEGO" Simulator

This paper introduces a clever new tool that acts like a video game engine for the operating room. Instead of filming real people, the system "dreams" up realistic videos using a simple, abstract language: Geometric Primitives (Ellipsoids).

Think of the operating room not as a complex scene with doctors, nurses, and shiny metal tools, but as a simple game of LEGO or Pong:

The Patient is a blue oval.
The Surgeon is a red oval.
The Nurse is a green oval.
The Equipment is a yellow oval.

How It Works (The Three Steps)

The researchers built a three-part machine to turn these simple shapes into realistic movies:

1. The Translator (Geometric Abstraction)
First, the system looks at a real video of an operation. It ignores the faces, the scrubs, and the blood. Instead, it translates the scene into our "LEGO" language. It draws an oval around the surgeon and tracks where that oval moves. It's like turning a complex movie into a simple stick-figure animation.

2. The Director (Conditioning Module)
This is the magic part. Because the scene is now just simple shapes, a human can easily "direct" the movie.

Scenario A (Routine): The system can replay a known routine by just moving the ovals along their original paths.
Scenario B (The "What-If"): This is the game-changer. A user can grab the "Surgeon Oval" with their mouse and drag it to a new path. Maybe they drag the surgeon oval too close to the patient oval. The system understands: "Oh, you want to see what happens if the surgeon gets too close to the sterile field?"

3. The Artist (Diffusion Model)
Once the "Director" has set the path for the ovals, the "Artist" (a powerful AI called a Diffusion Model) takes over. It looks at the simple moving ovals and says, "I know what a surgeon looks like. I know what a sterile field looks like. I will now paint a hyper-realistic video that matches these moving shapes."

It fills in the details: the texture of the scrubs, the gleam of the metal, the lighting, and the realistic movement of the arms, all while strictly following the path you drew with the ovals.

Why This Is a Big Deal

The researchers tested this by creating a fake dataset of "near-miss" accidents (times when someone almost broke the sterile rules but didn't quite touch).

The Result: They trained a new AI detector on these fake, generated videos.
The Success: When they tested this AI, it caught about 70% of the dangerous near-misses (a recall of 70.13%).

This suggests that you don't need to film real accidents to teach AI how to spot them. You can just draw the shapes on a screen, let the AI "imagine" the realistic video, and use that to train safety systems.

The Analogy Summary

Imagine you want to train a self-driving car to handle a specific, rare accident (like a deer jumping out in front of a bus). You can't wait for it to happen, and you can't crash a real bus to test it.

Instead, you use a simulator. You draw a simple box for the bus and a simple circle for the deer. You move the circle into the path of the box. The simulator then renders a photorealistic video of a bus and a deer crashing, based on your simple drawing. You use that video to teach the car's brain.

This paper does exactly that for operating rooms. It turns complex, dangerous medical scenarios into simple, movable shapes, allowing us to generate infinite "what-if" training videos safely, ethically, and on demand.

1. Problem Statement

The development of ambient intelligence for operating rooms (ORs) that can detect, understand, and mitigate safety-critical events (e.g., sterile-field breaches) is hindered by a critical data bottleneck.

Scarcity of Rare Events: Safety-critical or atypical events (like near-misses or protocol violations) are inherently rare in clinical practice.
Ethical and Operational Barriers: Deliberately inducing these events for data collection is ethically impermissible due to patient safety risks. Furthermore, manual curation and staged reenactments are not scalable due to clinical variability, staffing limits, and operational disruption.
Limitations of Existing Generative Models: Current video diffusion models (e.g., Stable Video Diffusion, WAN) rely on text prompts or single keyframes, which lack the fine-grained control necessary to dictate specific object positions, orientations, and interactions required for simulating complex surgical workflows.

2. Methodology

The authors propose an OR Video Diffusion Framework that reformulates video generation as a controlled task conditioned on abstract geometric scene representations. The framework consists of three core modules:

A. Geometric Abstraction Module

This module transforms raw OR video frames into a simplified, implicit scene graph $G = (V, E)$:

Entity Representation: OR personnel, patients, and equipment are represented as 3D ellipsoids.
Feature Encoding: Each ellipsoid node encodes:
- Geometric Attributes: 2D centroid position, 3D spatial spread (height, width, rotation), and normalized relative depth.
- Class Information: Encoded via Red/Green color channels (based on 36 semantic classes from the MMOR dataset).
- Depth: Encoded via the Blue channel intensity.
Pipeline: Utilizes SAM2 (for instance segmentation) and Video Depth Anything (for depth estimation) to extract these parameters from the initial scene.
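As a rough illustration of the abstraction step described above, the sketch below fits a 2D ellipse to a binary instance mask using second-order moments and renders it into an RGB conditioning image. It is not the authors' implementation. The function names, the 2-sigma axis scaling, and the 6x6 grid used to map the 36 class ids onto Red/Green levels are all assumptions made for this example; the summary only states that class goes into the Red/Green channels and depth into the Blue channel. In the real pipeline the mask would come from SAM2 and the depth value from Video Depth Anything; here both are toy stand-ins.

```python
import numpy as np


def fit_ellipse(mask):
    """Fit an ellipse to a boolean mask (at least two pixels) via second-order moments.

    Returns (cx, cy, semi_major, semi_minor, angle_rad).
    """
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs - cx, ys - cy]))  # 2x2 covariance of pixel coords
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    # Assumption: 2 standard deviations per axis. For a uniformly filled
    # ellipse this recovers the semi-axis length.
    semi_minor, semi_major = 2.0 * np.sqrt(np.maximum(evals, 1e-6))
    major_dir = evecs[:, 1]
    angle = np.arctan2(major_dir[1], major_dir[0])
    return cx, cy, semi_major, semi_minor, angle


def class_to_rg(class_id):
    """Hypothetical mapping of a class id (0..35) onto a 6x6 grid of R/G levels."""
    return (class_id % 6) * 51, (class_id // 6) * 51


def render_ellipse(canvas, cx, cy, a, b, angle, class_id, depth):
    """Paint one ellipse into an (H, W, 3) uint8 canvas; depth is in [0, 1]."""
    h, w = canvas.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dx, dy = xx - cx, yy - cy
    # Rotate pixel offsets into the ellipse's own axes.
    u = dx * np.cos(angle) + dy * np.sin(angle)
    v = -dx * np.sin(angle) + dy * np.cos(angle)
    inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
    r, g = class_to_rg(class_id)
    canvas[inside] = (r, g, int(round(255 * depth)))  # class in R/G, depth in B
    return canvas


# Toy usage: a rectangle stands in for a segmentation mask.
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:50] = True
cx, cy, a, b, angle = fit_ellipse(mask)
canvas = render_ellipse(np.zeros((64, 64, 3), np.uint8),
                        cx, cy, a, b, angle, class_id=7, depth=0.4)
```

When several entities overlap, painting them from farthest to nearest would let closer ellipsoids occlude distant ones; whether the paper orders them this way is not stated in this summary.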

B. Conditioning Module

This module generates a temporal sequence of abstract geometric scenes to guide the diffusion process. It supports two pathways:

Routine Events: Derives trajectories from known OR event templates.
Counterfactual/Rare Events: Allows user-defined trajectories. An interactive tool (built with OpenCV/Pygame) enables users to "drag and drop" ellipsoids to sketch desired movement paths (e.g., moving a non-sterile assistant closer to a sterile field). These trajectories are interpolated across the video sequence to create the conditioning input.
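The interpolation of a sketched trajectory across the video sequence can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming linear interpolation and waypoints that are evenly spaced in time; the summary does not say which interpolation scheme the authors use, and `interpolate_trajectory` is a hypothetical helper name.

```python
import numpy as np


def interpolate_trajectory(waypoints, num_frames):
    """Linearly interpolate (x, y) waypoints to one centroid per video frame.

    Assumption: waypoints are evenly spaced in time.
    """
    pts = np.asarray(waypoints, dtype=float)
    key_times = np.linspace(0.0, 1.0, len(pts))     # when each waypoint is hit
    frame_times = np.linspace(0.0, 1.0, num_frames)  # one timestamp per frame
    xs = np.interp(frame_times, key_times, pts[:, 0])
    ys = np.interp(frame_times, key_times, pts[:, 1])
    return np.stack([xs, ys], axis=1)


# Toy usage: three dragged positions expanded to five frames.
path = interpolate_trajectory([(100, 200), (150, 180), (300, 180)], num_frames=5)
# path rows: (100, 200), (125, 190), (150, 180), (225, 180), (300, 180)
```

Each row of `path` would then give the ellipsoid centroid for one frame of the conditioning sequence, with the remaining ellipsoid parameters either held fixed or interpolated the same way.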

C. Diffusion Module

Backbone: Uses LTX-Video, a transformer-based latent video diffusion model.
Training Strategy: Fine-tuned using In-Context LoRA (IC-LoRA). The model is conditioned on the initial OR frame and the sequence of rendered abstract geometric representations.
Enhancement: A PatchGAN loss is integrated during fine-tuning to improve local realism and fidelity, addressing blurriness in generated entities.
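For readers unfamiliar with LoRA, the sketch below shows the generic low-rank update that LoRA-style fine-tuning applies to a frozen linear layer. It is a plain NumPy illustration of the standard LoRA formulation only. It does not reproduce IC-LoRA's in-context conditioning, the LTX-Video architecture, or the PatchGAN term, and the class name, rank, and scaling values are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                            # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable down-projection
        self.B = np.zeros((out_dim, rank))              # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; only A and B would receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


# Toy usage: with B at zero, the adapted layer matches the frozen layer exactly.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
layer = LoRALinear(W)
x = rng.normal(size=(3, 16))
y = layer(x)
```

Because `B` starts at zero, fine-tuning begins from the pretrained model's behavior and only the small matrices `A` and `B` need to be stored per adaptation, which is the usual motivation for LoRA-style fine-tuning of large backbones.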

3. Key Contributions

Novel Framework: Introduction of a geometry-conditioned diffusion framework that enables controlled, scalable synthesis of OR events using ellipsoid-based entity representation and trajectory sketches.
Synthesis of Rare Events: Demonstration of the ability to generate counterfactual, rare, and safety-critical scenarios (e.g., sterile-field violations) that are impossible or unethical to collect in real life.
Synthetic Data for Downstream AI: Curation of a synthetic dataset to train AI models for detecting "near-miss" sterile-field violations, achieving 70.13% recall with a ViT-B/16 detector without real-world risk.
Performance Optimization: Augmentation of baseline fine-tuning with PatchGAN loss to significantly enhance video fidelity.

4. Results

Quantitative Performance

Baseline Comparison: The proposed method outperforms off-the-shelf baselines (SVD, WAN, LTX-Base) on both in-domain (MMOR) and out-of-domain (4DOR) datasets.
- Metrics: Achieved lower FVD (689.88 vs. 2439.33 for LTX-Base on MMOR) and LPIPS, and higher SSIM (0.86) and PSNR (23.21).
- Controllability: High Bounding Box IoU (0.96) and Segmentation IoU (0.95) indicate strong alignment between the user-defined geometric conditioning and the generated video content.
Rare Event Synthesis: Outperformed the DragNUWA baseline (a sketch-conditioned model) in generating rare/atypical events, achieving a DOVER score of 0.52.
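The controllability metrics reported above are intersection-over-union (IoU) scores. The sketch below shows the standard IoU computation for axis-aligned boxes and for binary masks. How the paper obtains the boxes and masks being compared (for example, by running a detector or segmenter on generated frames) is not specified in this summary, so the inputs here are toy values and the function names are hypothetical.

```python
import numpy as np


def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(mask_a, mask_b):
    """IoU of two boolean masks of the same shape."""
    a, b = np.asarray(mask_a, dtype=bool), np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0


# Toy usage: two 10x10 boxes offset by half their width overlap with IoU 1/3.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

An IoU near 1.0 means the generated entity occupies almost exactly the region the conditioning asked for, which is why values such as 0.96 and 0.95 are read as strong controllability.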

Downstream Task (Near-Miss Detection)

A model trained on the generated synthetic data (87 videos, 678 training frames) was tested on detecting near-misses of sterile-field violations.
ViT-B/16 achieved a Recall of 70.13%, demonstrating the utility of synthetic data for training safety-critical detection models.
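Recall is the fraction of true near-miss cases that the detector flags, TP / (TP + FN). The short sketch below computes it on made-up labels; the label lists are illustrative only and are not data from the paper.

```python
def recall(y_true, y_pred):
    """Recall for binary labels, where 1 marks a near-miss."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Toy usage: three true near-misses, two of them detected.
score = recall([1, 1, 1, 0, 0], [1, 0, 1, 1, 0])  # 2 / 3
```

Recall is a natural headline metric for safety monitoring because a missed near-miss (a false negative) is typically more costly than a false alarm, though precision would also be needed to judge how often the detector raises spurious alerts.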

Ablation Study

Geometry vs. Segmentation: Direct segmentation mask conditioning yielded better reconstruction metrics (FVD 347.88) but lacked controllability. The ellipsoid-based approach maintained strong performance (FVD 487.20 with PatchGAN) while enabling flexible trajectory manipulation.
PatchGAN Loss: The inclusion of PatchGAN loss significantly improved FVD and segmentation IoU, confirming its role in enhancing local fidelity.

5. Significance and Future Directions

Scalable Data Curation: This work provides a scalable solution to the "data scarcity" problem in surgical AI, enabling the generation of diverse, safety-critical training data without ethical concerns.
Ambient Intelligence: It lays the groundwork for robust ambient intelligence systems that can proactively detect and mitigate surgical risks.
Limitations:
- Granularity: The ellipsoid representation limits fine-grained articulation control (e.g., specific arm movements), relying on the model to implicitly learn interactions.
- Generalization: Performance may degrade in environments significantly different from the training data (e.g., different sterile attire colors or emergency trauma settings).
- Clinical Validation: Formal evaluation by independent surgeons is required to validate the clinical utility of the generated videos.

Conclusion: The paper successfully demonstrates that abstract geometric conditioning allows for the controllable synthesis of complex OR workflows. This approach bridges the gap between the need for large-scale, diverse training data and the ethical constraints of real-world surgical data collection.