TopoOR: A Unified Topological Scene Representation for the Operating Room

Imagine you are trying to understand a complex dance performance, like a ballet.

The Old Way (Current Technology):
Most current computer programs try to understand this dance by looking at pairs of dancers. They ask: "Is the Lead Dancer holding the Follower?" or "Is the Musician playing near the Stage?" They draw a simple line (a graph) between two people at a time.

The problem? A ballet isn't just a series of one-on-one interactions. It's a group effort where five people move in a specific, synchronized pattern to create a single moment. If you only look at pairs, you miss the "group magic." You lose the context of the whole scene. It's like trying to understand a symphony by only listening to the violin and the drum separately, ignoring how they play together.

The New Way (TopoOR):
The paper introduces TopoOR, a new way for computers to "see" and understand a surgical operating room (OR). Instead of just drawing lines between pairs, TopoOR builds a 3D, multi-layered web that captures the whole group dynamic at once.

Here is how it works, using simple analogies:

1. The "Lego" vs. The "Molecule"

Old Method: Think of the operating room as a pile of loose Lego bricks. The computer looks at two bricks and says, "These two are touching." It misses the fact that those bricks are part of a specific castle shape.
TopoOR: TopoOR sees the entire castle. It understands that the Surgeon, the Robot, the Saw, and the Patient aren't just separate items; they are part of a single, complex "molecule" of action happening right now. It treats the whole group interaction as one solid unit.

2. The "Traffic Control" Analogy

In a busy operating room, everyone is moving:

The Surgeon is guiding the Robot.
The Robot is holding a Saw.
The Saw is cutting the Patient's bone.
The Nurse is watching the Monitor.

If you only look at "Surgeon + Robot," you miss the fact that the Robot is also holding the Saw, which is touching the Patient.
TopoOR acts like a super-smart traffic controller. It doesn't just track individual cars; it tracks the entire traffic flow, the intersection, and the pedestrians all at once. It understands that if the Surgeon moves left, the Robot must move left, and the Saw must follow, all while the Nurse watches the screen.

3. Keeping the "Flavor" of Each Sense

Operating rooms are messy with different types of data:

Video (what the camera sees).
Audio (what the microphones hear).
Robot Logs (what the machine is thinking).
3D Movement (where people are standing).

The Old Problem: Previous AI models tried to force all these different things into one giant "soup" (a single list of numbers). It's like trying to mix oil, water, and sand into one smoothie. You lose the texture of the oil and the crunch of the sand. The computer gets confused and loses important details.

The TopoOR Solution: TopoOR keeps the "flavor" of each sense separate but connected. It keeps the audio as audio and the video as video, but it builds a special bridge (called a Higher-Order Attention Network) that lets them talk to each other without mixing them up. It's like having a team of specialists (a chef, a musician, a mechanic) sitting around a table, each keeping their own tools, but working together to solve a problem.

Why Does This Matter?

The authors tested this on a real dataset of surgeries and report strong results on three critical things:

Spotting Mistakes (Sterility Breach): If a non-sterile person (like a technician) gets too close to the sterile patient, TopoOR catches it immediately because it understands the group space, not just individual positions.
Predicting the Next Move: It can guess what the surgeon will do next better than old models because it understands the flow of the group action.
Knowing the Phase: It knows exactly what "chapter" of the surgery is happening (e.g., "Calibrating the Robot" vs. "Cutting Bone") because it sees the whole picture, not just fragments.

The Bottom Line

TopoOR is like upgrading from a black-and-white, two-dimensional sketch of a surgery to a full-color, 3D, real-time hologram that understands how everyone and everything interacts as a team.

By respecting the complex, "group" nature of surgery instead of breaking it down into simple pairs, TopoOR makes the AI safer, smarter, and more ready to help doctors in the real world.

Here is a detailed technical summary of the paper "TopoOR: A Unified Topological Scene Representation for the Operating Room."

1. Problem Statement

The paper addresses the limitations of current Surgical Scene Graphs (SSGs) and Vision-Language Models (VLMs) in modeling the complexity of the operating room (OR).

Dyadic Limitation: Existing paradigms rely on strictly pairwise (dyadic) interactions (e.g., Surgeon–Robot). This artificially fragments unified, multi-actor loops (e.g., a surgeon guiding a robot saw based on monitor feedback) into isolated links, stripping away joint spatial and kinematic constraints.
Manifold Flattening: Multimodal OR data (3D human motion in SE(3), robot kinematics, audio, RGB) resides on distinct geometric manifolds. Current methods often force this heterogeneous, non-Euclidean data into a single joint latent space (tokenization). This "flattens" the manifold geometry, discarding critical metric and topological structures essential for safety-critical reasoning.
Semantic Gap: There is a disconnect between the theoretical ambition of holistic scene understanding and practical implementation, where complex group dynamics are lost in favor of simplified graph structures.

2. Methodology

The authors propose TopoOR, a framework rooted in Algebraic Topology and Topological Deep Learning (TDL) to model the OR as a Combinatorial Complex (CC) rather than a standard graph.

A. Combinatorial Complex (CC) Construction

Instead of nodes and edges, the OR is modeled as a hierarchy of "cells" with different ranks:

Rank-0 ($X_0$): Atomic entities. Includes human anatomical joints (3D pose), 3D-localized objects (tools, patient), and auxiliary evidence nodes (audio, robot logs, screen captures).
Rank-1 ($X_1$): Pairwise interactions. Includes intra-entity skeleton edges, dynamic spatial edges (proximity-based), and domain-specific semantic links (e.g., Technician–Robot).
Rank-2 ($X_2$): Higher-order behavior. These are "hypercells" that encapsulate irreducible group dynamics (e.g., a complex involving {Surgeon, Robot, Saw, Patient}). This allows the model to capture polyadic interactions that cannot be decomposed into pairs.
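
The rank hierarchy above can be made concrete with a small sketch. This is an illustrative toy, not the authors' code: the class name `CombinatorialComplex`, its methods, and the choice of one rank-0 cell per entity (rather than per anatomical joint) are all assumptions made for brevity. It only relies on the standard definition of a combinatorial complex, in which cells are sets of atomic entities and a cell can never have a lower rank than a cell it contains.

```python
from dataclasses import dataclass, field


@dataclass
class CombinatorialComplex:
    """Toy combinatorial complex: each cell is a frozenset of entity names with a rank."""

    rank_of: dict = field(default_factory=dict)  # maps frozenset -> int rank

    def add_cell(self, members, rank):
        cell = frozenset(members)
        # Rank must be order-preserving: if x is contained in y, rank(x) <= rank(y).
        for other, other_rank in self.rank_of.items():
            if other < cell and other_rank > rank:
                raise ValueError("a sub-cell has a higher rank than the new cell")
            if cell < other and rank > other_rank:
                raise ValueError("the new cell outranks a cell that contains it")
        self.rank_of[cell] = rank
        return cell

    def cells(self, rank):
        return [c for c, r in self.rank_of.items() if r == rank]

    def incidence(self, low, high):
        """All pairs (x, y) with rank(x) == low, rank(y) == high and x contained in y."""
        return [(x, y) for x in self.cells(low) for y in self.cells(high) if x < y]


# Hypothetical scene, loosely following the example entities in the text.
cc = CombinatorialComplex()
for name in ["surgeon", "robot", "saw", "patient", "nurse", "monitor"]:
    cc.add_cell([name], rank=0)          # rank-0: atomic entities
cc.add_cell(["surgeon", "robot"], rank=1)  # rank-1: pairwise interactions
cc.add_cell(["robot", "saw"], rank=1)
cc.add_cell(["saw", "patient"], rank=1)
cc.add_cell(["nurse", "monitor"], rank=1)
group = cc.add_cell(["surgeon", "robot", "saw", "patient"], rank=2)  # rank-2: group cell

edges_in_group = cc.incidence(1, 2)    # pairwise cells on the boundary of the group
entities_in_group = cc.incidence(0, 2)
```

In this toy scene the nurse-monitor edge stays outside the rank-2 cell, while the three surgical edges are all incident to it, which is the kind of grouping a plain pairwise graph cannot express directly.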

B. Higher-Order Attention Network (HAT)

To reason over this structure, TopoOR employs a Higher-Order Attention Network (HAT):

Mechanism: Unlike standard Graph Attention Networks (GAT), which pass messages only between nodes that share an edge, HAT distributes and aggregates messages across the incidence structure of the complex.
Information Flow:
- Boundary cells (lower rank) propagate entity features upward.
- Co-boundary cells (higher rank) distribute aggregated group context downward.
Rank-Pair Bias: A learnable bias term ($b_{rk(y), rk(x)}$) is added to the attention mechanism. This ensures the model respects the topological relationship between source and target cells, preserving the structural origin of features (e.g., distinguishing between human kinematics and aggregated group behavior).
Multimodal Integration: The network processes heterogeneous data (3D poses, audio, logs) without collapsing them into a single latent space, maintaining their distinct geometric properties.
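
To illustrate how a rank-pair bias can steer attention, here is a minimal sketch in plain Python. It is not the paper's HAT layer: the function name `attend`, the scaled dot-product score, the absence of learned query/key/value projections, and the assumption that all cells share one feature dimension are simplifications introduced here. Only the idea of adding a bias indexed by (source rank, target rank) to the attention score comes from the description above.

```python
import math


def attend(target_feat, target_rank, neighbors, rank_bias):
    """Attention of one target cell over its incident cells.

    neighbors: list of (feature_vector, rank) for cells incident to the target.
    rank_bias: dict mapping (source_rank, target_rank) -> scalar bias (learned in practice).
    Returns (attention_weights, aggregated_message).
    """
    d = len(target_feat)
    scores = []
    for feat, rank in neighbors:
        dot = sum(a * b for a, b in zip(target_feat, feat))
        # Scaled dot-product score plus the bias for this (source rank, target rank) pair.
        scores.append(dot / math.sqrt(d) + rank_bias[(rank, target_rank)])
    # Numerically stable softmax over the incident cells.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    message = [
        sum(w * feat[i] for w, (feat, _) in zip(weights, neighbors))
        for i in range(d)
    ]
    return weights, message


# A rank-1 target cell with two rank-0 boundary cells and one rank-2 co-boundary cell.
target = [1.0, 0.0]
incident = [([1.0, 0.0], 0), ([1.0, 0.0], 0), ([1.0, 0.0], 2)]

zero_bias = {(0, 1): 0.0, (2, 1): 0.0}
group_bias = {(0, 1): 0.0, (2, 1): 1.0}  # hypothetical learned values

w_uniform, _ = attend(target, 1, incident, zero_bias)
w_biased, msg = attend(target, 1, incident, group_bias)
```

With identical features and zero bias the three incident cells receive equal weight; raising the (2, 1) bias shifts weight toward the rank-2 group cell, so the score depends on where a message comes from in the hierarchy and not only on feature similarity.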

C. Implementation & Training

Initialization: Uses frozen perception modules (COMPOSE for 3D pose, DepthAnythingv3 for depth) to initialize 3D entities without manual annotation.
Temporal Modeling: Consecutive frames are linked via bidirectional temporal edges to form a spatio-temporal complex.
Multi-Task Learning: The model is trained end-to-end for the first two tasks below; the third is rule-based rather than learned:
1. Next Action Anticipation.
2. Robot Phase Prediction.
3. Sterility Breach Detection: Implemented via rule-based heuristics directly on the 3D topological structure (checking proximity between sterile and non-sterile entities).
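
A proximity rule of the kind described for sterility breach detection might look like the sketch below. The function name `sterility_breaches`, the 0.5 m threshold, the use of a single 3D point per entity, and the example coordinates are all hypothetical; the summary does not state the actual distances or rules used.

```python
import math


def sterility_breaches(entities, threshold_m=0.5):
    """Return sorted (non_sterile, sterile) name pairs closer than threshold_m.

    entities: dict mapping name -> ((x, y, z) position in metres, is_sterile flag).
    threshold_m: hypothetical safety distance, not a value from the paper.
    """
    sterile = {name: pos for name, (pos, flag) in entities.items() if flag}
    non_sterile = {name: pos for name, (pos, flag) in entities.items() if not flag}
    breaches = [
        (ns_name, s_name)
        for ns_name, ns_pos in non_sterile.items()
        for s_name, s_pos in sterile.items()
        if math.dist(ns_pos, s_pos) < threshold_m  # Euclidean distance in 3D
    ]
    return sorted(breaches)


# Illustrative scene: the technician stands close to the sterile field, the nurse does not.
scene = {
    "patient": ((0.0, 0.0, 1.0), True),
    "surgeon": ((0.3, 0.0, 1.0), True),
    "technician": ((0.2, 0.1, 1.0), False),
    "nurse": ((2.0, 2.0, 1.0), False),
}
found = sterility_breaches(scene)
```

Because the check operates on explicit 3D positions, every flagged pair can be traced back to a measured distance, which is the interpretability advantage the summary attributes to rule-based detection.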

3. Key Contributions

Topological Framework (TopoOR): Introduces a unified representation of surgical scenes as higher-order structures, moving beyond dyadic graphs to capture irreducible polyadic dynamics.
Higher-Order Attention (HAT): Proposes an attention mechanism that explicitly preserves manifold structure and modality-specific features through hierarchical relational attention, avoiding the "semantic bottleneck" of tokenization.
Expressivity & Subsumption: Shows that the topological representation subsumes traditional scene graphs. When forced to decode into flattened tokenized formats, TopoOR still outperforms existing baselines, which suggests it retains richer relational information.
Safety-Critical Reasoning: Enables precise, rule-based sterility breach detection by maintaining explicit 3D geometric relationships, which is crucial for patient safety.

4. Experimental Results

Experiments were conducted on the MM-OR dataset (multimodal surgical data).

Quantitative Performance (Macro F1-Score):
- Sterility Breach Detection: TopoOR achieved 76.83%, significantly outperforming the VLM-based MM2SG (55.00%) and matching other 3D-grounded methods.
- Next Action Anticipation: TopoOR achieved 41.10%, outperforming the Vanilla Transformer (34.80%) and SurgLatentGraph (37.46%).
- Robot Phase Prediction: TopoOR reached 73.53%, a state-of-the-art result compared to MM2SG (56.90%) and SurgLatentGraph (64.61%).
Ablation Studies:
- Performance improved incrementally with the addition of modalities (Object Nodes $\to$ Human Skeleton $\to$ RGB $\to$ Robot Logs $\to$ Audio $\to$ Temporality).
- Temporal edges provided the most significant boost for Robot Phase Prediction.
Graph Reduction: When the topological complex is reduced to a string-based format (simulating a scene graph), TopoOR achieved 61.30% F1 in relation prediction, outperforming the LLM baseline (52.90%), which supports the claim that it carries more relational information.
Efficiency: TopoOR (12M parameters) runs in ~59 ms per forward pass on an A40 GPU, about 3.3 times faster than the quantized 7B-parameter MM2SG (~194 ms), which makes real-time intraoperative use plausible.
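
Since the headline numbers above are macro F1 scores, a short sketch of that metric may help in reading them. The function below follows the standard definition (per-class F1, averaged with equal weight per class); the label names are illustrative and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A model that always predicts the frequent phase is 75% accurate here
# but scores poorly on macro F1 because the rare phase is never found.
truth = ["cut", "cut", "cut", "calibrate"]
preds = ["cut", "cut", "cut", "cut"]
score = macro_f1(truth, preds)
accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
```

Macro averaging gives rare classes the same weight as common ones, which matters for surgical phases and breach events that occur infrequently.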

5. Significance

Paradigm Shift: TopoOR challenges the dominance of pairwise graphs and tokenized sequences in surgical data science, arguing that surgical procedures are inherently polyadic (multi-way) and geometric.
Clinical Utility: By preserving the metric and topological structure of the OR, the model supports tasks where geometric precision is non-negotiable (e.g., detecting sterility breaches), offering a safer alternative to "black box" VLMs.
Scalability: The framework is computationally efficient and modular, capable of integrating diverse sensors (audio, vision, kinematics) without degrading performance, paving the way for more robust, context-aware surgical assistants.

In conclusion, TopoOR demonstrates that modeling the operating room as a higher-order topological complex yields superior performance in workflow recognition and safety monitoring by respecting the intrinsic geometric and relational complexity of surgical environments.